The importance of Hard Drive Size & Speed for KNIME Server on Azure and AWS

Hi All,

I am a postgraduate student planning to use KNIME Server only on an occasional basis, when I need the processing power of an Azure machine. The rest of the time, KNIME Analytics Platform on my PC is enough for my needs. As such, my interest is to keep fixed costs as close to $0 as practically possible when I am not using the Azure server.

An Azure or AWS virtual machine uses several hard drives. Could KNIME technical experts please elaborate on the importance of hard drive size and speed for each of them? I imagine that other current and potential users face the same need, so answers from KNIME Server experts on this topic could benefit many people.

1) Non-Persistent (Temporary) Hard Drive - This is the drive that is paired with the selected Azure virtual machine, and no costs are incurred when the virtual machine is not in use. Am I right to assume that it is important for this drive to be large and fast (SSD), as this is what KNIME Server uses for temporary files related to processing large datasets (RAM swap)?

2) Operating System Hard Drive - While Azure provisions a 128 GB drive for hosting the KNIME Server page file, I am wondering whether such a large drive is truly needed, when in fact the OS only uses a small portion of it. Other than allowing for future versions of KNIME Server to grow larger, can KNIME explain why 128 GB was selected as the default, and what the rest of the space could be used for? Are workflows and datasets sent to the server stored on this drive or on the data drive? I would imagine the latter, no? If this drive hosts the operating system only, and the data and swap files reside on other drives, is high speed really needed here, or is a standard HDD (as opposed to an SSD) enough? Sure, it may take a bit longer to load the image, but once it is loaded, I imagine that hard drive speed wouldn’t have much of an impact, no? Feel free to correct me.

3) Data Hard Drive - Am I right to assume that the data hard drive is where KNIME Server stores all the jobs sent to the server and all the datasets attached to workflows? The data drive size would therefore depend on how many workflows and datasets one wants to keep on the server long term. If one is only using the server to process workflows that require a lot of computing power, and one is not interested in storing things on the server once a workflow has been processed, I imagine one could get by with a small drive and just clean it up regularly, no? As for speed, is a high-speed drive (SSD) needed, or is an HDD enough?

Thanks in advance for any insights into the above.

Paul

Hi Paul @witschey,

Some good questions! The answer to sizing and hardware depends on what you are doing with KNIME, because it is workflow-dependent. There is a Sizing section in the Azure documentation.

With regard to hard drive size and speed: I don’t usually work within Azure, so forgive me if my answers are not Azure-specific. The Azure Data Locations documentation does a good job of outlining which directories you should optimize and back up. If you haven’t checked it out, please do, and let me know if any specific questions are left unanswered.

Generally speaking, hard drives usually aren’t the limiting factor when it comes to performance (unless the IOPS are really low or you run out of disk space). The nice thing about being in the cloud is that you can almost always swap these out for better disks. In my experience, limited memory is usually the culprit behind poor performance.
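If you want a quick, rough way to see whether memory or disk I/O is under pressure on a running VM, a small script like the sketch below can help. It assumes the third-party psutil package is installed (pip install psutil); for anything serious, the cloud provider’s monitoring (Azure Monitor, CloudWatch) is the better tool.

```python
# Rough sketch: is the VM under memory pressure, and how busy is the disk right now?
# Assumes the third-party psutil package is installed (pip install psutil).
import time
import psutil

# Memory snapshot.
mem = psutil.virtual_memory()
print(f"RAM: {mem.percent}% used of {mem.total / 1e9:.1f} GB "
      f"({mem.available / 1e9:.1f} GB available)")

# Sample system-wide disk I/O counters over a short interval to estimate current I/O rates.
INTERVAL = 5  # seconds
before = psutil.disk_io_counters()
time.sleep(INTERVAL)
after = psutil.disk_io_counters()

read_iops = (after.read_count - before.read_count) / INTERVAL
write_iops = (after.write_count - before.write_count) / INTERVAL
print(f"Approx. current disk I/O: {read_iops:.0f} reads/s, {write_iops:.0f} writes/s")
```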

With respect to your specific questions:
I would recommend SSDs all around; I wouldn’t recommend anything less than 5,000 IOPS.
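If you want a very rough reality check of what a particular drive can sustain, you could run something like the sketch below from a directory on that drive. This is not a proper benchmark (use fio or the IOPS figures Azure/AWS publish for the disk SKU); it is just a crude, synchronous small-write loop.

```python
# Crude sanity check: how many small synchronous (fsync'ed) writes can this drive sustain?
# Run it from a directory on the drive you want to test. Not a substitute for fio or the
# IOPS figures published for the disk SKU.
import os
import time

TEST_FILE = "iops_probe.bin"   # temporary file created on the current drive
BLOCK = os.urandom(4096)       # 4 KB blocks, a common reference size for IOPS
N_WRITES = 2000

fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
start = time.perf_counter()
for _ in range(N_WRITES):
    os.write(fd, BLOCK)
    os.fsync(fd)               # force each block to hit the disk before continuing
elapsed = time.perf_counter() - start
os.close(fd)
os.remove(TEST_FILE)

print(f"~{N_WRITES / elapsed:.0f} synchronous 4 KB writes per second")
```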

The size of the hard drives can vary depending on your workload. Tomcat itself takes ~30 GB, but you do want to make sure the disk is large enough to handle anything extra; 50 GB for the Tomcat drive is usually enough.

The Data Hard Drive is the most important. It stores all of your workflows, jobs, and other important artifacts, so this is the one you want to bulk up the most where possible. You can reduce the amount of disk space required by discarding jobs. 250 GB is usually a good start, and you can scale up or down depending on your workload.
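To make “discard jobs and clean up regularly” a bit more concrete, here is a minimal monitoring sketch using only the Python standard library. The mount point and the 80% threshold are placeholders I made up; the real location of the workflow repository and job data depends on your installation, so check the Data Locations documentation for the actual paths.

```python
# Minimal sketch: warn when the data drive is filling up and list its largest
# top-level folders as a hint for what to clean up. DATA_MOUNT is a placeholder;
# point it at the actual data drive / workflow repository mount on your server.
import shutil
from pathlib import Path

DATA_MOUNT = Path("/srv/knime_server")  # placeholder path, adjust to your installation
WARN_FRACTION = 0.80                    # warn above 80% usage (arbitrary threshold)

usage = shutil.disk_usage(DATA_MOUNT)
used_fraction = usage.used / usage.total
print(f"Data drive: {used_fraction:.0%} used "
      f"({usage.free / 1e9:.1f} GB free of {usage.total / 1e9:.1f} GB)")
if used_fraction > WARN_FRACTION:
    print("Warning: consider discarding old jobs or enlarging the disk.")

def dir_size(path: Path) -> int:
    """Total size in bytes of all files under `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

# Largest immediate subdirectories, biggest first.
sizes = [(dir_size(p), p) for p in DATA_MOUNT.iterdir() if p.is_dir()]
for size, sub in sorted(sizes, key=lambda t: t[0], reverse=True)[:5]:
    print(f"{sub.name}: {size / 1e9:.1f} GB")
```

You could schedule something like this with cron, though at that point the built-in disk alerts from your cloud provider are probably the simpler option.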

Let me know if you have any other questions.

Regards,
Wali Khan
