In a desperate attempt to overcome my current computational resource limitations, I ran a cloud experiment a few days ago using the KNIME AWS cloud offering and compared the running times to those on my machine.
My machine has 32 GB of RAM, an 8-core CPU (AMD 3.7 GHz x2700) and SSD type 2 disks.
As the workflow I tested relies heavily on parallel chunk loops, I expected more threads and RAM to result in faster processing. I tested the workflow on AWS using different machines (r4.8xlarge, z1x12 large, c5.12xlarge), some with slower CPU clocks than mine, but all with (a lot) more threads and RAM, and all using only SSD storage.
For each test, I (1) ensured that the heap size was appropriate and matched the RAM available on each server, (2) configured the number of threads in KNIME to match the server, and (3) made several runs per machine, changing the max cells per table in knime.ini, etc.
I was surprised to find that all tests had longer running times than on my machine, except the one on the z1x12 large server, which was comparable. I thus came to the counter-intuitive conclusion that more threads and RAM do not improve the running time of my workflow.
Has anybody else experienced the same disillusionment?
Can anyone provide me with an explanation?
Thanks in advance for your help.
Did you by any chance check ‘Use automatic chunk count’?
There seems to be a problem with the automatic detection of the number of available CPUs on AWS. See also the thread “Azure VM - KNIME Server” for a deeper explanation.
The general rule we use to calculate the number of threads is:
1.5 * #available cores (rounded up)
where a core with hyperthreading counts as 2 cores.
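As a quick way to check what this rule gives for a particular machine, here is a minimal Python sketch. The function name `knime_thread_count` and its parameters are made up for illustration; only the rule itself (1.5 times the available cores, rounded up, with a hyperthreaded core counted twice) comes from the post above.

```python
import math


def knime_thread_count(physical_cores: int, hyperthreading: bool = False) -> int:
    """Apply the rule quoted above: 1.5 * available cores, rounded up.

    A core with hyperthreading counts as 2 available cores.
    The function name and signature are hypothetical, not a KNIME API.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    available = physical_cores * (2 if hyperthreading else 1)
    return math.ceil(1.5 * available)


if __name__ == "__main__":
    # 8 physical cores with hyperthreading -> 16 available cores -> 24 threads
    print(knime_thread_count(8, hyperthreading=True))
    # 8 physical cores without hyperthreading -> 12 threads
    print(knime_thread_count(8))
```

Comparing this number with what KNIME actually picks on a cloud instance is one way to spot a wrong automatic CPU detection.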
I have just reviewed my manual notes, and I did log the difference between automatic and manual chunk definition, along with other variables!
I saw minor improvements between runs on the same machine with manual chunk definition compared to automatic. I also logged and confirmed that the runs with automatic chunk count were creating a large number of parallel legs. Thus, I did not experience the problem mentioned in the thread you linked.
However, in no case did the manual definition of chunk numbers make the running time on such large cloud instances faster than on my home computer.
I am, of course, still interested in knowing if others have faced the same problem.
However, from the KNIME team’s perspective, I think it is crucial to solve this mystery: the relevance of a KNIME cloud offering is badly damaged if improvements in running time are impossible to achieve by moving to the cloud.
It would also be great to know if others have managed to achieve substantial running time improvements with parallel chunk workflows by moving to the cloud. It would at least confine the problem to my specific workflow rather than the KNIME cloud offering altogether.
All help is appreciated,
I recently discussed this with another KNIME user in the thread “Impact of different OS on Knime workflows”. It may be worth reading through.
As a brief recap: overdoing it with too much parallelism will create a lot of bottlenecks. It is less about cores, threads and memory and more about IOPS and throughput.
Each KNIME node caches its data in the workflow directory. This can easily grow to several dozen GB. Using Don’t Save Start / End nodes might help. You might also configure KNIME in the ini file to make better use of memory and to save data uncompressed.
Dividing very complex workflows into smaller ones, calling each one in combination with a garbage collector, or making use of streaming execution are other good ways to manage system resources.
My personal remark: optimize your workflows. I frequently go through five or more iterations to arrive at new and interesting approaches. For example, pivoting, joining or multi-rule evaluations are computationally very demanding. Unpivoting greatly decreased complexity. Always ask yourself “How can I divide the problem into even smaller pieces?”
It looks like you’ve tried variants of CPU and memory configurations. Another dimension to test is disk I/O. By default, an EBS volume provides 100 IOPS (I/O operations per second). That’s approximately equivalent to a single 7200 RPM SATA drive. The SSD drives on your local machine are much faster. To get an EBS volume equivalent to your SSD drives, you’d need an EBS volume with at least 5000 IOPS. Here’s an article that provides more detail: https://www.datadoghq.com/blog/aws-ebs-provisioned-iops-getting-optimal-performance/
I suggest you start an EC2 instance using one of the instance types you’ve tested already, but this time increase the IOPS on the disk configuration page/tab.
If you do this experiment, please let me know the results. I’m interested to hear if it helped in your use case.
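To compare disks before and after such a change, a rough random-read test can be run on both the local machine and the EC2 instance. The following is only a sketch under stated assumptions: the function name `measure_random_read_iops` and its default sizes are hypothetical, and because the operating system's page cache serves many of the reads from RAM, the result is an optimistic figure rather than a true device IOPS value. A dedicated benchmarking tool such as fio is the better choice for serious measurements.

```python
import os
import random
import tempfile
import time


def measure_random_read_iops(directory: str, file_mb: int = 64,
                             block_size: int = 4096, reads: int = 2000) -> float:
    """Return a rough random-read rate (reads per second) for `directory`.

    Hypothetical helper for illustration. The OS page cache is NOT bypassed,
    so treat the result as an optimistic upper bound.
    """
    size = file_mb * 1024 * 1024
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Write a scratch file of the requested size and force it to disk.
        with os.fdopen(fd, "wb") as f:
            chunk = os.urandom(1024 * 1024)
            for _ in range(file_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())

        # Read randomly chosen, block-aligned positions and time the loop.
        blocks = size // block_size
        with open(path, "rb", buffering=0) as f:
            start = time.perf_counter()
            for _ in range(reads):
                f.seek(random.randrange(blocks) * block_size)
                f.read(block_size)
            elapsed = time.perf_counter() - start
        return reads / elapsed
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Point this at the disk that holds the KNIME workspace.
    print(f"{measure_random_read_iops(tempfile.gettempdir()):.0f} reads/s")
```

Running it against the directory that holds the KNIME workspace on each machine gives a first, coarse indication of whether the disks differ as much as suspected.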
Thanks for your input.
I have just tested the same component on a large machine with multiple CPUs similar to those of the AWS instances I used before, but with NVMe SSD drives this time. The result is the same (a large loss of performance) and seems to confirm my observation on the cloud that I/O was not the bottleneck.
Anyway, thanks for your link, it was instructive.
Hi @nba,
What is your actual cloud stack? Depending on the EC2 instance type, you might have CPU credits that need to be built up too. Did you try AWS WorksStations?
From my perspective, it is starting to look like a workflow optimization task. Maybe you parallelize too aggressively? Also, did you check the EBS / EC2 dashboards?
In case you have an AWS business support plan, which only adds 20 % or so to the monthly expenses, AWS support is pretty helpful. In contrast, AWS forum support is a black hole …
I have never run KNIME on AWS, but yes, I agree that I/O could be the issue.
Another point, however, is the parallel chunk loop. I stopped using it years ago as it never seemed to really help. When the loops are set up there is a ton of copying going on (I/O heavy!!), and with lots of data that was often slower than just running it serially. Maybe it has improved, but I wouldn’t be very confident that that node is doing what you think it does. Plus, many nodes that can be parallelized have been in recent years. I would report any node that isn’t but should be to KNIME or the plugin provider.
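To make the copying argument concrete, here is a toy model in Python. Everything in it is an assumption for illustration, not measured KNIME behaviour: it supposes a fixed setup (copy) cost per chunk that is serialized on the disk, and a compute part that divides evenly across cores. The function name and the numbers are made up.

```python
def estimated_runtime(chunks: int, cores: int, compute_s: float,
                      copy_s_per_chunk: float) -> float:
    """Toy estimate of total runtime in seconds.

    Assumptions (illustrative only, not KNIME internals):
    - each chunk costs `copy_s_per_chunk` seconds of setup copying,
      and the copies do not overlap because they share one disk;
    - the compute work `compute_s` divides evenly over the chunks,
      limited by the number of cores.
    """
    if chunks < 1 or cores < 1:
        raise ValueError("chunks and cores must be at least 1")
    setup = copy_s_per_chunk * chunks
    parallel = compute_s / min(chunks, cores)
    return setup + parallel


if __name__ == "__main__":
    # Hypothetical workload: 600 s of compute, 20 s of copying per chunk.
    for chunks in (1, 4, 8, 16, 48):
        total = estimated_runtime(chunks, cores=48, compute_s=600,
                                  copy_s_per_chunk=20)
        print(f"{chunks:>2} chunks: {total:.1f} s")
```

With these made-up numbers the model is fastest at a handful of chunks, and 48 chunks on 48 cores (972.5 s) comes out slower than a single serial run (620 s), which is the pattern described above. Whether a real workflow behaves this way depends on how large the per-chunk overhead actually is.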
Depending on what the workflow does, streaming might help much more than the parallel chunk loop does, and yes, it will save on I/O as well.