With a 60 x 20,000 table, transferring data between Python and R scripts is very slow.
Also, some built-in nodes are slow (e.g. Linear Correlation), and once they complete, saving the workflow will also hang or otherwise fail.
I have set the Java heap to 32 GB (half of physical RAM), which should be plenty given that the correlation table should only be about 2.7 GB even if stored as a full matrix.
I'm wondering if this is typical, or is something not right with my setup?
For R and python:
The table data is serialized to disk, which means you will be limited by disk performance. The only things you can do to improve this are: get an SSD, get a faster SSD, or use a RAM disk. For doing HPC-style work with KNIME, I'd say having a fast disk is very important; an HDD will kill your performance here.
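To get a rough feel for how much of the wall time is just disk round trips, you can time a write-and-read of a similarly sized table outside KNIME. This is only a ballpark sketch (KNIME's own serializer is not CSV, and the table size here is just the one mentioned above):

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Build a table roughly the size discussed (20,000 rows x 60 columns of doubles).
df = pd.DataFrame(np.random.rand(20_000, 60))

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as f:
    path = f.name

start = time.perf_counter()
df.to_csv(path, index=False)   # write to disk
pd.read_csv(path)              # read it back
elapsed = time.perf_counter() - start

print(f"round trip: {elapsed:.2f} s for {os.path.getsize(path) / 1e6:.1f} MB")
os.remove(path)
```

If this already takes seconds on your machine for a single round trip, the repeated serialization inside a workflow will add up quickly.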
Linear Correlation is slow due to its time complexity. With 2 columns you need to do 1 calculation per row; with 3 columns you need 3 calculations per row; with 4 columns it's 6 calculations, and so forth. So complexity rises quadratically with the number of columns (n columns give n(n-1)/2 pairs) and linearly with the number of rows. In this case that means 1770 possible column combinations. So it depends on what "slow" means, but it will certainly never be very fast and will mostly depend on your CPU.
Still, I think that node could profit from multi-threading (many nodes could). Each column pair could run in a separate thread. The downside is that in easy cases this would be slightly slower due to the overhead, but there is certainly potential here to improve performance.
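As a sketch of the idea (not KNIME's implementation, just a plain-Python illustration), each pair can be handed to a thread pool. Note that NumPy releases the GIL for many operations, but for tiny per-pair tasks the thread overhead can dominate, which is exactly the "easy cases" caveat above:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

import numpy as np


def pairwise_correlations(data: np.ndarray, max_workers: int = 4) -> dict:
    """Pearson correlation for every column pair, one pair per task."""
    def corr(pair):
        i, j = pair
        # np.corrcoef returns a 2x2 matrix; the off-diagonal entry
        # is the correlation between the two columns.
        return pair, float(np.corrcoef(data[:, i], data[:, j])[0, 1])

    pairs = combinations(range(data.shape[1]), 2)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(corr, pairs))


rng = np.random.default_rng(0)
table = rng.random((1_000, 10))
result = pairwise_correlations(table)
print(len(result))  # 10 choose 2 = 45 pairs
```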
Another option is to reduce the number of columns before the Linear Correlation node if possible, for example with the Low Variance Filter node, in case the background here is machine learning.
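The idea behind that filter is simple to reproduce in a Python Script node as well: drop columns whose variance is below some threshold before correlating. A minimal sketch (the threshold value here is just an illustrative choice):

```python
import numpy as np
import pandas as pd


def drop_low_variance(df: pd.DataFrame, threshold: float = 1e-4) -> pd.DataFrame:
    """Keep only numeric columns whose variance exceeds the threshold."""
    variances = df.var(numeric_only=True)
    keep = variances[variances > threshold].index
    return df[keep]


df = pd.DataFrame({
    "constant": [1.0] * 100,  # zero variance -> carries no correlation signal
    "noisy": np.random.default_rng(1).random(100),
})
print(drop_low_variance(df).columns.tolist())  # ['noisy']
```

Every column dropped here removes a whole row and column of pairs from the quadratic workload.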
I’ve started seeing very slow python scripting again… table is 800 x 25000.
Stand-alone Python on the same data takes ~600 seconds, but the KNIME script is taking hours.
KNIME 3.5.3, Win 10 + 64 GB RAM.
Temp file directory is already set to SSD (300 GB free).
Just to be sure: 800 is columns and 25,000 is rows?
The difference might be that in KNIME everything is serialized back and forth, which is simply slow. Plus it uses pandas; maybe using pure NumPy in Python is faster than pandas.
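If the script itself is computing correlations, it's easy to time pandas against NumPy on a stand-in table and see which is faster on your machine. This sketch uses 100 columns instead of 800 to keep the run short; both calls compute the same Pearson correlation matrix:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
arr = rng.random((25_000, 100))  # smaller stand-in for the 25,000 x 800 table
df = pd.DataFrame(arr)

t0 = time.perf_counter()
pandas_corr = df.corr()
t_pandas = time.perf_counter() - t0

t0 = time.perf_counter()
numpy_corr = np.corrcoef(arr, rowvar=False)  # columns as variables
t_numpy = time.perf_counter() - t0

print(f"pandas: {t_pandas:.3f}s  numpy: {t_numpy:.3f}s")
```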
One option you could try is to change the serialization to Apache Arrow, for which you will need to install pyarrow in your KNIME Python environment. Still, you will never completely avoid the serialization penalty compared to using Python directly. And again, linear correlation on high-dimensional data is simply slow.