My question is basically the title. I’ve noticed that for large data sets, executing my R code in KNIME takes significantly longer than running it in RStudio. I’ve set the number of rows to send to R per batch to 1,000,000 and set my Rserve buffer size limit to 0 (unlimited), but I’m wondering if there is anything else I can do to close the execution gap.
Could you measure the time your script itself requires to execute? I would expect the execution time to stay more or less the same; however, since KNIME has to transfer the data to R and read it back afterwards, there is a non-negligible overhead, and therefore the total execution time of the node is higher. The larger the data you need to pass around, the bigger the time difference.
You could try using data.table instead of a data.frame to process the data. That option is marked “experimental” in the R node, but it can be more memory efficient.
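As a rough sketch of what that could look like inside an R Snippet node (KNIME exposes the input table as `knime.in` and expects the result in `knime.out`; the columns and the small stand-in table here are made up for illustration):

```r
library(data.table)

# Stand-in for KNIME's input table; in a real R Snippet node, knime.in
# is provided by KNIME and this line would be omitted.
knime.in <- data.frame(id = 1:4,
                       price = c(10, 50, 80, 5),
                       quantity = c(2, 3, 2, 1))

dt <- as.data.table(knime.in)
dt[, total := price * quantity]          # add a column by reference, no copy
result <- dt[total > 100, .(id, total)]  # filter + column selection in one pass

knime.out <- as.data.frame(result)       # hand the result back to KNIME
```

The main win is that data.table modifies columns in place rather than copying the whole table, which matters once the table no longer fits comfortably in memory.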
Then you could try storing the data as Parquet and reading it back in within the R node, although I am not sure this will help in your case.
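A minimal sketch of the Parquet round trip, assuming the `arrow` R package is installed (the data frame and temporary path are placeholders):

```r
library(arrow)

# Placeholder table; in practice this would be your large data set.
df <- data.frame(id = 1:3, value = c(1.5, 2.5, 3.5))
path <- file.path(tempdir(), "data.parquet")

write_parquet(df, path)     # columnar, compressed on-disk format
back <- read_parquet(path)  # fast columnar read back into R
```

Parquet is columnar and compressed, so it is usually much smaller and faster to read than CSV, but whether it beats KNIME's own transfer mechanism in your setup would need measuring.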
Then you could reduce the columns to only those you need, if you have not already done so.
I have also had processes in R where the data was far too large, so I used a loop to work in chunks (if your task allows that). It took longer but was more stable in the end. You could try combining that approach with telling KNIME to keep everything for the R node in memory, though again I am not sure this will have the desired effect.
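The chunked loop could be sketched like this (the table, chunk size, and per-chunk computation are all hypothetical; the pattern is the point):

```r
# Process a large table in fixed-size chunks instead of all at once.
df <- data.frame(x = runif(1e5))  # stand-in for a large input table
chunk_size <- 2e4
n <- nrow(df)
starts <- seq(1, n, by = chunk_size)

results <- lapply(starts, function(s) {
  chunk <- df[s:min(s + chunk_size - 1, n), , drop = FALSE]
  data.frame(mean_x = mean(chunk$x))  # hypothetical per-chunk computation
})
out <- do.call(rbind, results)  # combine the per-chunk results
```

Only one chunk plus the accumulated results needs to be resident at a time, which trades speed for a smaller and more predictable memory footprint.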
Another idea is to use a Cache node right before the R node; I have some jobs on KNIME Server where this increased stability. I have tried combinations where the upstream processing forces KNIME to work in memory and the Cache node then holds the data the standard way (a combination of memory and storage on disk).
And well, a fast SSD and more memory never hurt. Do you have a Windows, Mac, or Linux machine?
When going to R or Python, all the data needs to be serialized, which means written to a shareable file format and then read in again by R, and vice versa. This is extremely slow because it involves I/O, and the only current solution is to do as much as possible inside KNIME.
To improve the situation, there is already my request to have a Java Snippet node that allows random access to the full table, just like the R and Python nodes do. This would allow you to remain in KNIME (the JVM) for calculations or transformations that need access to the full table.
A second solution, which honestly I’m not 100% sure is possible, would be much more involved and would require completely changing KNIME’s core, namely how tables are stored in files and in memory. As a “naive” person not knowing the intricate problems, I would see the ideal solution as storing the tables off-heap in the Apache Arrow format. Why? Because then, theoretically, no serialization would be needed, as Python or R nodes could read and write directly from the exact same data as KNIME does (in fact, that is the core problem Apache Arrow seeks to solve: zero-copy inter-process communication).
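To illustrate the zero-copy idea on a small scale, a sketch assuming the `arrow` R package: a table written in the Arrow IPC (Feather v2) file format can be memory-mapped by any Arrow-aware process (R, Python, Java, ...) without re-serializing the data. The table and path here are placeholders:

```r
library(arrow)

# Placeholder table written once in Arrow IPC (Feather v2) format.
df <- data.frame(id = 1:3, value = c(10, 20, 30))
path <- file.path(tempdir(), "table.arrow")
write_feather(df, path)

# read_feather can memory-map the file, so another process could open the
# same bytes directly instead of parsing a serialized copy.
tbl <- read_feather(path, mmap = TRUE)
```

This is only a toy demonstration of the format, not of KNIME adopting it; the point of the proposal is that KNIME's own table storage would need to use this representation for the copy to disappear.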
@Mark_Ortmann I will measure and get back to you. On a related note, it looks like the total number of rows you can send to R per batch is capped at 1,000,000. Is this intentional?
You can enter a higher value, but after you click Apply/OK it resets back to 1,000,000. I think this is a problem for larger data sets, and it especially doesn’t make sense if you have enough memory to handle more rows.