Small performance hack for R nodes

As I understand it, the input dataset for an R node is read from a CSV file generated by the previous node. This is somewhat inefficient, because data frames in R are stored column-wise, whereas the other nodes in KNIME handle data row-wise.

There is some room for efficiency improvements if you change this scheme slightly. Instead of a single CSV file, you can generate n text files (n being the number of columns in the CSV file), each containing a single column.

Then, instead of submitting a single read.table statement to read the whole CSV file, you can get the same data frame by doing something like

do.call( cbind, lapply( dir(), function(x) read.table( x, header = TRUE ) ) )

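For this to work, R's working directory must be set to a directory that contains only the single-column files. Note that dir() returns file names in alphabetical order, so the files should be named so that they sort in the original column order.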
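To make the idea concrete, here is a minimal round-trip sketch in base R. The helper names write_columns and read_columns, the zero-padded file names and the tiny example data frame are all made up for illustration; in KNIME the per-column files would be written by the preceding node, not by R. The sketch passes full paths to read.table, so it does not need to change the working directory.

```r
# Hypothetical helpers: write each column of a data frame to its own file,
# then rebuild the data frame by reading the files back one by one.

write_columns <- function(df, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_along(df)) {
    # Zero-padded index so that alphabetical order matches column order.
    path <- file.path(out_dir, sprintf("%04d.txt", i))
    # df[i] is a one-column data frame, so the header line is kept.
    write.table(df[i], path, row.names = FALSE, quote = TRUE)
  }
  invisible(out_dir)
}

read_columns <- function(in_dir) {
  files <- sort(dir(in_dir, full.names = TRUE))
  cols <- lapply(files, function(f) {
    read.table(f, header = TRUE, stringsAsFactors = FALSE)
  })
  # Every element is a data frame, so cbind returns a data frame.
  do.call(cbind, cols)
}

# Made-up example data, only to exercise the two helpers.
original <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  score = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

col_dir <- tempfile("columns_")
write_columns(original, col_dir)
restored <- read_columns(col_dir)
str(restored)
```

Using lapply rather than sapply matters here: sapply would simplify the one-column data frames into plain vectors, and cbind would then return a matrix with a single type for all columns instead of a data frame.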

I did some tests based on the dataset at

http://www.cs.utexas.edu/users/pstone/Workshops/2004icml/GenderTrainingSet.zip

and I consistently got speed gains of 20-25%.
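If you want to try the comparison yourself, something along these lines should do. This is only a rough sketch: it uses synthetic numeric data rather than the dataset linked above, the sizes n_rows and n_cols are arbitrary, and the timings will depend on the machine, the R version and the column types, so it is not meant to reproduce the figures I quoted.

```r
# Rough timing harness: one CSV file versus one file per column.
# Synthetic data and arbitrary sizes, chosen only for illustration.
set.seed(1)
n_rows <- 20000
n_cols <- 20
df <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))

work <- tempfile("bench_")
dir.create(work)

# Layout 1: a single CSV file with all columns.
single <- file.path(work, "all.csv")
write.table(df, single, sep = ",", row.names = FALSE)

# Layout 2: one text file per column, named so they sort in column order.
col_dir <- file.path(work, "cols")
dir.create(col_dir)
for (i in seq_along(df)) {
  write.table(df[i], file.path(col_dir, sprintf("%04d.txt", i)),
              row.names = FALSE)
}

t_single <- system.time(
  a <- read.table(single, header = TRUE, sep = ",")
)
t_cols <- system.time(
  b <- do.call(cbind, lapply(sort(dir(col_dir, full.names = TRUE)),
                             read.table, header = TRUE))
)

# Elapsed wall-clock seconds for each layout.
print(c(single = t_single[["elapsed"]], columns = t_cols[["elapsed"]]))
```

A single run is noisy, so it is worth repeating the two reads several times before drawing conclusions.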

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com