Python Learner and Predictor Node: Issues and Memory consumption

I'm having trouble using the Python Learner and Predictor nodes. (The issues might also apply to other Python nodes; I have not tried those yet.)

1. Is there an easy way to access a fingerprint column within the Python nodes?

It seems fingerprints are not passed to Python as bit arrays or, more generally, as arrays of 0s and 1s. Instead they arrive as strings.

What works is

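import numpy as np

featureList = []
for fpString in input_table['Fingerprint'].values:
	# each fingerprint arrives as a string of '0'/'1' characters
	fp = np.array(list(map(int, fpString)))
	featureList.append(fp)
features = np.array(featureList)
#..., labels)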
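The same conversion can be sketched without the per-character int() calls. This is only a minimal example, assuming every fingerprint is an equal-length ASCII string of '0' and '1' characters; the helper name bitstrings_to_array and the sample list are made up for illustration and stand in for input_table['Fingerprint'].values.

```python
import numpy as np

def bitstrings_to_array(bit_strings):
    """Turn equal-length strings of '0'/'1' into a 2D uint8 array."""
    # The ASCII codes of '0' and '1' are 48 and 49, so subtracting
    # ord('0') maps each byte to 0 or 1.
    rows = [np.frombuffer(s.encode("ascii"), dtype=np.uint8) - ord("0")
            for s in bit_strings]
    return np.vstack(rows)

# Hypothetical stand-in for input_table['Fingerprint'].values
example = ["0101", "1100", "0011"]
features = bitstrings_to_array(example)
print(features.shape, features.dtype)
```

Storing the bits as uint8 rather than the default integer type also keeps the feature matrix small, which may matter given the memory problems described below.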

Another option is to pass in SMILES and calculate the fingerprint using RDKit within the learner node. But this isn't ideal either.

Also, when the input table is simply copied to the output, the fingerprint column comes back as a string column.


While writing this I thought of a third possibility: using the Expand Bit Vector node before the Python nodes. This is probably the best approach. Ideally there would be automatic conversion between the Java bit vector and a Python numpy array.

2. Memory consumption

Memory consumption is huge. See the attached image from VisualVM. And this workflow has just 29 rows. With more data I can't use it, as KNIME always crashes (it can't use more RAM on 32-bit Windows). Still, I think there might be a problem with these nodes, as running the same model in stand-alone Python is not a problem: it takes about 160 MB of RAM for the workflow that fails in KNIME. In the attached image you can see that memory usage jumps up by 700 MB when the Python Predictor node is executed. This is also noticeable because KNIME gets really slow, saving takes forever, or it may crash. Note that the WEKA nodes do suffer from the same issue. Maybe the problem is the large number of columns? Regardless, I think there is a general problem with memory usage.

It would be great if scikit-learn could be used from within KNIME, but with these current limitations it is not really possible, and it is less hassle to go pure Python.



Any insight on this?

After further testing, also on a 64-bit machine with 8 GB of RAM, my conclusion is that the Python Predictor node is the main issue. In my case I'm using the gradient boosting regressor, and the number of trees has a huge impact on the memory consumption of the Predictor node. With 300 trees (estimators) it barely runs through while using 6 GB of RAM.

Also, most of the work is not the actual prediction but probably loading or exporting (?) the model. The progress bar shows no percentage and there is no Python process running, but KNIME uses lots of CPU. Then for a very short period there actually is a Python process. My conclusion is that the export/loading of the model could be improved in terms of memory and CPU performance.
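One way to check whether the model itself grows with the number of trees is to measure its serialized size in stand-alone Python. This is a rough sketch, assuming the model is handed between the nodes in pickled form (I have not verified how KNIME does this internally); pickled_size_mb is a hypothetical helper and the two lists are only placeholders for fitted models.

```python
import pickle

def pickled_size_mb(obj):
    """Return the size of obj's pickle in megabytes."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(data) / (1024 * 1024)

# Placeholder objects; replace them with fitted models to compare,
# e.g. one trained with few estimators and one with many.
small = list(range(1_000))
large = list(range(100_000))
print(pickled_size_mb(small), pickled_size_mb(large))
```

With scikit-learn, the placeholders could be replaced by two fitted gradient boosting regressors that differ only in n_estimators (for example 50 and 300). If the pickled sizes stay in the range of a few MB, the multi-GB usage is more likely to come from the transfer between KNIME and Python than from the model itself.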