I'm having trouble using the Python Learner and Predictor nodes. (The issues might also apply to other Python nodes; I have not tried those yet.)
1. Is there an easy way to access a fingerprint column within the Python nodes?
It seems they are not passed to Python as bit arrays or, more generally, as arrays of 0s and 1s.
What works is:

```python
import numpy as np

# Convert each fingerprint bit string (e.g. "0101...") into a numeric array
featureList = []
for fpString in input_table['Fingerprint'].values:
    fp = np.array(list(map(int, fpString)))  # list() is needed in Python 3, where map is lazy
    featureList.append(fp)
features = np.array(featureList)
# ...
output_model.fit(features, labels)
```
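The loop can also be written more compactly, assuming the column holds fixed-length strings of '0'/'1' characters (a generic sketch, not KNIME-specific; the example strings are made up):

```python
import numpy as np

# Hypothetical stand-in for input_table['Fingerprint'].values
fp_strings = ["0101", "1100", "0011"]

# Each string becomes a row of single characters; astype parses '0'/'1' as integers
features = np.array([list(s) for s in fp_strings]).astype(np.uint8)
print(features.shape)  # (3, 4)
```

Using uint8 instead of the default int/float dtype also keeps the feature matrix small.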
Another option is to pass in SMILES and calculate the fingerprint with RDKit within the learner node, but this isn't ideal either.
Also, when simply copying the input table to the output, the fingerprint column comes back as a string column.
EDIT:
While writing this I found a third possibility: using the Expand Bit Vector node prior to the Python nodes. This is probably the best approach. Ideal would be an automatic conversion between the Java bit vector and a Python numpy array.
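With the Expand Bit Vector route, each bit arrives as its own column, so the feature matrix can be rebuilt with a simple column filter inside the Python node. A sketch assuming the expanded columns share a common name prefix (the actual prefix and the sample data here are assumptions; check the node's output column names):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for input_table after Expand Bit Vector:
# one column per bit, plus a label column
input_table = pd.DataFrame({
    "Fingerprint_0": [0, 1, 0],
    "Fingerprint_1": [1, 1, 0],
    "Fingerprint_2": [0, 0, 1],
    "Activity": ["a", "b", "a"],
})

# Select only the expanded bit columns and stack them into one 2-D array
bit_cols = [c for c in input_table.columns if c.startswith("Fingerprint_")]
features = input_table[bit_cols].values.astype(np.uint8)
print(features.shape)  # (3, 3)
```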
2. Memory consumption
It is huge; see the attached image from VisualVM, and this workflow has just 29 rows. With more data I can't use it at all, as KNIME always crashes (I can't add more RAM on 32-bit Windows). Still, I think there might be a problem with these nodes, as running the same model in stand-alone Python is not an issue: the workflow that fails in KNIME takes about 160 MB of RAM there. In the attached image you can see that when the Python Predictor node executes, memory usage jumps by 700 MB. This is also noticeable in that KNIME gets really slow, saving takes forever, or it may crash. Note that the WEKA nodes suffer from the same issue. Maybe the problem is the many columns? Regardless, I think there is a general problem with memory usage.
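To put rough numbers on the "many columns" suspicion: the in-memory size of a dense fingerprint matrix depends heavily on the element type. A quick sketch (the 2048-bit fingerprint length is an assumption for illustration):

```python
import numpy as np

rows, bits = 29, 2048  # 2048 is a common fingerprint length, assumed here

as_float64 = np.zeros((rows, bits), dtype=np.float64)
as_uint8 = np.zeros((rows, bits), dtype=np.uint8)

print(as_float64.nbytes)  # 475136 bytes (~0.45 MB)
print(as_uint8.nbytes)    # 59392 bytes (~58 KB)
```

Even stored as float64, 29 rows of 2048 bits is well under a megabyte, so a 700 MB jump suggests per-cell overhead somewhere in the table transfer rather than the raw data itself.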
It would be great if scikit-learn could be used from within KNIME, but with these current limitations it is just not really practical, and it's less hassle to go pure Python.