My cheminformatics workflow performs a substructure count on all molecules in a training set, then loops over the groups defined by a particular nominal column, building one classification model per group from those counts. The trained ML models are saved to disk.
I now want to run the substructure search on a test set and apply the ML models to millions of molecules. Since there are a few hundred nominal classes, prediction currently takes a few minutes per molecule. What would be the best way to transform this predictive workflow into a high-throughput one?
The idea is to screen, say, 10 million molecules within a reasonable time frame: a few weeks at most, ideally days. Is that even possible in KNIME, either with free nodes or with some commercial license?
The other option is to port it to Hadoop / big data, but that is a separate project in itself. How can I convert a cheminformatics KNIME workflow into a high-throughput one?
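For concreteness, here is a minimal sketch (outside KNIME, in plain Python) of the kind of chunked, parallel scoring that makes this high throughput: split the molecule list into chunks and score each chunk in a separate worker process, so the per-molecule overhead is amortized. The `featurize` and `predict_all_classes` functions are hypothetical stand-ins; a real workflow would use an actual substructure-count fingerprint and load the serialized per-class models from disk.

```python
from multiprocessing import Pool

# Hypothetical stand-in for a substructure-count featurizer:
# count occurrences of a few hardcoded substring "substructures" in a SMILES.
SUBSTRUCTURES = ["c1ccccc1", "C(=O)O", "N"]

def featurize(smiles):
    return [smiles.count(s) for s in SUBSTRUCTURES]

# Hypothetical stand-in for the few hundred per-class models;
# a real workflow would load the saved classifiers here.
def predict_all_classes(features):
    return {f"class_{i}": sum(features) > i for i in range(3)}

def score_chunk(chunk):
    # Each worker scores a whole chunk, amortizing process startup
    # and (in a real setup) model-loading cost across many molecules.
    return [predict_all_classes(featurize(s)) for s in chunk]

def score_parallel(smiles_list, chunk_size=1000, processes=4):
    chunks = [smiles_list[i:i + chunk_size]
              for i in range(0, len(smiles_list), chunk_size)]
    with Pool(processes) as pool:
        results = pool.map(score_chunk, chunks)
    # Flatten the per-chunk result lists back into one list.
    return [row for chunk in results for row in chunk]

if __name__ == "__main__":
    mols = ["c1ccccc1O", "CC(=O)O", "CCN", "CCCC"]
    print(len(score_parallel(mols, chunk_size=2, processes=2)))
```

The same pattern maps onto KNIME's Parallel Chunk Start/End loop nodes or onto a cluster scheduler: the key design choice is loading each model once per worker rather than once per molecule.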