I successfully set up chemprop for deep learning applications within drug discovery. Now I would like to be able to use it within Knime workflows by using Python scripts.
To train a model, chemprop uses the following command:
chemprop_train --data_path --dataset_type --save_dir
So the data needs to be fed in via the --data_path parameter. Using Python code, how could I use an input table in KNIME as the data source instead?
Thanks for bringing this subject to the KNIME forum which sounds very interesting for the field of cheminformatics.
I had a look at the links you mentioned and as far as I see, the calls to this Python library are done based on command line calls (as stated in the ChemProp Tutorial Data section). In this case you are not obliged to call it through a Python script but any node allowing command-line execution would do the job.
Recently, I posted a solution based on Java (and hence with no dependency on Python) to do command-line calls which can recover the output terminal results generated by the command line:
(It was used to ping URLs but can be used for any other command-line call, as required in your case.)
Nevertheless, in your particular case you would first need to save the table you want to pass as a CSV file, and then read the results back from a CSV file once the command line has finished executing.
The solution I implemented waits until the end of the execution and tells you in the end if everything went ok with command-line outputs and errors if any as well. It could be an option for your need here.
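For anyone who would rather stay inside a Python Script node, the same round trip (write the table to CSV, call the command line, wait for it to finish) could be sketched roughly as below. This is only a minimal sketch under assumptions: the function names are hypothetical, the table is passed in as plain column names and rows, and only the three flags quoted in the question are used.

```python
import csv
import subprocess
import tempfile
from pathlib import Path


def write_rows_to_csv(columns, rows, csv_path):
    """Write a header line plus data rows to csv_path."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)


def build_train_command(data_path, dataset_type, save_dir):
    """Assemble the chemprop_train call using the flags quoted in the question."""
    return [
        "chemprop_train",
        "--data_path", str(data_path),
        "--dataset_type", dataset_type,
        "--save_dir", str(save_dir),
    ]


def train_from_rows(columns, rows, dataset_type, save_dir, run=True):
    """Dump the table to a temporary CSV and hand it to chemprop_train.

    Hypothetical helper: with run=False the command is only built, not executed.
    """
    work_dir = Path(tempfile.mkdtemp(prefix="knime_chemprop_"))
    csv_path = work_dir / "train.csv"
    write_rows_to_csv(columns, rows, csv_path)
    command = build_train_command(csv_path, dataset_type, save_dir)
    if not run:
        return command, None
    # Blocks until training has finished; stdout, stderr and the return code
    # are then available on the returned object.
    completed = subprocess.run(command, capture_output=True, text=True)
    return command, completed
```

In a Python Script node the input table is normally available as a pandas DataFrame (the exact variable name depends on the node version), so something like `train_from_rows(list(df.columns), df.values.tolist(), "regression", "model_dir")` should work, where `df`, `"regression"` and `"model_dir"` are placeholders for your own table, dataset type and output folder.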
Otherwise you could always do the same using a Python Script node, but I do not see the advantage since what you execute is not Python code but a command-line program. Other options are available, as stated by @ScottF in the same thread.
I’ll be happy to further help if needed.
Hope this already helps
OK, thanks for clarifying, I will look into this.
FYI, I had difficulties getting chemprop to work using the gpu of an RTX 3060 graphics card but I managed to solve it by installing cudatoolkit 11.3 with a nightly build of pytorch (1.11.0.dev20220130 py3.8_cuda11.3_cudnn8.2.0_0 pytorch-nightly) inside the chemprop conda environment. The built-in web interface is quite nifty as well.
There are also the External tool and External tool (Labs) nodes. They can sometimes be useful. However, they usually require that you have control over the executable being called (or, better said, its output). Therefore calling the external tool from a Java Snippet or Python Script node is probably preferable, as it gives more direct control.
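To illustrate what "more direct control" over the output can look like from a Python Script node, here is a rough sketch that reads the output of a command line by line while it is still running, which is handy for long training runs. The helper name `run_and_stream` is made up for this example, and the command is just a stand-in Python one-liner, not a chemprop call.

```python
import subprocess
import sys


def run_and_stream(command, on_line=print):
    """Run command, pass each output line to on_line as it appears, return the exit code."""
    with subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge error messages into the same stream
        text=True,
    ) as process:
        for line in process.stdout:
            on_line(line.rstrip("\n"))
        return process.wait()


# Stand-in command; replace it with the real command-line call.
lines = []
exit_code = run_and_stream(
    [sys.executable, "-c", "print('epoch 0'); print('epoch 1')"],
    on_line=lines.append,
)
```

A non-zero `exit_code` can then be turned into a node error or a flow variable, and the collected lines can be written to an output table if needed.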
Having said that, I always got better results with xgboost at a fraction of the training time. And if I left the rdkit 2d descriptors out of chemprop, it usually did a lot worse, so even the neural network seems to rely on them heavily.
If you want to do proper optimization with cross-validation in chemprop, an RTX 3060 might still be a limiting factor, assuming you have tens of thousands of rows to train on. (As a general statement, I wouldn't even bother with fewer than 5000 rows.)
I would for sure compare any result to xgboost (or equivalent) before deciding whether the resulting model is useful.