I am trying to build a conformal regression model for solubility prediction. The molecules in the data set are described by RDKit descriptors and fingerprints. I based the workflow on Figure 11 in this paper:
Everything works fine until the very end, when the Conformal Prediction Loop End node stalls and reports:
ERROR Conformal Prediction Loop End 3:1075 Execute failed: Maximum unique values number too big
I am attaching my workflow for reference. Please advise.
I took a look at your workflow, and I actually managed to run it without any issues.
During execution I noticed that RAM consumption increased a lot (up to 45 GB) while the final loop was running, and I believe this is the issue. I would therefore recommend reducing the number of iterations in the final loop, let's say from 10 to 3. Another thing you can try is to increase the maximum Java heap parameter in the knime.ini file.
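The heap limit is the -Xmx line near the end of knime.ini (in your KNIME installation folder); replace it with something like the line below, choosing a value that fits your machine's RAM:

```
-Xmx32g
```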
I would also kindly ask you to provide the full stack trace of the error. You can find it via View → Open log, then scroll to the bottom and look for the relevant error message. It could be useful for checking whether the nodes can be optimized.
I hope this was helpful; please feel free to get back with any other questions.
Thanks for your swift response. I increased the Xmx parameter to 32 GB and now it runs all the way through without any errors, even with 10 cross-calibration folds. It does use a lot of memory though, which is a bit surprising, since I would guess the calculations are not very demanding?
I did some tests with your workflow. Indeed, the calculations for conformal regression are not demanding. However, the tables you operate on contain 1000+ columns, which makes the overall table huge. I am pretty sure this is the reason: in my tests I deactivated the "Keep all columns" checkbox and instead selected the Structure column as the ID in the settings of the Calibrator and Predictor nodes. This significantly reduced the size of the data set, and RAM consumption dropped by a factor of three.
Of course you can use any other column as the ID; perhaps you could even generate one with the RowID node.
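For reference, the conformal step itself really is lightweight: essentially sorting calibration residuals and picking a quantile. Here is a minimal Python sketch of inductive conformal regression with an absolute-error nonconformity measure (names are illustrative, not the internals of the KNIME nodes):

```python
import numpy as np

def calibrate(y_calib, y_calib_pred):
    """Nonconformity scores: sorted absolute residuals on the calibration set."""
    return np.sort(np.abs(y_calib - y_calib_pred))

def predict_interval(y_pred, scores, alpha=0.2):
    """Interval for a new prediction at confidence level 1 - alpha."""
    n = len(scores)
    # Standard finite-sample correction: take the ceil((n + 1) * (1 - alpha))-th score
    k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    q = scores[k]
    return y_pred - q, y_pred + q

# Toy usage: five calibration points, one new prediction
scores = calibrate(np.array([1.0, 2.0, 0.5, 1.5, 3.0]),
                   np.array([1.1, 1.8, 0.9, 1.2, 2.5]))
lo, hi = predict_interval(2.0, scores, alpha=0.2)  # -> (1.5, 2.5)
```

The cost is just a sort over the calibration rows; the heavy part of the workflow is carrying the wide descriptor table through every loop iteration.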
Deactivating "Keep all columns" indeed makes execution much faster. However, this also removes the experimental data column (here the solubility), which later prevents me from calculating the Conformal Scorer (Regression) and Numeric Scorer statistics, or from plotting experimental vs. predicted values.
In that case you can do the following: keep only the ID column to reduce the size of the data set, and after the loop join the loop output back to the rest of the columns on that ID column. This way you can still use the Conformal Scorer node to evaluate the predictions.
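If it helps to see it outside KNIME, the rejoin is an ordinary inner join on the ID column (the equivalent of a Joiner node) — a toy pandas sketch, with all column names hypothetical:

```python
import pandas as pd

# Loop output: ID plus the predicted interval columns
loop_out = pd.DataFrame({"ID": ["m1", "m2"],
                         "lower": [0.1, 0.3], "upper": [0.9, 1.1]})
# Original table: ID plus the experimental solubility
original = pd.DataFrame({"ID": ["m1", "m2"],
                         "solubility": [0.5, 0.7]})

# Restores the experimental column next to the predictions
merged = loop_out.merge(original, on="ID", how="inner")
```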
Of course, rejoining with the original data would do the trick.
Given the workflow with cross-calibration, would it be possible to write the conformal model to a file for later use, i.e. read it into a separate workflow for predictions, like you can with e.g. an RF model?
Of course, the conformal prediction nodes are compatible with the full ML cycle.
In the simplest case you can just save the two tables that the Conformal Calibration Loop End node produces: the calibration tables and the models, which are serialized into tables (e.g. with Table Writer nodes).
In the prediction workflow you can read these tables back (e.g. with Table Reader nodes) and start the Conformal Prediction loop, providing new or test data.
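Conceptually, the saved calibration table is the only extra state the prediction side needs beyond the underlying model itself — a minimal Python sketch under the same absolute-error assumption as above (the file name is hypothetical):

```python
import numpy as np

# Load the sorted calibration scores saved by the calibration workflow
scores = np.load("calibration_scores.npy")

# New point predictions would come from the underlying regressor (e.g. the RF)
y_new_pred = np.array([0.42, -1.30])

# Same quantile rule as in the earlier sketch, here at 80% confidence
n = len(scores)
q = scores[min(int(np.ceil((n + 1) * 0.8)) - 1, n - 1)]
lower, upper = y_new_pred - q, y_new_pred + q
```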