Conformal prediction loop end error

Hi,

I am trying to build a conformal regression model for solubility prediction. The molecules in the data set are described by RDKit descriptors and fingerprints. I based the workflow on Figure 11 in this paper:

Everything works fine until the very end, when the Conformal Prediction Loop End node stalls and reports:

ERROR Conformal Prediction Loop End 3:1075 Execute failed: Maximum unique values number too big

I am attaching my workflow for reference. Please advise.

Thanks/Evert

Solubility_Wang_2007_RF_conformal_cross_calibration_230131.knwf (135.1 KB)

Hello @evert.homan_scilifelab.se

I took a look at your workflow, and I actually managed to run it without any issues.
During execution I noticed that RAM consumption increased a lot (up to 45 GB) while the final loop was running. I believe this is the issue. I would therefore recommend reducing the number of iterations in the final loop, let's say from 10 to 3. Another thing you can try is to increase the maximum Java heap size (the -Xmx parameter) in the knime.ini file.

I would also kindly ask you to provide the full stack trace of the error. You can find it under View → Open log: scroll to the bottom and look for the relevant error message. It could help us see whether the nodes can be optimized.

I hope this was helpful. Please feel free to get back with any other questions.

Hi Artem,

Thanks for your swift response. I increased the Xmx parameter to 32 GB and now it runs all the way through without any errors, even with 10 cross-calibration folds. It does use a lot of memory though, which is a bit surprising, since I would guess the calculations are not very demanding?

Best wishes/Evert
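
For reference, the heap limit discussed above is set on the -Xmx line of knime.ini, which sits in the KNIME installation folder. JVM options go below the -vmargs line, one option per line. A minimal fragment, assuming the machine really has more than 32 GB of physical RAM to spare (all other lines of the file are omitted here and should be left untouched):

```
-vmargs
-Xmx32g
```

KNIME has to be restarted for the new value to take effect.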

Hello Evert,

I did some tests with your workflow. Indeed, the calculations for conformal regression are not demanding. However, the tables you operate on contain 1000+ columns, which makes the overall size of the table huge. I am pretty sure this is the reason, because in my test I deactivated the "Keep all columns" checkbox and instead selected the Structure column as an ID in the settings of the Calibrator and Predictor nodes. This significantly reduced the size of the data set, and RAM consumption dropped by a factor of 3.

Of course you can use any other column as the ID; perhaps you could even generate one with the RowID node.

I hope this will be helpful for you.

Deactivating 'Keep all columns' indeed makes it much faster to execute. However, this also removes the experimental data column (here the solubility), which prevents me from later calculating the Conformal Scorer (Regression) and Numeric Scorer statistics, or plotting experimental vs. predicted values.

Best wishes/Evert

In that case you can do the following: keep only the ID column inside the loop to reduce the size of the data set, and after the loop join the loop output with the rest of the columns on this ID column. This way you can still use the Conformal Scorer node to evaluate the predictions.
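
The join-back idea can be sketched outside KNIME in a few lines of Python. This is only an illustration of the logic, not what the nodes do internally: the row values, the column names (ID, Solubility, Prediction, Lower, Upper) and the helper join_on_id are all hypothetical placeholders.

```python
# Hypothetical original table: ID, experimental value, and one of many descriptor columns.
original = [
    {"ID": "mol_001", "Solubility": -1.64, "desc_1": 0.31},
    {"ID": "mol_002", "Solubility": 1.10, "desc_1": 0.72},
]

# Hypothetical loop output when only the ID column is carried through the loop.
loop_output = [
    {"ID": "mol_002", "Prediction": 0.95, "Lower": 0.05, "Upper": 1.85},
    {"ID": "mol_001", "Prediction": -1.50, "Lower": -2.40, "Upper": -0.60},
]


def join_on_id(left, right, key, keep):
    """Inner-join two lists of row dicts on `key`, copying only the `keep` columns from `right`."""
    lookup = {row[key]: row for row in right}  # assumes the key is unique in `right`
    joined = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **{col: match[col] for col in keep}})
    return joined


# Bring back only the experimental column needed for scoring and plotting.
scored_input = join_on_id(loop_output, original, key="ID", keep=["Solubility"])
for row in scored_input:
    print(row)
```

The join only works cleanly if the ID is unique per row. A structure column can contain duplicates, which is one reason to prefer an ID generated with the RowID node.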

Of course, rejoining with original data would do the trick.

Given the workflow with cross-calibration, would it be possible to write the conformal model to a file for later use, i.e. read it into a separate workflow for predictions, like you can with e.g. an RF model?

Thanks/Evert

Of course, the conformal prediction nodes are compatible with the full ML cycle.
The simplest way is to save the 2 tables that the Conformal Calibration Loop End node produces: the calibration tables and the models, which are serialized into tables.
In the prediction workflow you can read these tables and start the Conformal Prediction loop, providing new or test data.
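
To make the "save the calibration table, reuse it later" idea concrete, here is a rough Python sketch of plain split conformal regression with absolute residuals as nonconformity scores. It is not the implementation used by the KNIME nodes, it covers a single calibration fold rather than cross-calibration, and the numbers, the file name calibration.json and both helper functions are assumptions made for illustration. The trained model itself would have to be saved and reloaded separately.

```python
import json
import math
import os
import tempfile


def calibration_scores(y_true, y_pred):
    """Sorted absolute residuals of the calibration set (the nonconformity scores)."""
    return sorted(abs(t - p) for t, p in zip(y_true, y_pred))


def half_width(scores, error_rate):
    """Score at rank ceil((n + 1) * (1 - error_rate)); infinite if that rank exceeds n."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - error_rate))
    return scores[rank - 1] if rank <= n else math.inf


# "Training" side: compute the calibration scores and write them to disk.
# Hypothetical experimental values and model predictions for 10 calibration molecules.
cal_true = [-1.0, -2.5, 0.3, -3.1, -0.7, -4.0, -1.8, 0.9, -2.2, -3.6]
cal_pred = [-1.2, -2.1, 0.0, -3.6, -0.6, -3.3, -2.6, 0.0, -1.2, -3.6]
path = os.path.join(tempfile.mkdtemp(), "calibration.json")
with open(path, "w") as fh:
    json.dump({"scores": calibration_scores(cal_true, cal_pred)}, fh)

# "Prediction" side: read the stored scores back and build an interval.
with open(path) as fh:
    stored = json.load(fh)

new_prediction = -2.0  # hypothetical point prediction from the reloaded model
width = half_width(stored["scores"], error_rate=0.2)
interval = (new_prediction - width, new_prediction + width)
print(interval)
```

The point of the sketch is that the prediction side needs only two things from the training side: the stored calibration scores and the model that produces the point prediction. That matches the two tables mentioned above.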
