Urgent - What is wrong with my decision tree predictor for new data

Hi All,

Can anyone let me know how to solve the problem with my Predictor node, which shows the error “Learning column XXX not found in the input data to be predicted”?

I have created a model for text processing and then used a decision tree to train the model to categorize the documents into two categories. The first part has worked perfectly with 98.9% accuracy. However, when I wanted to use the same learning model on new data, it showed the error described above.

Can anyone please help me solve this issue?

I am sharing a link to download my KNIME workflow.

I look forward to your replies.

Thanks much.

Hi,

There is a column named “XXX” in your training data set which is now missing in your current data set to which you are going to apply the prediction model.

Best,
Armin

Hi Armin,

I am sorry for phrasing the sentence wrong. It actually says “Learning column ANG II receptor not found in the input data to be predicted”. When I deleted that column from the learning data, it reported some other column as not found.

Basically, I have created a document vector for my training data which is used for decision tree learning. Now I want to use the same model for prediction on new data.

I may be wrong, but what I understood is that the vector created by the Document Vector node during the learning process and during the prediction process is not the same, so there are discrepancies in the columns.

So I don’t know how to overcome/correct this kind of problem. I have already given a link to download my workflow, so if you look at it you might understand clearly what I am trying to say.

Please suggest.

A few things:

  • You will have to apply the whole preprocessing you did for your training to your new data as well; otherwise the data will not have the same structure and the model cannot be applied.
  • You are not splitting your data into training and test sets for the development of your model, so the very high score is not very meaningful, since training and evaluation use the same data.
  • You will have to make sure that the answer is not encoded somewhere in the data (in a form that will not be present in any future data you might want to score).
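
The first bullet is the root cause of the “Learning column not found” error: training and new data were vectorized independently, so the columns differ. A minimal sketch in Python with scikit-learn (as a stand-in for reusing the KNIME Document Vector model; the example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["losartan blocks the ANG II receptor", "aspirin thins the blood"]
new_docs = ["the ANG II receptor is blocked by candesartan"]

# Fit the vocabulary on the training documents only ...
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# ... and reuse the SAME fitted vectorizer for new data, so the resulting
# matrix has exactly the training columns (unknown words are simply
# dropped instead of creating new columns that the model never saw).
X_new = vectorizer.transform(new_docs)

assert X_train.shape[1] == X_new.shape[1]
```

Fitting a second vectorizer on the new documents instead would produce a different set of columns, which is exactly the mismatch the Predictor node complains about.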

I will have a look and see if I can find a fix.

If you want to read about yes/no models you can follow these links; you will also find some example workflows you could use (the data preparation would still have to be yours):

Understand metrics like AUC and Gini (and use H2O.ai)

Models for Multiclass Targets:



Hi mlauber71,

I have tried splitting the data and also applied the whole preprocessing to my new data, but it still shows the same error. This time it says “Learning column prelosartan not found in the input data to be predicted”.

I have shared the link to my modified workflow.

As I am new to KNIME, I still don’t know how all these nodes work, so please bear with me.

Thanks for your support.

Hi Bubly0826,

I changed your workflow and I hope I came up with a solution. I have adapted your text preparation and stored the Document Vector as a model, which is then reused to prepare the data to be scored. I also did the split into training and test. The score still seems to be quite good (0.965). But please put such an accuracy into the perspective of your business task: if you sell insurance with such an accuracy, they will name the corporate headquarters after you; if you treat patients with an experimental drug and 0.035 of them die, they might put you in jail.

I have not fully checked all your text preparations; I just tried to make sure they do not ‘leak’ any specific information about the category that would not be present in any future texts. You still might want to check whether all these preparations are good (I am not an expert in text mining).

Here is the workflow in a reset state. The file “newly_scored_data.xlsx” is your new data with added columns for the predicted category and a numeric score (how sure the model is that it has the right category).
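
For readers outside KNIME, the overall shape of that workflow - shared vectorizer, train/test split, decision tree, scoring new data with a confidence column - could be sketched in Python with scikit-learn roughly like this (documents, labels, and numbers are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

docs = ["heart drug trial", "receptor blocker study", "stock market report",
        "quarterly earnings call", "blood pressure medication", "bond yields rise"] * 10
labels = ["medical", "medical", "finance", "finance", "medical", "finance"] * 10

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Hold out a test set so accuracy is measured on documents the tree never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Score genuinely new data with the same vectorizer and model;
# predict_proba provides the "how sure is the model" column.
X_new = vectorizer.transform(["new drug blocks the receptor"])
print(model.predict(X_new), model.predict_proba(X_new).max())
```

The key point mirrors the KNIME fix: the vectorizer fitted on the training documents is the one applied to every later batch of data.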

kn_example_document_prediction.knar (3.1 MB)

You will have to add your documents into the /data/ folder and see if that works on your machine.


You might want to toy around with a few additional models (H2O), although your score already seems pretty good. You could store the pre-prepared data in KNIME tables and use them to experiment with additional models.


If the data preparation is any good (please check that, since you seem to have put a lot of effort into it), then using XGBoost, the latest star in the modelling universe, you can (well) boost your accuracy to 97.941% - and yes, at a certain point overfitting might set in.

m_010_xgboost_tree.knwf (190.4 KB)

xgboost_tree_model.zip (156.4 KB)


I attached the whole workflow in a slightly new version, now also including the XGBoost and H2O models.

Maybe at some point you could elaborate on your document preparation (now in the meta node) - that could be illustrative for other people too.

H2O gives no better accuracy, but GBM can provide you with a list of variable importances. That is useful for checking whether the whole thing makes sense: for example, if a variable that might contain a ‘leak’ shows up at the top, you would notice.
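
To illustrate the leak-spotting idea outside H2O, here is a sketch with scikit-learn's GradientBoostingClassifier on synthetic data, where one feature is deliberately the target in disguise (all names and data are invented):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Two noise features plus one 'leak' column that is just the target itself.
X = np.column_stack([rng.normal(size=200), rng.normal(size=200), y])

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# The leak dominates the importance ranking - that domination is the warning sign.
for name, imp in zip(["noise_a", "noise_b", "leak"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

In a real text-mining workflow the "leak" would be subtler, e.g. a term that only appears in documents of one category because of how the training set was assembled.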

For the XGBoost I also added the scoring of new data from the m_001 workflow with the original decision tree.

I am a little bit obsessed with the preservation of IDs, because if you want to bring such a thing into production, the question will always be how to identify the cases/customers, and often you have to match the results back to some external data source. So please take extra care with IDs, customer numbers, etc.

kn_example_document_prediction.knar (3.8 MB)


Hi mlauber71,

I really appreciate your time and effort in solving my issue. For a starter like me in KNIME, this is a lot to digest. I shall look into each of the modifications/solutions you suggested and come back to you.

Thanks a ton for your help. I really, really appreciate it.


Glad if I could be of any help. Take a good look at the workflows and see if the results help you solve your issue. And again: check and maybe explain your text preparation, because the whole ‘magic’ rests on that :)

It is also good to check by hand a few random items whose real answer you know, or to let some experts check them. Model building is shifting more and more from high fancy statistics into software engineering, at least for the many people who are not at the forefront of developing new algorithms. But understanding your data, asking the right questions, and interpreting the answers will not go out of fashion any time soon.

In your example you now have two categories and you measure accuracy just by the prediction. If you move further you might have multiple classes; then I would advise looking at metrics for multi-class classification problems like log loss, since it also takes into account how confident/close the prediction was. If you have time you might want to read these entries:
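
As a quick illustration of why log loss is informative here, compare two models that make the same (correct) predictions but with different confidence (toy numbers, computed with scikit-learn):

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1]

# Both probability sets yield the same accuracy (all three correct),
# but the hesitant model pays a higher log loss than the confident one,
# because log loss rewards well-calibrated confidence, not just the label.
confident = [[0.95, 0.05], [0.05, 0.95], [0.10, 0.90]]
hesitant  = [[0.60, 0.40], [0.40, 0.60], [0.45, 0.55]]

print("confident:", log_loss(y_true, confident))
print("hesitant: ", log_loss(y_true, hesitant))
```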


Hi mlauber71,

The models you shared have solved my problems. I am very grateful to you. Thank you so much.

I am trying to take this to the next level for my study, where I definitely need your help, or maybe you could suggest something so that I can move forward with my requirement.

Actually, I need to classify/tag a single document into multiple categories/buckets.

I have attached an Excel file where you can see a column named “Data” which contains textual information. Based on some part of the whole text under “Data”, maybe a sentence, the document is tagged under each of the categories Category A to Category E. Each of these sub-texts is a part of the corresponding whole text.

So, I want to create a model that I can train to categorize a single document into multiple categories. The learning model will use an Excel file with the same format (as attached) but with actual textual data. When I give it test data, where my test data Excel sheet will contain only the two columns “Unique ID” and “Data”, the model should be able to categorize the document into categories A to E.
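
This is not part of the original workflow, but the task described is a classic multi-label setup; a rough sketch of one way to approach it in Python with scikit-learn (documents, tags, and category names are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["receptor blocker lowers blood pressure",
        "trial reports side effects and dosage",
        "blood pressure and dosage in elderly patients"]
tags = [{"Category A"}, {"Category B", "Category C"}, {"Category A", "Category B"}]

# One binary 0/1 column per category, so a document can carry several tags.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# One classifier per category; each decides its own tag independently.
model = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

pred = model.predict(vectorizer.transform(["blood pressure dosage"]))
print(mlb.inverse_transform(pred))
```

The same principle carries over to KNIME: instead of one model with classes A to E, train one yes/no model per category and apply them all to each new document.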

Sample Data.xlsx (9.6 KB)

Thanks


Glad I could help you with the models. For the next question it would be good if you could elaborate further on the nature of the data. Are we dealing with exact sentences/phrases from the whole text, or are they just patterns that can appear in different forms? That might well influence the path forward. If they are exact phrases, then why not just search for them in order to characterise the outcome?

In that case they would just be the category. Or you could save the patterns and see if you could first use some sort of subset matcher or similarity score to create a further input for the model, one that you would be able to reproduce in a real-world scenario, like:

  • Save the subset strings with their category, use a similarity score or subset matcher (or both), and see how well that works with regard to the original string.
  • Use that result as an input for the ‘real’ classification model, where you give that score and the assumed category from the subsets (or a score for each category), together with the real category, to train the thing.
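
The subset-matcher/similarity-score idea from these bullets can be prototyped with Python's standard library difflib before reaching for anything heavier (the stored phrases, categories, and the best_match helper are all made up for illustration):

```python
from difflib import SequenceMatcher

# Stored subset phrases with the category they indicate.
patterns = {
    "receptor blocker": "Category A",
    "side effects": "Category B",
}

def best_match(text: str) -> tuple[str, str, float]:
    """Return (phrase, category, score) for the pattern whose longest
    common substring with the text covers the largest share of the
    phrase - a crude subset matcher."""
    best = ("", "", 0.0)
    for phrase, category in patterns.items():
        match = SequenceMatcher(None, phrase, text).find_longest_match(
            0, len(phrase), 0, len(text))
        score = match.size / len(phrase)
        if score > best[2]:
            best = (phrase, category, score)
    return best

print(best_match("the drug is a well known receptor blocker"))
```

A score of 1.0 means the stored phrase reappears verbatim; lower scores would feed the downstream model as a soft signal, as sketched in the second bullet.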

The challenge would then be to implement that on any new dataset, and you would have to rely on the assumption that the phrases reappear in a similar pattern.

It would make sense to have an example that actually represents the challenge at hand.

Also, someone with more experience in text analytics might pitch in and share some thoughts. This could be a new forum entry.
