Urgent - What is wrong with my decision tree predictor for new data

Hi All,

Can anyone let me know how to solve the problem with my Predictor node, which shows the error “Learning column XXX not found in the input data to be predicted”?

I have created a model for text processing and then used a decision tree to train the model to categorize the documents into two categories. The first part has worked perfectly with 98.9% accuracy. However, when I wanted to use the same learning model on new data, it showed the error described above.

Can anyone please help me solve this issue?

I am sharing a link to download my KNIME workflow.

I look forward to your replies.

Thanks much.

Hi,

There is a column named “XXX” in your training data set which is now missing in your current data set to which you are going to apply the prediction model.

Best,
Armin

Hi Armin,

I am sorry for phrasing the sentence wrong. It actually says “Learning column ANG II receptor not found in the input data to be predicted”. When I deleted that column from the learning data, it reported some other column as not found.

Basically, I have created a document vector for my training data which is used for decision tree learning. Now I want to use the same model for prediction on new data.

I may be wrong, but what I understood is that the vector created by the Document Vector node during the learning process and during the prediction process is not the same, so there are discrepancies in the columns.

So I don’t know how to overcome/correct this kind of problem. I have already given a link to download my workflow, so if you look at it you might understand clearly what I am trying to say.

Please suggest.

A few things:

  • You will have to apply the whole preprocessing you did for your training to your new data as well; otherwise the data will not have the same structure and the model cannot be applied.
  • You are not splitting your data into training and test sets for the development of your model, so the very high score is not very meaningful, since training and evaluation use the same data.
  • You will have to make sure that the answer is not encoded somewhere in the data (in a form that will not be present in any future data you might want to score).
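
The first bullet is the root cause of the “Learning column not found” error: training and new data were vectorized independently, so the columns differ. A minimal sketch in Python with scikit-learn (as a stand-in for reusing the KNIME Document Vector model; the example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["losartan blocks the ANG II receptor", "aspirin thins the blood"]
new_docs = ["the ANG II receptor is blocked by candesartan"]

# Fit the vocabulary on the training documents only ...
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# ... and reuse the SAME fitted vectorizer for new data, so the resulting
# matrix has exactly the training columns (unknown words are simply
# dropped instead of creating new columns that the model never saw).
X_new = vectorizer.transform(new_docs)

assert X_train.shape[1] == X_new.shape[1]
```

Fitting a second vectorizer on the new documents instead would produce a different set of columns, which is exactly the mismatch the Predictor node complains about.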

I will have a look and see if I can find a fix.

If you want to read about yes/no models you can follow these links; you will also find some example workflows you could use (the data preparation would still have to be yours):

Understand metrics like AUC and Gini (and use H2O.ai)

Models for Multiclass Targets:



Hi mlauber71,

I have tried splitting the data and also applied the whole preprocessing to my new data, but it still shows the same error. This time it says “Learning column prelosartan not found in the input data to be predicted”.

I have shared the link to my modified workflow.

As I am new to KNIME, I still don’t know how all these nodes work, so please bear with me.

Thanks for your support.

Hi Bubly0826,

I changed your workflow and I hope I came up with a solution. I have adapted your text preparation and stored the Document Vector as a model, which is then reused to prepare the data to be scored. I also did the split into training and test. The score still seems to be quite good (0.965). But please put such an accuracy into the perspective of your business task: if you sell insurance with such an accuracy, they will name the corporate headquarters after you; if you treat patients with an experimental drug and 0.035 of them die, they might put you in jail.

I have not fully checked all your text preparations; I just tried to make sure they do not ‘leak’ any specific information about the category that would not be present in any future texts. You still might want to check whether all these preparations are good (I am not an expert in text mining).

Here is the workflow in a reset state. The file “newly_scored_data.xlsx” is your new data with added columns for the predicted category and a numeric score (how sure the model is that it has the right category).
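
For readers outside KNIME, the overall shape of that workflow - shared vectorizer, train/test split, decision tree, scoring new data with a confidence column - could be sketched in Python with scikit-learn roughly like this (documents, labels, and numbers are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

docs = ["heart drug trial", "receptor blocker study", "stock market report",
        "quarterly earnings call", "blood pressure medication", "bond yields rise"] * 10
labels = ["medical", "medical", "finance", "finance", "medical", "finance"] * 10

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Hold out a test set so accuracy is measured on documents the tree never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Score genuinely new data with the same vectorizer and model;
# predict_proba provides the "how sure is the model" column.
X_new = vectorizer.transform(["new drug blocks the receptor"])
print(model.predict(X_new), model.predict_proba(X_new).max())
```

The key point mirrors the KNIME fix: the vectorizer fitted on the training documents is the one applied to every later batch of data.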

kn_example_document_prediction.knar (3.1 MB)

You will have to add your documents into the /data/ folder and see if that works on your machine.


You might want to toy around with a few additional models (H2O), although your score already seems pretty good. You could store the pre-prepared data in KNIME tables and use them to experiment with additional models.


If the data preparation is any good (please check that, since you seem to have put a lot of effort into it), then using XGBoost, the latest star in the modelling universe, you can (well) boost your accuracy to 97.941% - and yes, at a certain point overfitting might set in.

m_010_xgboost_tree.knwf (190.4 KB)

xgboost_tree_model.zip (156.4 KB)


I attached the whole workflow in a slightly new version, now also including the XGBoost and H2O models.

Maybe at some point you could elaborate on your document preparation (now in the meta node) - that could be illustrative for other people too.

H2O gives no better accuracy, but GBM can provide you with a list of variable importances. That is useful for checking whether the whole thing makes sense: for example, if a variable that might contain a ‘leak’ shows up at the top, you would notice.
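
To illustrate the leak-spotting idea outside H2O, here is a sketch with scikit-learn's GradientBoostingClassifier on synthetic data, where one feature is deliberately the target in disguise (all names and data are invented):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Two noise features plus one 'leak' column that is just the target itself.
X = np.column_stack([rng.normal(size=200), rng.normal(size=200), y])

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# The leak dominates the importance ranking - that domination is the warning sign.
for name, imp in zip(["noise_a", "noise_b", "leak"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

In a real text-mining workflow the "leak" would be subtler, e.g. a term that only appears in documents of one category because of how the training set was assembled.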

For the XGBoost I also added the scoring of new data from the m_001 workflow with the original decision tree.

I am a little bit obsessed with the preservation of IDs, because if you want to bring such a thing into production, the question will always be how to identify the cases/customers, and often you have to match the results back to some external data source. So please take extra care with IDs, customer numbers, etc.

kn_example_document_prediction.knar (3.8 MB)


Hi mlauber71,

I really appreciate your time and effort in solving my issue. For a starter like me in KNIME, this is a lot to digest. I shall look into each of the modifications/solutions you suggested and come back to you.

Thanks a ton for your help. I really, really appreciate it.


Glad if I could be of any help. Take a good look at the workflows and see if the results help you solve your issue. And again: check and maybe explain your text preparation, because the whole ‘magic’ rests on that :)

It is also good to check by hand a few random items whose real answer you know, or to let some experts check them. Model building is shifting more and more from high fancy statistics into software engineering, at least for the many people who are not at the forefront of developing new algorithms. But understanding your data, asking the right questions, and interpreting the answers will not go out of fashion any time soon.

In your example you now have two categories and you measure accuracy just by the prediction. If you move further you might have multiple classes; then I would advise looking at metrics for multi-class classification problems like log loss, since it also takes into account how confident/close the prediction was. If you have time you might want to read these entries:
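
As a quick illustration of why log loss is informative here, compare two models that make the same (correct) predictions but with different confidence (toy numbers, computed with scikit-learn):

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1]

# Both probability sets yield the same accuracy (all three correct),
# but the hesitant model pays a higher log loss than the confident one,
# because log loss rewards well-calibrated confidence, not just the label.
confident = [[0.95, 0.05], [0.05, 0.95], [0.10, 0.90]]
hesitant  = [[0.60, 0.40], [0.40, 0.60], [0.45, 0.55]]

print("confident:", log_loss(y_true, confident))
print("hesitant: ", log_loss(y_true, hesitant))
```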


Hi mlauber71,

The models you shared have solved my problems. I am very grateful to you. Thank you so much.

I am trying to take this to the next level for my study, where I definitely need your help, or maybe you could suggest something so that I can move forward with my requirement.

Actually, I need to classify/tag a single document into multiple categories/buckets.

I have attached an Excel file where you can see a column named “Data” which contains textual information. Based on some part of the whole text under “Data”, maybe a sentence, the document is tagged under each of the categories Category A to Category E. Each of these sub-texts is a part of the corresponding whole text.

So, I want to create a model that I can train to categorize a single document into multiple categories. The learning model will use an Excel file with the same format (as attached) but with actual textual data. When I give it test data, where my test data Excel sheet will contain only the two columns “Unique ID” and “Data”, the model should be able to categorize the document into categories A to E.
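
This is not part of the original workflow, but the task described is a classic multi-label setup; a rough sketch of one way to approach it in Python with scikit-learn (documents, tags, and category names are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["receptor blocker lowers blood pressure",
        "trial reports side effects and dosage",
        "blood pressure and dosage in elderly patients"]
tags = [{"Category A"}, {"Category B", "Category C"}, {"Category A", "Category B"}]

# One binary 0/1 column per category, so a document can carry several tags.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# One classifier per category; each decides its own tag independently.
model = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

pred = model.predict(vectorizer.transform(["blood pressure dosage"]))
print(mlb.inverse_transform(pred))
```

The same principle carries over to KNIME: instead of one model with classes A to E, train one yes/no model per category and apply them all to each new document.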

Sample Data.xlsx (9.6 KB)

Thanks


Glad I could help you with the models. For the next question it would be good if you could elaborate further on the nature of the data. Are we dealing with exact sentences/phrases from the whole text, or are they just patterns that can appear in different forms? That might well influence the path forward. If they are exact phrases, then why not just search for them in order to characterise the outcome?

In that case they would just be the category. Or you could save the patterns and see if you could first use some sort of subset matcher or similarity score to create a further input for the model, one that you would be able to reproduce in a real-world scenario, like:

  • Save the subset strings with their category, use a similarity score or subset matcher (or both), and see how well that works with regard to the original string.
  • Use that result as an input for the ‘real’ classification model, where you give that score and the assumed category from the subsets (or a score for each category), together with the real category, to train the thing.
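
The subset-matcher/similarity-score idea from these bullets can be prototyped with Python's standard library difflib before reaching for anything heavier (the stored phrases, categories, and the best_match helper are all made up for illustration):

```python
from difflib import SequenceMatcher

# Stored subset phrases with the category they indicate.
patterns = {
    "receptor blocker": "Category A",
    "side effects": "Category B",
}

def best_match(text: str) -> tuple[str, str, float]:
    """Return (phrase, category, score) for the pattern whose longest
    common substring with the text covers the largest share of the
    phrase - a crude subset matcher."""
    best = ("", "", 0.0)
    for phrase, category in patterns.items():
        match = SequenceMatcher(None, phrase, text).find_longest_match(
            0, len(phrase), 0, len(text))
        score = match.size / len(phrase)
        if score > best[2]:
            best = (phrase, category, score)
    return best

print(best_match("the drug is a well known receptor blocker"))
```

A score of 1.0 means the stored phrase reappears verbatim; lower scores would feed the downstream model as a soft signal, as sketched in the second bullet.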

The challenge would then be to implement that on any new dataset, and you would have to rely on the assumption that the phrases reappear in a similar pattern.

It would make sense to have an example that actually represents the challenge at hand.

Also, someone with more experience in text analytics might pitch in and share some thoughts. This could be a new forum entry.
