Predictive Problem

So, I’ve built a workflow to predict 3 diseases using the decision tree algorithm, but the accuracy and all the other metrics always come out at 100%, and I don’t know why; it never misses a single one. I’ve already oversampled my data and I’ve used 3 different dictionaries, and it’s always 100%…
Can anyone help me? I will leave some screenshots of my problem.

It seems your answer is somewhere in the data. Can you do a simple correlation/crosstab of each variable in your final set against the target? Is the target itself included in the feature vector?
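
For anyone who wants to try that crosstab check outside KNIME, here is a minimal sketch in plain Python. The column names (`has_term_insulina`, `doc_class`, `Diagnostico`) and the rows are made up for illustration; the idea is only to flag any column whose values map one-to-one onto the target.

```python
from collections import Counter, defaultdict

def crosstab(rows, feature, target):
    """Count how often each (feature value, target value) pair occurs."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[feature]][row[target]] += 1
    return table

def determines_target(rows, feature, target):
    """True if every value of the feature co-occurs with exactly one target value."""
    return all(len(counts) == 1 for counts in crosstab(rows, feature, target).values())

# Hypothetical toy rows, not taken from the real files.
rows = [
    {"has_term_insulina": 1, "doc_class": "Diabetes", "Diagnostico": "Diabetes"},
    {"has_term_insulina": 0, "doc_class": "Cirrose", "Diagnostico": "Cirrose"},
    {"has_term_insulina": 0, "doc_class": "Aterosclerose", "Diagnostico": "Aterosclerose"},
    {"has_term_insulina": 1, "doc_class": "Diabetes", "Diagnostico": "Diabetes"},
    {"has_term_insulina": 0, "doc_class": "Diabetes", "Diagnostico": "Diabetes"},
]

# Columns that fully determine the target are leak suspects.
suspects = [col for col in ("has_term_insulina", "doc_class")
            if determines_target(rows, col, "Diagnostico")]
print(suspects)
```

A column that lands in `suspects` is not necessarily a leak, but it is the first place to look when every metric is 100%.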


Yes, I’m a beginner, but I think so. I can’t find the problem no matter what I do; even if I delete the dictionary, the accuracy and the other metrics are always 100%.

I have 3 diseases to predict, Diabetes Mellitus, Aterosclerose and Cirrose, based on the column Enquadramento 10.
This is a screenshot of 1 of the 3 Excel files, one per disease, with 4 columns each.

It is hard to judge from a screenshot, but as far as I can interpret it, terms suggesting something like diabetes appear in a lot of lines. So if your other two diseases are something completely different, the system might simply conclude that if something like ‘diabetes’ appears somewhere in the text, it is probably diabetes and not atherosclerosis.

So for this specific task the system might be correct. Or, if you have three files and you also pass in the row number, the files might differ in a systematic way that gives away the result through structure rather than insight. That would be useless in a future context.

So if the task was to find hints of diabetes in your text: congratulations, you have a good model. But if that is too simple, you might want to investigate further.

And I always advise testing what you want to happen in the real world. What would happen if you used the model on some new data? Would it be correct?

Especially with medical tasks, I would strongly advise caution before applying such results in the real world and starting to treat patients if you do not fully understand your model and your data.

Tools like KNIME are great, but they can also tempt you to jump to quick conclusions before you have fully established what is going on and learned about the perils of building models. So please keep asking questions. I would also advise discussing these results with someone who has domain knowledge (medical, in this case).

First, thanks for helping me.
This is for my university dissertation, and it is the first time I have worked with text mining; I’m only used to data mining.
I have many diseases, and I’ve changed the diseases to predict many times, and the results are always 100%. I have even put the same data from Aterosclerose into 2 inputs: for one I said the class is Aterosclerose and for the other I put Neoplasia, but the data was the same, and it still gives 100% accuracy. With this example it was supposed to miss sometimes, right?
Since the data were the same, or not?

Sorry, and thanks

Could you describe what the data contains and what you want to predict? I do not really understand it. The data seems to work for predicting the three diseases. But it is difficult to tell from a screenshot, and I am not an expert in text mining. If you could upload an example, we might be able to have a look, but I do not speak any Spanish; I was just able to detect the words that seemed to be related to diabetes.

You should make sure the Document Name is not giving away information, since it looks like it was derived from the Row-ID, which could be specific to each of the files you concatenated. Then you would have a classic leak.
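
One rough way to test for that kind of structural leak, sketched in plain Python: collect the smallest and largest ID seen for each class and check whether the ranges overlap. The IDs and labels below are invented, and the assumption that the IDs are numeric is mine; if the ranges are disjoint, a tree can separate the classes on the ID alone without reading any text.

```python
def id_ranges_by_class(rows):
    """Return {label: (min_id, max_id)} for (row_id, label) pairs."""
    ranges = {}
    for row_id, label in rows:
        lo, hi = ranges.get(label, (row_id, row_id))
        ranges[label] = (min(lo, row_id), max(hi, row_id))
    return ranges

def ranges_are_disjoint(ranges):
    """True if no two classes share any part of their ID range."""
    ordered = sorted(ranges.values())
    return all(prev_hi < next_lo
               for (_, prev_hi), (next_lo, _) in zip(ordered, ordered[1:]))

# Hypothetical IDs, as if each file had been numbered in its own block.
rows = [(1, "Diabetes"), (2, "Diabetes"),
        (101, "Cirrose"), (102, "Cirrose"),
        (201, "Aterosclerose")]

ranges = id_ranges_by_class(rows)
print(ranges, ranges_are_disjoint(ranges))
```

The check only means something with two or more classes, and overlapping ranges do not prove the ID is harmless; excluding ID-like columns from the learner is still the safer option.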

I have 4 columns in each Excel file. The columns include ID, Enquadramento10 and Diagnostico. So I want to use the text in the Enquadramento10 column to predict the Diagnostico (disease), for example Diabetes, Aterosclerose or Cirrose. The Enquadramento is what a health professional writes about a patient who is admitted to an intensive care unit, and from that I want to predict what the disease could be.
I think the problem is in the text processing. I’ve been at this for days and I can’t figure it out. I will upload my workflow; if anyone could take a look, I would much appreciate it.

I’m sorry for my bad English, and many thanks -> workflow

At first glance, it seems that the text data often contains the relevant disease keywords in the patient’s log. Are you predicting the obvious?

Maybe you should refocus the business question: find other predictor keywords in the data to identify any given disease, i.e. other than the disease keywords themselves.

For example, you could try TFxIDF rather than TF - this will reduce the weight of quasi stop words such as diabetes. It will also boost rarer terms.
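
To make the TF versus TFxIDF difference concrete, here is a small stdlib-only sketch. It uses relative term frequency and a plain log(N / document frequency) for the IDF part, which is only one of several common variants and not necessarily the formula the KNIME nodes apply. The tokens are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenised notes in which "diabetes" shows up everywhere.
docs = [
    ["diabetes", "insulina", "glicemia"],
    ["diabetes", "poliuria"],
    ["diabetes", "insulina", "polidipsia"],
]

weights = tf_idf(docs)
print(weights[0])
```

With plain TF, "diabetes" would carry as much weight as any other term in the first note. Here it drops to zero because it occurs in every document, while the rarer "glicemia" outweighs "insulina".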

More fiddly, you could build a custom dictionary to filter the disease keywords from the text before building the real predictor terms dictionary. You could do this after stemming to have a shorter list.
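
A minimal sketch of that filtering idea, assuming the text is already tokenised. The stem list is hypothetical and the prefix match is only a crude stand-in for a real stemmer, so a proper list would need someone who knows the vocabulary.

```python
# Hypothetical stems for the three target diseases; extend as needed.
BLOCKED_STEMS = ("diabet", "ateroscler", "cirros")

def strip_disease_terms(tokens, blocked_stems=BLOCKED_STEMS):
    """Drop tokens that start with one of the blocked stems (case-insensitive)."""
    return [tok for tok in tokens if not tok.lower().startswith(blocked_stems)]

tokens = ["Diabetes", "insulina", "diabetica", "poliuria"]
filtered = strip_disease_terms(tokens)
print(filtered)
```

Matching on stems keeps the list short, since one entry covers "Diabetes", "diabetica" and similar forms. Note that accented variants would not match these ASCII stems unless the text is normalised first.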

Thank you for the data. You have a classic tautology in your data. The model simply takes the variable “Document class”, which already contains the answer, so it is seemingly 100% correct. You have to get rid of the column.
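
This also explains the earlier experiment with the same texts labelled once as Aterosclerose and once as Neoplasia. The sketch below uses a simple lookup table as a stand-in for the decision tree (it is not what the KNIME learner does internally) and invented texts. It compares the accuracy with and without a column that copies the label.

```python
from collections import Counter, defaultdict

def fit_lookup(rows, features):
    """Memorise the majority label for each combination of feature values."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[f] for f in features)][row["label"]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def accuracy(model, rows, features):
    """Share of rows whose memorised label matches the true label."""
    hits = sum(model.get(tuple(row[f] for f in features)) == row["label"]
               for row in rows)
    return hits / len(rows)

# Hypothetical texts, each present once per label, as in the experiment.
texts = ["dor toracica", "placa carotidea", "estenose arterial"]
rows = [{"text": t, "doc_class": c, "label": c}
        for c in ("Aterosclerose", "Neoplasia") for t in texts]

with_leak = accuracy(fit_lookup(rows, ["text", "doc_class"]), rows, ["text", "doc_class"])
text_only = accuracy(fit_lookup(rows, ["text"]), rows, ["text"])
print(with_leak, text_only)
```

With identical texts under two labels, a model that only sees the text cannot do better than a coin flip. Getting 100% in that setup is a strong sign that the class column is reaching the learner.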

I do not fully understand what the workflow does. Maybe you want to check this example and see if it can help you:

Is there any specific reason for the SMOTE node after each single import? Typically it would be used to balance classes on a full data set. I am not sure what it does with a single category.
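
To illustrate why balancing needs the combined data, here is a sketch using plain random oversampling. This is a simpler technique than SMOTE, which synthesises new points rather than duplicating rows, and it is not what the SMOTE node does. The rows are invented.

```python
import random
from collections import Counter

def random_oversample(rows, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Hypothetical combined data set: 3 Diabetes, 1 Cirrose, 2 Aterosclerose.
combined = [("t1", "Diabetes"), ("t2", "Diabetes"), ("t3", "Diabetes"),
            ("t4", "Cirrose"),
            ("t5", "Aterosclerose"), ("t6", "Aterosclerose")]

balanced_counts = Counter(label for _, label in random_oversample(combined))
single_file = random_oversample([("t1", "Diabetes"), ("t2", "Diabetes")])
print(balanced_counts, len(single_file))
```

On the combined set every class is brought up to the size of the largest one. On a file with a single class there is nothing to balance against, so the rows come back unchanged.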