Learning Machine Learning: Help with Decision Tree

aworker · August 25, 2022, 11:57am

Thanks for your answers and information you provided to better understand the problem.

If I understood well the logic of your process:

You know the rules that were used to label the samples into two classes because you built them from a given set of variables, which I will call “former variables” from here on.
I guess the rules that generated the underlying sample classification are simple and pretty similar to those of a decision tree, i.e. “If Variable > Value then … else …”
You want to use a Decision Tree (DT) to solve this classification problem, but this time based on variables other than those that were used to establish the initial rules for classifying the samples. I will call them “latter variables” from here on.
You expect that the words extracted from the text associated with the samples should in theory be informative enough to establish new sensible rules inferred by a DT. However, so far this does not seem to be the case, because the DT performance (Accuracy & Cohen’s Kappa) based on these different variables (words) is near random.

If all the above is true, then I would start by evaluating the quality of your latter variables (words) using a correlation measure against the former variables that were used to generate the underlying rules and eventually label the samples into two classes.

For this you would need to do the following: for every latter variable, calculate for instance the Pearson Correlation Coefficient (PCC) w.r.t. every former variable, to check whether the latter variables are informative enough to be used to train your DT instead of the former. This would be the case if, for every former variable, you find at least one latter variable with a PCC value higher than +0.5 or lower than -0.5. If so, these are the latter variables that are significant enough to be used in your DT. This is a kind of supervised selection of latter variables based on the former variables. You could for instance shortlist the latter variable most correlated with each former variable, provided it is beyond a given PCC threshold, and see whether you end up with the same number of latter variables as former ones. If this is the case, maybe you will be able to infer a bit of explanation from this correlation between key words and your initial rules and variables.
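
To make the selection step above more concrete, here is a minimal sketch in plain Python rather than a KNIME workflow. The variable names and values (`price`, `rating`, the word-count columns) are invented purely for illustration and are not taken from your data, and the helper functions `pearson` and `shortlist` are hypothetical names of my own.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)

def shortlist(former, latter, threshold=0.5):
    """For each former variable, keep the most correlated latter variable
    as (name, PCC) if |PCC| is beyond the threshold, otherwise None."""
    selected = {}
    for f_name, f_values in former.items():
        best_name, best_pcc = None, 0.0
        for l_name, l_values in latter.items():
            pcc = pearson(f_values, l_values)
            if abs(pcc) > abs(best_pcc):
                best_name, best_pcc = l_name, pcc
        selected[f_name] = (best_name, best_pcc) if abs(best_pcc) > threshold else None
    return selected

# Hypothetical toy data: two former variables and three word-count columns.
former = {
    "price": [10, 20, 30, 40, 50, 60],
    "rating": [5, 4, 4, 2, 1, 1],
}
latter = {
    "expensive": [0, 1, 1, 3, 4, 5],
    "great": [3, 2, 2, 1, 0, 0],
    "the": [3, 1, 2, 3, 1, 2],
}

result = shortlist(former, latter)
for name, match in result.items():
    print(name, "->", match)
```

On this toy data each former variable gets exactly one shortlisted word, while the uninformative word `the` is never selected. With real word vectors you would expect many more candidates and probably weaker correlations.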

Note: The +/- 0.5 threshold is a hard one set here, but you can refine it later. Other metrics are possible too, for instance the Tetrachoric correlation, but PCC should be good enough to start with.
B.t.w. the -Linear Correlation- node can handle several linear metrics, i.e. Pearson and Nominal (which should do too instead of tetrachoric).

I would also try the opposite, i.e. use your former variables to train a DT directly and, once it is trained, check its performance and extract its rules to compare them with the ones you set up initially for the class labeling. They should be very similar if the original rules were of the type “If Variable > Value then … else …” as stated before. This second test is an interesting exercise to learn how a DT works.
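
As a rough illustration of this second test, here is a small sketch in plain Python. It is not the KNIME Decision Tree Learner, only a simplified tree that splits on Gini impurity. The labeling rule (class A if price > 35 and rating > 2, otherwise B), the data and the function names are all assumptions made up for the example.

```python
def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels, names):
    """Return (weighted impurity, variable index, threshold) of the best
    binary split, or None if no variable has two distinct values."""
    best = None
    for j in range(len(names)):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def extract_rules(rows, labels, names, depth=0, max_depth=3):
    """Grow a small tree and return its rules as indented text lines."""
    pad = "  " * depth
    split = best_split(rows, labels, names) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        majority = max(sorted(set(labels)), key=labels.count)
        return [f"{pad}class = {majority}"]
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    rules = [f"{pad}if {names[j]} > {t}:"]
    rules += extract_rules([r for r, _ in right], [y for _, y in right], names, depth + 1, max_depth)
    rules += [f"{pad}else:"]
    rules += extract_rules([r for r, _ in left], [y for _, y in left], names, depth + 1, max_depth)
    return rules

# Hypothetical former variables (price, rating), labeled with the assumed
# rule: class A if price > 35 and rating > 2, otherwise class B.
names = ["price", "rating"]
rows = [(10, 5), (20, 1), (25, 3), (30, 4), (40, 5), (45, 2), (50, 3), (55, 4), (60, 1)]
labels = ["A" if price > 35 and rating > 2 else "B" for price, rating in rows]

rules = extract_rules(rows, labels, names)
print("\n".join(rules))
```

On this toy data the extracted rules split on price at 35.0 and then on rating at 2.5, which mirrors the rule used to create the labels. The thresholds are midpoints between observed values, so they will only match your original rules approximately.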

Hope these hints are useful @badger101

Best wishes,
Ael

badger101 · August 25, 2022, 12:35pm

Thank you @aworker.

Is it possible that you can show this with a workflow? Different people learn things differently. I’m a visual learner, so I understand things better when looking at images (i.e. workflow). Does this require a different dataset than the last one I provided to you? If so, would you be kind enough to create a dummy dataset & explain this by showing a workflow? Sorry for troubling you.

aworker · August 25, 2022, 12:46pm

Hi @badger101

No problem at all. I will build a simple workflow to complement my text.

I’ll be back soon

Best
Ael

aworker · September 4, 2022, 8:45am

Hi @badger101

Sorry for my late reply. I have been working out of office during the last two weeks with almost no time to dedicate to the KNIME forum.

As you guessed in your last answer, this algorithm indeed needs the original data that was used to determine the classes of your final data set. Inventing a similar data set on my side to provide you with an example is not easy and could be misleading. It would be much easier if you could share your initial dataset, but as far as I understood, it cannot be shared here. Could you please share it privately by email? If so, please get in touch by email and I will build the workflow from there.

Thanks & regards,
Ael

badger101 · September 4, 2022, 11:51am

Hi @aworker , I’ll send you an email privately. I think I can get it done in 6 hours; I’ll have to do some cleanups, some annotations to make the workflow understandable, and a major modification that changes the way each document is represented by the termspaces.

Is it alright if I reach out to you via your official pikairos email address (the one with the initials C.M.), or do you prefer another address?

Again, thank you for being willing to help out!

aworker · September 4, 2022, 2:45pm

Hi @badger101

Yes, it is perfectly fine
Best
Ael

badger101 · September 4, 2022, 6:07pm

Alright, I just finished emailing the data/workflow to you. My email address is my real name, 10 letters with no numbers, starting with N.

Thanks a lot @aworker !

aworker · September 5, 2022, 6:28am

Hi @badger101

My pleasure. I replied by email too.
Talk to you soon !
Best
