Hi @badger101
Thanks for your answers and the information you provided, which help to better understand the problem.
If I understood the logic of your process correctly:
- You know the rules that were used to label the samples into two classes because you built them based on a given set of variables, which I will call “former variables” from here on.
- I guess the rules that generated the underlying sample classification are simple and pretty similar to those of a decision tree, i.e. “If Variable > Value then … else …”
- You want to use a Decision Tree (DT) to solve this classification problem, but this time based on variables that are not those used to establish the initial rules that classified the samples. I will call them “latter variables” from here on.
- You expect that the words extracted from the text associated with the samples should in theory be informative enough to establish new sensible rules inferred by a DT. However, so far this does not seem to be the case, because the DT performance (Accuracy & Cohen's Kappa) based on these different variables (words) is near random.
If all of the above is true, then I would start by evaluating the quality of your latter variables (words) using a correlation measure against the former variables that were used to generate the underlying rules and eventually label the samples into two classes.
To do this, for every latter variable, calculate for instance the Pearson Correlation Coefficient (PCC) w.r.t. every former variable, to check whether the latter variables are informative enough to be used to train your DT instead of the former ones. This would be the case if, for every former variable, you find at least one latter variable with a PCC value higher than +0.5 or lower than -0.5. If so, these are the latter variables that are significant enough to be used in your DT. This is a kind of supervised selection of latter variables based on the former variables. You could for instance shortlist, for every former variable, its most correlated latter variable, provided it is beyond a given PCC threshold, and see whether you end up with the same number of latter variables as former variables. If this is the case, maybe you will be able to infer a bit of explanation from this correlation between key words and your initial rules and variables.
Note: The +/- 0.5 threshold is a hard one set here as a starting point, but you can refine it later. Other metrics are possible too, for instance the tetrachoric correlation, but PCC should be good enough to start with.
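To make the shortlisting step concrete, here is a minimal sketch in plain Python (outside KNIME). The variable names and toy values are purely hypothetical, and the word columns are assumed to be encoded as 0/1 presence flags. It is only an illustration of the idea, not a replacement for the correlation node in your workflow.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: PCC is undefined, treated as 0 in this sketch
    return cov / sqrt(var_x * var_y)

def shortlist(former, latter, threshold=0.5):
    """For each former variable, keep its most correlated latter variable
    (by absolute PCC) if it is beyond the threshold, otherwise None."""
    selected = {}
    for f_name, f_values in former.items():
        best_name, best_r = None, 0.0
        for l_name, l_values in latter.items():
            r = pearson(f_values, l_values)
            if abs(r) > abs(best_r):
                best_name, best_r = l_name, r
        selected[f_name] = (best_name, best_r) if abs(best_r) > threshold else None
    return selected

# Hypothetical toy data: two former variables and three word-presence flags
former = {"price": [10, 20, 30, 40, 50, 60], "age": [5, 3, 6, 2, 7, 1]}
latter = {
    "word_expensive": [0, 0, 0, 1, 1, 1],
    "word_old": [1, 0, 1, 0, 1, 0],
    "word_the": [1, 1, 0, 0, 1, 1],
}

result = shortlist(former, latter)
for former_name, match in result.items():
    print(former_name, "->", match)
```

If any former variable ends up with `None`, none of the words passes the threshold for it, which would already hint at why the DT trained on words performs near random.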
By the way, the -Linear Correlation- node can handle several linear metrics, i.e. Pearson and Nominal (which should also do instead of tetrachoric).
I would also try the opposite, i.e. use your former variables to directly train a DT and then once it is trained, check its performance and extract the rules to compare them to the ones you set up initially for the class labeling. They should be very similar if the original rules were of type “If Variable then … else …” as stated before. This second test is an interesting exercise to learn how a DT works.
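As a rough illustration of this second test, the sketch below fits a one-level tree (a decision stump, used here as a simplified stand-in for a full DT learner) on a single former variable and prints the rule it recovers. The labeling rule, the variable name `price` and the data are all hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        # Candidate threshold: midpoint between two neighbouring observed values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical original labeling rule: class 1 if price > 35, else class 0
prices = [10, 20, 30, 40, 50, 60, 25, 45]
labels = [1 if p > 35 else 0 for p in prices]

threshold, impurity = best_split(prices, labels)
print(f"If price > {threshold} then class 1 else class 0 (impurity {impurity})")
```

The recovered threshold always lies between two neighbouring observed values, so the extracted rule should be very similar to the original one but not necessarily identical to it.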
Hope these hints are useful @badger101
Best wishes,
Ael