Learning Machine Learning: Help with Decision Tree

After a few years of using KNIME, this is my first time expanding my knowledge to supervised machine learning models. This week, I picked the Decision Tree to begin my journey with. I’d appreciate it if you guys could help with my “Mr. Accuracy” and “Mrs. Cohen”. It’s the weekend now - I don’t expect a lot of people to be online to help, so I’m appreciative of any help at all from anybody who’s viewing this thread.

Here’s what my data looks like. It’s just simple data, merely something to work with for the beginning of my journey:

The first column is the assigned class (dependent variable), while the rest are the available terms in each document (1=available, 0=unavailable). The goal is to predict the assignment of new documents to either one of the two classes based on the term space.

My accuracy can’t get above 0.6, and Cohen’s kappa can’t reach 0.2. Here’s the abridged version of the workflow showing the different attempts I made to build the model:

DT 1.knwf (2.1 MB)

One of the attempts I made in the workflow to try to improve Cohen’s kappa was to use SMOTE (reference: this article).
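In case it helps anyone reading along, my understanding of what SMOTE does, expressed outside KNIME, is roughly the following (a minimal sketch with toy data and the imbalanced-learn package, not my actual workflow):

```python
# Rough sketch of the SMOTE step with toy data (X/y are placeholders, not my real
# document-term matrix; assumes the imbalanced-learn package is installed).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy stand-in for a slightly imbalanced 2-class dataset.
X, y = make_classification(n_samples=200, n_features=20, weights=[0.7, 0.3], random_state=1)
print("class counts before:", Counter(y))

# SMOTE synthesizes new minority-class rows by interpolating between nearest neighbours.
X_res, y_res = SMOTE(random_state=1).fit_resample(X, y)
print("class counts after: ", Counter(y_res))
```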

The term space was the result of using the Document Vector node, as in one of KNIME’s example workflows. I didn’t include the preceding tasks in this abridged workflow.
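For context, my understanding is that the Document Vector output is essentially a binary document-term matrix; outside KNIME, the same structure would look roughly like this (a sketch with made-up documents, not my real forms):

```python
# Sketch of a binary document-term matrix, analogous to the Document Vector output
# (toy documents, not the real application forms).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["promotion application for a senior role",
        "application describing past experience and motivation"]

vectorizer = CountVectorizer(binary=True)   # 1 = term present, 0 = term absent
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```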

In summary, I would like to know why my accuracy and Cohen’s kappa values are so low despite the attempts I made.

Thank you in advance for your time!

Do you have any comparison? It might be the case that the features are not great indicators for predicting the class. The dataset is balanced, so I do not see any use in SMOTE. You could try a different model and see whether the results change.
br

In my opinion, most of the data you’re using doesn’t contain useful information for doing accurate classification:

  1. A lot of your columns have very low variance
  2. A lot of your columns are correlated with each other

My recommendation in general is to read up on techniques for feature selection, dimensionality reduction, and parameter optimization. The KNIME website has quite a bit of useful information and there are examples on the Hub, but you don’t have to limit yourself to KNIME-specific material. Apart from that, you might just need better data.
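If it helps, the two checks above are quick to do outside of KNIME as well. Here’s a minimal sketch assuming the 0/1 term columns sit in a pandas DataFrame called `df` (a placeholder name, with random toy data standing in for your term matrix):

```python
# Sketch of the two checks above on a toy 0/1 term matrix (class column excluded).
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(100, 6)),
                  columns=[f"term_{i}" for i in range(6)])  # toy stand-in

# 1. Low-variance columns: terms that are almost always 0 (or almost always 1) carry little signal.
variances = df.var()
print(variances.sort_values().head())

# 2. Highly correlated column pairs: near-duplicate terms add no new information.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head())
```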

@Daniel_Weikert @elsamuel Thank you both for the kind feedback. Both of you mentioned feature selection, and @elsamuel also touched on 2 more things. I just looked up these phrases on the Hub and it seems that I’ll have to do more studying. Thank you for pointing at specific things I should look into. Very helpful for someone in my position.

@Daniel_Weikert , unfortunately at this moment I don’t have any external workflow to compare to (yet). But I did upload my abridged workflow above if you haven’t looked into it in detail. What I’ll do in the coming days is download several workflows from the Hub and learn what I can, apart from watching more YouTube videos.

Hi @badger101

I will be happy to help with this too :grinning:

I agree with @elsamuel’s & @Daniel_Weikert’s comments. As @elsamuel mentioned, most probably your data does not convey enough of the right information to achieve a correct classification so far.

The best approach would be to verify, from the very beginning of the whole supervised classification process, that everything is OK, step by step, making sure that the right choices are wisely made :wink:

To start with, could you please tell us a bit about your data source and how you calculated your descriptors and classes (Suggested / Not suggested)?

Thanks & regards,
Ael

Thanks @aworker , here are some things about the data source:

I can’t disclose the real example, but what I’ll do is come up with a parable. Let’s suppose that Gwendolyn, a senior human resources manager, is assigned the task of building a model that predicts whether a candidate employee should be suggested for promotion, based on the content of their promotion application forms. Let’s say that the company keeps a record of all past forms that employees filled in whenever they wanted a promotion. And it just so happens that all the past forms for unsuccessful applications are gone, and all that’s left are the forms filled in by successful applicants. This complicates the labeling process a bit, because now all forms belong to the ‘suggested’ label. But she has to make do with what she has. So, here’s what Gwendolyn did:

  1. She came up with a forced labeling process where each form is assigned a label of either ‘suggested’ or ‘not suggested.’ To assist her in doing so, she considered several criteria outside of the content of the form. For example, she looked at the current positions of the former applicants and assigned a score according to the appropriate job hierarchy at the workplace. She then ranked these scores and decided on a cut-off point to divide the sorted scores into the appropriate class. She repeated this process with the other criteria she deemed relevant. To finalize this, she decided on rules. For example, if a form was assigned ‘Suggested’ based on, let’s say, 3 out of 4 criteria, she labeled it ‘Suggested’. The remaining forms belong to the ‘Not suggested’ class.

  2. As for the content of the form (independent variables), she prepared the term space, following the usual preprocessing steps for the English language. She filtered out certain ‘junk’ words using her own stop-word list, standardized some phrases (e.g. every ‘first’ is standardized to ‘1st’), and performed lemmatization. (The forms rarely contain spelling mistakes, so there was no need to check for errors in that area in this case.)

  3. She now intends to predict whether new applicants will obtain a successful promotion based on the content of the forms they fill in. Gwendolyn realizes that if the company had all the past forms, including the unsuccessful applications, her labeling would be a lot easier, without having to go through the forced labeling process that she did. But again, she has to make do with what she has at hand. She hopes that when her model predicts new forms as ‘Suggested’, it does so with high confidence. The ‘Not suggested’ predictions, on the other hand, even though they might be false results, at least let her play it safe.

I hope that covers it all, @aworker !

Also, I am still studying feature selection. From what I gathered after yesterday’s feedback and watching YouTube videos afterwards, I will have to pick one method of feature selection that suits my data (from among the filter, embedded and wrapper methods). I haven’t proceeded with any yet.

With regard to feature selection, I wonder which nodes to use to check what @elsamuel checked above for ‘variance’ and ‘correlation’, and whether those statistics suit my dataset according to this figure below, which I came across yesterday and which suggests that my dataset should consider Chi-Squared:

But please correct me if I’m wrong.
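From what I’ve gathered so far, the chi-squared filter itself boils down to something like this outside of KNIME (just a sketch with toy 0/1 data to check my own understanding, not my actual workflow):

```python
# Sketch of a chi-squared filter for 0/1 term features against a binary class (toy data).
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 10))   # stand-in for the term space
y = rng.integers(0, 2, size=100)         # stand-in for the class column

# Keep the 5 terms whose presence is most strongly associated with the class.
selector = SelectKBest(chi2, k=5).fit(X, y)
print("chi2 scores: ", np.round(selector.scores_, 2))
print("kept columns:", selector.get_support(indices=True))
```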

Here’s the latest update after I introduced 3 completely new features (metadata), unrelated to the term space, to the same model. Both accuracy and Cohen’s kappa increased. I still haven’t applied any feature selection technique. Hoping that when I do, the results will be even better. The clouds show some silver lining now.

Hi @badger101

A challenging and very well described problem, as usual in all your posts!

Thanks for all the details. I guess they will help the forum a lot in suggesting improvements and maybe different solutions eventually.

I understand that the original data from which you calculated the descriptors is confidential, but could you at least post the descriptors with classes so that we can play a bit with them and suggest alternative solutions?

Thanks & best regards,
Ael

Hi @badger101

Just an extra comment on this new version with its results. As you have most probably noticed, the results in terms of Accuracy & Kappa seem better than in the previous version you posted. However, this is because the current model version now has a bias towards the “Not Suggested” class, to the point that the classification of “Suggested” samples falls below 0.50 on some of the statistics, such as Recall, Sensitivity & F-measure. This should be avoided in the case of well-balanced data, even if Accuracy & Cohen’s Kappa tell you that results have improved. In general, all the estimators should be taken into account and considered carefully in the case of a well-balanced data set, with Sensitivity and Specificity expected to be of similar quality.
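To illustrate the point with dummy labels (not your actual predictions), the per-class numbers are easy to look at side by side with Accuracy and Cohen’s Kappa:

```python
# Sketch: overall accuracy/kappa can look acceptable while one class is poorly recognised
# (dummy labels, not the real predictions from the workflow).
from sklearn.metrics import accuracy_score, classification_report, cohen_kappa_score

y_true = ["Suggested"] * 10 + ["Not suggested"] * 10
y_pred = (["Suggested"] * 4 + ["Not suggested"] * 6     # true "Suggested" samples
          + ["Not suggested"] * 9 + ["Suggested"] * 1)  # true "Not suggested" samples

print("accuracy:", accuracy_score(y_true, y_pred))              # 0.65
print("kappa:   ", round(cohen_kappa_score(y_true, y_pred), 3)) # 0.3
# Recall/F1 for "Suggested" is far below that of "Not suggested".
print(classification_report(y_true, y_pred))
```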

Hope it helps.

Best
Ael

Thank you @aworker for the time you’re spending to help me out. As requested, here’s my (updated) dataset embedded in the abridged workflow:
DT 2.knwf (873.1 KB)

Here’s a snapshot of what you’ll see in the workflow:

Here are the details:

  1. The termspace columns are all real data points that I used.
  2. The metadata columns are the additional independent variables I added. I transformed them from numerical values to categorical values (but if you think I should keep them as numerical values instead, I’ll provide the numerical data. Just let me know).
  3. The class column contains my revised labeling. On reflection after what I wrote, I believe the new labels better reflect what they ‘measure’: the tendency for applicants to be successfully promoted (or suggested) versus the tendency to be less likely to be promoted.

Hope it makes sense? :rofl:

On another note, I noticed that every time I reset the Partition node (which automatically resets the downstream nodes) and re-execute the workflow, the Score stats change. Is this normal behavior?

Thanks @badger101 for sharing, and my pleasure to try to help such an active and generous forum member :wink: !

Yes, this is perfectly normal, because partitions should always be random and independent of any other earlier/later random partition. In fact, one should run the partitioning a few times (for instance 6 times) and calculate standard deviations on the statistics. This could be done independently or, in a better way, using cross-validation (the -X-Partitioner- & -X-Aggregator- nodes):
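Outside of KNIME, the same idea looks roughly like this (a sketch on a toy dataset, not your workflow):

```python
# Sketch: repeat the split several times and report mean +/- standard deviation
# instead of a single score (toy data, 6 folds as suggested above).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=6)
print("fold accuracies:", scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```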

Hope it helps.

Best
Ael

Ps: I will play a bit with your data, thanks.

Alrighty, I look forward to being enlightened when you’re done! :ok_hand:


Hi @badger101

Just to let you know that the column headers are now in the clear in your newly posted version, in case you would like to amend this…

Best
Ael

Yes, those are not confidential @aworker … they’re the real data points I’m using (post-processed). Although I might tune things up a notch with further elimination of ‘junk’ terms, e.g. eliminating all prepositions, single-digit numbers, etc.

Have you taken a look at the feature importances? It might be the case that most of the accuracy is driven by a few features (including your new ones).

@Daniel_Weikert , thanks for the info. I deleted my previous response now that I understand a bit more about feature importance after coming across this article. It seems I can check automatically using the GFI node in KNIME.

Let’s say I utilize this node in my workflow to find out what variables I should keep. Does it automatically remove the unimportant variables from my model, or do I have to remove them manually afterwards?
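For reference, my rough mental model of a tree-based importance ranking, outside of KNIME, is the sketch below (toy data, not the GFI node itself; whether the node also drops the columns for me is exactly what I’m asking):

```python
# Sketch of tree-based feature importances, with dropping of weak columns as a separate,
# manual step (toy data; not KNIME's GFI node).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = tree.feature_importances_
print("importances:", importances.round(3))

# The ranking itself does not change the model; removing columns happens afterwards.
keep = np.argsort(importances)[-5:]   # indices of the 5 most important features
X_reduced = X[:, keep]
print("reduced shape:", X_reduced.shape)
```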

Hi @badger101

After reading your explanations and playing with your workflow, I have a few questions concerning your data:

  1. You said that the “Suggested” labeled samples correspond to promoted people. Is this the case for all of them in your data, or are there among them some that “Gwendolyn” has labeled by hand following her own rules? I believe this is important to know in order to tackle the problem correctly.

  2. I guess from the column headers that they just reflect the presence or absence of the corresponding word in each application form? In your opinion, would it make sense to have the frequency of the words as data instead of the presence/absence (1/0) of the word in the text? Are any of these words significantly repeated in the same application when they appear in the text? I guess you have not considered/tried the use of n-grams as input data? In your opinion, would this make sense? (See the small sketch after this list.)

  3. Is it possible to know how you calculated your labels and “metadata” columns and why the latter may help?
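To make question 2 a bit more concrete, here is a minimal sketch of the three options (presence/absence, raw frequency, bigrams) on toy sentences rather than your forms:

```python
# Sketch of three term-space variants: presence/absence, raw frequency, and bigrams
# (toy sentences, not the real application forms).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great great motivation and experience",
        "solid experience and clear motivation"]

for name, vec in [("presence/absence", CountVectorizer(binary=True)),
                  ("term frequency  ", CountVectorizer()),
                  ("bigrams         ", CountVectorizer(ngram_range=(2, 2)))]:
    X = vec.fit_transform(docs)
    # Show the features for the first toy document only.
    print(name, dict(zip(vec.get_feature_names_out(), X.toarray()[0])))
```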

Besides all this, there is something that strikes me: if Gwendolyn generated the rules to label the “Not suggested” samples, they should be simple in principle, and a Decision Tree should be able to find them easily, which happens not to be the case. I’m intrigued by it. It would help us better understand the logic of all this if we could know the rules, just to work out whether the whole approach makes sense or not.

@badger101 thanks in advance for your replies!

Best
Ael

Thank you again @aworker , here are my responses:

  1. The original observations are such that all units of the population belong to the same single label. So, Gwendolyn decided to divide the population into two labels, following the rules she came up with.

  2. Yes, it is based on the presence or absence of the words. I can see the relevance of the idea of frequency-based variables as well, so I am flexible in terms of how we approach this issue. My only goal is to use this dataset to learn decision trees. Whichever way we approach the problem is fine with me. I value the experience you’ve accumulated, so I am all ears.

(Update: What if I say that if we calculate the relationship of term frequency with the assigned class, we have ventured outside the problem, since I am not looking to find important keywords - I am assigning documents to a class based on their properties, like their term spaces and metadata, rather than assigning keywords to a class.)

Regarding the n-grams, it’s not that I have not considered them. If I recall correctly, I wanted to leave n-grams alone at first, just to see how my model turns out. A simple analogy is that I just bought a new car, and when I first test-drove it, I didn’t care much about whether the high beams were functional, as long as the brakes and the gas pedal worked. :rofl: Outside that context, of course, I will care about the additional features of the new car.

  3. Regarding the label calculation, I’ll have to come up with another analogy to explain it. It might take me a while to come up with a good one that balances confidentiality with reality.

I’ll update it here later today!

And here’s my update:

Because you asked about the rules Gwendolyn came up with, I have had the chance to revisit them. Contemplating with a clearer mind, I realize now that there should be no labeling issue at all, because the way the two classes are assigned is natural, not forced. So please disregard my earlier analogy about certain application forms being missing, etc. Having said that, here is a much better analogy, and a better explanation of how the labeling was done:

Gwendolyn is a senior human resources manager working at a real estate company. She was tasked with predicting whether an employee applying for a promotion will become a better asset for the company upon being promoted. This particular real estate company assesses whether an employee is an asset based on the following criteria:

  1. How many municipalities are covered in the employee’s listing portfolio. = WIDTH
  2. How many listing units altogether the employee has in his/her portfolio. = DEPTH

Both WIDTH and DEPTH are the 2 facets that make up the assigned label. A cut-off point was decided on as the threshold for each facet. If the employee exceeds the threshold on both facets, then the label assigned to him/her is ‘PERFORMING ASSET’. Otherwise, the assigned label is ‘AVERAGE ASSET’.
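(In code terms, the labeling rule boils down to something like this; the threshold values here are made up for the sake of the analogy:)

```python
# Sketch of the labelling rule with made-up thresholds (the real cut-off points differ).
WIDTH_THRESHOLD = 5    # municipalities covered
DEPTH_THRESHOLD = 20   # listing units in the portfolio

def assign_label(width: int, depth: int) -> str:
    """PERFORMING ASSET only if the employee exceeds the cut-off on BOTH facets."""
    if width > WIDTH_THRESHOLD and depth > DEPTH_THRESHOLD:
        return "PERFORMING ASSET"
    return "AVERAGE ASSET"

print(assign_label(width=8, depth=35))   # PERFORMING ASSET
print(assign_label(width=8, depth=10))   # AVERAGE ASSET
```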

Predictors used (remember, this is only an analogy, so they might not appear as good predictors seen at face value):

  1. How many compliments from the clients (positive feedback) has the company received about this employee since he/she worked here? = METADATA 1
  2. How many property viewings this employee had done over the last, say, 6 months? = METADATA 2
  3. Average monthly traffic size of the employee’s listing page = METADATA 3
  4. What is the content of the application form (words describing their past experiences, their aspirations, their motivations, their personal justifications etc.) = TERMSPACE

The data available to Gwendolyn is the past application forms. Applicants for promotion, past and present, had to fill in all 4 of the predictor values when they applied for a promotion.

Each form filled in by a former successful applicant was labeled either PERFORMING ASSET or AVERAGE ASSET based on that applicant’s current WIDTH + DEPTH performance, as mentioned earlier.

None of the past forms filled in by unsuccessful applicants were used in the analysis.

If the model performs well, then it can indicate whether a new applicant is worthy of a promotion.

Also, regarding the notion of using frequency instead of presence/absence: in reality, how frequently a term appears does not matter in this particular case. So it seems there’s no need to replace the term space with term frequencies.

Overall, I hope this revised version of how I present the problem makes more sense! @aworker