Decision Tree Query

Stuck1 · March 12, 2019, 12:29pm

Hi,

I was building a simple workflow around a decision tree to demonstrate a concept. Basically I have two fields: 1. Forenames and 2. Gender. I wanted to run the decision tree to demonstrate that you can predict gender based on someone’s name (if it’s in the training data!). It runs fine on a small amount of data, e.g. 100 names and a few thousand records. However, if I increase the amount of data (the partition) that I feed into the learner node, the Decision Tree Learner doesn’t split at all, so the only prediction is based on the top level, i.e. it predicts all male, as that’s 51% of the sample. I can’t figure out why this is, so any pointers would be great.

I realize that this prediction method has major flaws, but that’s kind of what I want to demonstrate.

mlauber71 · March 12, 2019, 10:22pm

From your description I do not think this can work. A name is mostly thought to be either male or female (leaving aside Sascha, a few others, and some debate about gender and sex). So what would there be to learn except a dictionary?

The only exception might be if you could derive basic forms, syllables, or characters at the end that might signal male or female (at least in English or French or so), e.g. names that end in -a are considered female xyz% of the time.

So I think this is not the right example to demonstrate what a Decision Tree does.
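
The suffix idea above could be sketched roughly as follows. This is only an illustration in Python, not part of any KNIME workflow: the sample names, the `last_letter` helper, and the `suffix_stats` function are all hypothetical, and the tiny data set is invented purely to show the mechanics of turning a name into a low-cardinality feature.

```python
from collections import Counter, defaultdict

# Hypothetical toy data, invented for illustration only.
names = [
    ("Anna", "F"), ("Maria", "F"), ("Julia", "F"),
    ("Jack", "M"), ("Peter", "M"), ("Luca", "M"),
]

def last_letter(name):
    """Return the lower-cased final character of a name."""
    return name.strip().lower()[-1]

def suffix_stats(records):
    """Map each final letter to the share of each gender seen with it."""
    counts = defaultdict(Counter)
    for name, gender in records:
        counts[last_letter(name)][gender] += 1
    return {
        suffix: {g: n / sum(c.values()) for g, n in c.items()}
        for suffix, c in counts.items()
    }

stats = suffix_stats(names)
```

A feature like the final letter has at most a few dozen distinct values, so a tree can split on it and generalise to names it has never seen, which a plain name-to-gender lookup cannot do.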

Stuck1 · March 13, 2019, 7:13am

Thanks for the reply. I realise that this method is flawed; that’s what I want to demonstrate. However, it does work: if I just have 2 names (Jack and Jill), it will be able to predict that someone new called Jill is most likely female. The decision tree branches with a small number of names. However, when there are a lot more names in the data there are no splits at all, despite the config being unchanged. Why?

A linguistic analysis of names is the next example: as you say, letter correlations, etc.

AlexanderFillbrunn · March 13, 2019, 11:53am

Hi Stuck1,
maybe there are too many names and therefore no domain information is generated? The default threshold for that in KNIME is 60 possible values, I think. If there are more in a single column, no domain information is attached. What happens if you uncheck “Skip nominal columns without domain information” in the Decision Tree Learner’s configuration dialog?
Kind regards
Alexander
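
To make the explanation above concrete, here is a rough Python simulation of the behaviour being described. It is not KNIME's actual implementation: the function names are hypothetical, and the limit of 60 is simply the figure quoted in this thread, which has not been verified here. The sketch assumes that a nominal column with more distinct values than the limit gets no domain, and that a learner told to skip such columns is then left with nothing to split on.

```python
from collections import Counter

# Limit quoted in the thread ("60 possible values, I think"); unverified assumption.
MAX_DOMAIN_VALUES = 60

def compute_domain(values, max_values=MAX_DOMAIN_VALUES):
    """Return the set of distinct values, or None if there are too many."""
    distinct = set(values)
    return distinct if len(distinct) <= max_values else None

def usable_columns(table, skip_without_domain=True):
    """List the columns a learner could split on under the assumed rule."""
    usable = []
    for column, values in table.items():
        if skip_without_domain and compute_domain(values) is None:
            continue  # column is ignored, so no split can use it
        usable.append(column)
    return usable

def majority_class(labels):
    """With no usable columns, the tree is a single leaf: the majority class."""
    return Counter(labels).most_common(1)[0][0]

small = {"forename": [f"name{i}" for i in range(50)]}
large = {"forename": [f"name{i}" for i in range(100)]}
labels = ["M"] * 51 + ["F"] * 49

print(usable_columns(small))                             # column is kept
print(usable_columns(large))                             # column is skipped
print(usable_columns(large, skip_without_domain=False))  # column is kept again
print(majority_class(labels))
```

Under these assumptions the large table leaves the learner with no candidate columns, which matches the symptom reported earlier in the thread: a tree with no splits that predicts the 51% majority class for every row.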

Stuck1 · March 13, 2019, 1:01pm

Thanks, that seems to have fixed it. I was unaware of this threshold; it might even be lower than 60. That’s good to know. Thanks.

AlexanderFillbrunn · March 13, 2019, 1:13pm

Hi,
glad I could help. Instead of using this option, you can also force KNIME to create larger domain information using the Domain Calculator node.
Kind regards
Alexander