GC overhead limit exceeded while Tree Building

Hey guys,

I’m totally new here and a hardcore newbie with KNIME. I have a .csv with 8 columns, 7 with a lot of strings and 1 with true/false. I want to build a decision tree for this table to get some rules that classify a row as true or false.

The table has about 100,000 rows and I decided to configure my Decision Tree Learner like this:

My Workflow looks like this:
Knime1

Now here is my problem: every time I start the algorithm, I run out of heap space, although I gave KNIME 12 gigabytes of RAM. I can add another 16 GB, but I think 12 should be more than enough. Where is my mistake? Please help me! :D

Many thanks in advance!

Are you sure that this data is suitable for a decision tree? Looking at the screenshot, it seems like each row has distinct values?!

2 Likes

No, I’m absolutely not sure. Which other module would you recommend?

Let’s look at Col0. If you have n rows and n distinct string values for Col0, it makes no sense to use this column as a feature for any learner. The same holds for any other column.

The rule behind this is: if a string in columns 0–7 contains one of the following chars: " %!|’/", column 8 is true; otherwise it’s false.
In this case I know the chars, but it’s only test data, so in another case I would like to get this information from the algorithm.

Is there any way to define an algorithm that checks the chars in the strings instead of the strings themselves?

if a string in columns 0–7 contains one of the following chars: " %!|’/", column 8 is true; otherwise it’s false.
In this case I know the chars, but it’s only test data, so in another case I would like to get this information from the algorithm.

Then you could build a workflow which maps those strings to their “bag of chars”: a 1/0 encoding for each char per sample (where 1 means “any of the strings contains the character X at least once”), and train the classifier on that. With such preprocessed data, DTs (and other classifiers as well) will work fine.

The simplest way to do the encoding would probably be to use one of the several scripting nodes (Java, Python, …).
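To make the idea concrete, here is a minimal standalone Java sketch of such a 1/0 encoding (outside KNIME; the class and method names are illustrative, and the character set is simply the one from the test-data rule quoted above):

```java
import java.util.Arrays;

public class BagOfChars {
    // Illustrative special-character set, taken from the test-data rule
    static final String CHARS = " %!|'/";

    // Map a string to a 1/0 vector: one entry per special character,
    // 1 if the string contains that character at least once
    static int[] encode(String s) {
        int[] vec = new int[CHARS.length()];
        for (int i = 0; i < CHARS.length(); i++) {
            vec[i] = s.indexOf(CHARS.charAt(i)) >= 0 ? 1 : 0;
        }
        return vec;
    }

    public static void main(String[] args) {
        // contains a space and '%', none of the other special chars
        System.out.println(Arrays.toString(encode("hello %world")));
        // prints [1, 1, 0, 0, 0, 0]
    }
}
```

With this representation, each column of the vector becomes one feature for the learner, so the tree can split on “contains ‘%’” instead of on the raw string values.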

[edit] Or an unorthodox idea for a first shot at the low-hanging fruit:

Try the Palladian TextClassifier (from the Community Contributions) with a char n-gram setting of 1.

1 Like

I tried for some days to fix my problem the way you described in the first solution, but I’m curious about solution 2. Where can I find this classifier?

Thanks a lot for your hints!

Hi @wehe1017 -

The TextClassifierLearner node, which I think is what qqilihq is referring to, is available in the Palladian for KNIME extensions. You can install these by going to File --> Install KNIME Extensions… and searching for Palladian.

(@qqilihq, please correct me if I’m suggesting the wrong node!)

2 Likes

(@qqilihq, please correct me if I’m suggesting the wrong node!)

Perfectly correct! :slight_smile:

So after several tries it finally works with the Palladian TextClassifierLearner! Thank you guys :wink:
I got 92% accuracy so far, and I hope to improve it.

1 Like

Good to hear! Here’s some additional advice:

  • An accuracy of 92% might be very good or not – make sure to check it against the baseline of your data. What accuracy would you get by predicting everything as the majority class? That is your baseline.

  • In case you haven’t done so yet, have a look at ROC curves as an evaluation measure. They give a better impression of how well your model performs overall.

  • The TextClassifierPredictor node allows you to specify different scoring algorithms. You can try whether different scorers give better results.
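To illustrate the baseline point: here is a small self-contained Java sketch (class and method names are illustrative) computing the majority-class baseline accuracy for a boolean label column:

```java
import java.util.ArrayList;
import java.util.List;

public class BaselineAccuracy {
    // Majority-class baseline: the accuracy you'd get by always
    // predicting the most frequent label in the data
    static double baseline(List<Boolean> labels) {
        long trues = labels.stream().filter(b -> b).count();
        long majority = Math.max(trues, labels.size() - trues);
        return (double) majority / labels.size();
    }

    public static void main(String[] args) {
        // e.g. 80 false rows and 20 true rows:
        // always predicting "false" already scores 0.8
        List<Boolean> labels = new ArrayList<>();
        for (int i = 0; i < 80; i++) labels.add(false);
        for (int i = 0; i < 20; i++) labels.add(true);
        System.out.println(baseline(labels)); // prints 0.8
    }
}
```

So if, say, 92% of the rows were already “false”, a 92% accuracy would be no better than guessing the majority class every time.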

1 Like

Hi guys. I tried several settings but could not improve my score.

@qqilihq you mentioned the “bag of words” method. I guess it goes along the lines of a one-hot encoding. I would like to try this. First I tried it with the One to Many node, but it only accepts nominal values. Is there an example you can give me for this kind of method with a scripting node?

I also don’t know how to create new columns with a scripting node.

As you’re trying to classify based on the characters in your strings, you’d need the mentioned “bag of characters” vector. You can build that e.g. with a Java Snippet node using the following code (this just covers lowercase a–z as an example, but you can easily adapt it to your needs).

Create an output cell that holds a Boolean array and use the following code:

// one flag per letter a-z
out_characterVector = new Boolean[26];

// set the flag if the input string contains the character
for (char ch = 'a'; ch <= 'z'; ch++) {
	out_characterVector[ch - 'a'] = c_string.contains("" + ch);
}

This node will then produce a 26-element Boolean array that denotes whether a given character occurs. To feed it to e.g. a DT learner, you’ll most likely need to split the collection column into individual columns. You can achieve this using a Split Collection Column node.

HTH,
Philipp

PS: There is also room for improvement here – instead of a pure binary vector, you could also create count vectors.
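A count-vector variant could look roughly like this as a standalone sketch (names are illustrative; in the Java Snippet, `c_string` would be the input column and the returned array your output cell):

```java
import java.util.Arrays;

public class CharCounts {
    // Count-vector variant: per-character occurrence counts
    // instead of 0/1 flags, again just covering lowercase a-z
    static int[] counts(String s) {
        int[] c = new int[26];
        for (char ch : s.toCharArray()) {
            if (ch >= 'a' && ch <= 'z') {
                c[ch - 'a']++;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[] v = counts("banana");
        // 'a' occurs 3 times, 'b' once, 'n' twice
        System.out.println(v[0] + " " + v[1] + " " + v[13]); // prints 3 1 2
    }
}
```

The downstream handling stays the same: split the collection column into individual columns before feeding it to the learner.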

2 Likes
