WEKA: Takes forever to apply Target Column when there are many columns

beginner · February 4, 2015, 9:16am

All WEKA learner nodes take a very, very long time to apply the selected target column, especially if there are many columns in the input (several hundred or more). Very long = often longer than the training itself. The many columns come from expanded bit vectors.

I feel there is some kind of bug or issue that makes this so slow, or some pre-calculation that might be better done at a later time. I don't know, except that it is somewhat annoying. It also doesn't help that by default it always chooses the last column, even if that is a numeric column and a classifier (e.g. RandomForest) is used that does not support a numeric target column. In that case, maybe choose the only (or first) string column by default.
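
The default suggested above could look roughly like the sketch below. It only illustrates the idea and is not the actual KNIME or WEKA node code; the function name `pick_default_target` and the `(name, type)` representation of columns are made up for this example.

```python
def pick_default_target(columns, supports_numeric_target):
    """Pick a default target column from a list of (name, type) pairs.

    Hypothetical helper: `type` is either "string" or "numeric".
    """
    if not columns:
        return None
    last_name, last_type = columns[-1]
    # Keep the current behaviour (last column) when the learner can handle it.
    if supports_numeric_target or last_type == "string":
        return last_name
    # Otherwise prefer the first string column, if there is one.
    for name, col_type in columns:
        if col_type == "string":
            return name
    # No string column at all: fall back to the last column.
    return last_name


columns = [("bit0", "numeric"), ("activity", "string"), ("bit1", "numeric")]
print(pick_default_target(columns, supports_numeric_target=False))  # activity
```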

Thanks for your support.

EDIT:

A second, maybe related issue is that the Column Filter node also takes very long to load with many columns. Moving columns from include to exclude or vice versa also takes rather long. It's still only about 1000 columns, so that should be very fast.

Aaron_Hart · February 19, 2015, 12:32pm

Hello,

I had a quick look using KNIME 2.11 + weka 3.7 and 3.6 and was unable to reproduce the problem. Is there any chance you can either upload an example workflow or tell us more about your operating environment?

Thanks in advance,

Aaron

beginner · February 20, 2015, 2:06pm

See attached.

The Column Filter takes 13 sec. to load the configuration dialog. That's somewhat OK. It takes another 7 sec. to remove all columns. Adding all is instantaneous. I don't see why removing all columns has such a huge penalty. Anyway, one can work with this; it's just mildly annoying, especially because I don't see why it should take so much time. Note that if the fingerprint is changed to 1024 bits (from 2048), the issue does not happen! Configure loads fast.

However, sometimes it actually works fine. Memory issue?

Then in WEKA, after changing the target column, it takes very, very long to apply the change while CPU usage sits at 25% (e.g. one core on a 2-core, 4-thread CPU). I'm talking like 10 minutes with 2048 columns. This is always slow.

Windows 7 32-Bit (maybe 32 Bit is the issue?)

weka_test.zip

nivcoh · May 17, 2015, 4:12pm

Hi,
Got the same issue. I've got a large set of data with free-text comments that I'm trying to cluster.
I've got about 34000 columns and I waited about 13 hours for the target column selection of the Random Forest (3.7) or the CVParameterSelection to apply before I gave up. (Using KNIME 2.11.3)

I also tried setting -Xmx8192m in the knime.ini file... no luck.

Please advise...

carpa_jo · October 7, 2015, 2:07pm

Hey there,

Got the same issue. Don't know if any of you is still looking for a solution but this worked for me:

Use the Column Resorter right before your WEKA-Node and make sure that your target column is the last one in the Column Resorter-Node, because the WEKA-Nodes seem to use the last column as target column by default. After this there should be no need to change the target column in the WEKA-Node. I know it's not a real solution but at least a workaround...
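
In plain Python terms, what the Column Resorter does in this workaround is roughly the following. This is a sketch only; `move_target_last` is a made-up helper name and not part of KNIME or WEKA.

```python
def move_target_last(columns, target):
    """Return a new column order with `target` moved to the end."""
    if target not in columns:
        raise ValueError(f"unknown column: {target!r}")
    # Keep all other columns in their original order, then append the target.
    return [c for c in columns if c != target] + [target]


columns = ["bit0", "activity", "bit1", "bit2"]
print(move_target_last(columns, "activity"))
# ['bit0', 'bit1', 'bit2', 'activity']
```

With the target already in the last position, the learner's default pick matches the intended column, so the slow "apply target column" step in the dialog is never triggered.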

beginner · October 8, 2015, 8:51am

Thanks, carpa. Very clever trick. I've since moved on to R and Python (scikit-learn). They are just more stable and performant.