Weka with CostSensitive into KNIME

marta · October 5, 2012, 1:03pm

Dear users,

I have used WEKA to build a model and then tried to implement this model in KNIME (see attached file), but without success.

I used WEKA as follows:

2D MOE descriptors of interest were calculated for the initial database (1204 cpds) using MOE2010.
Calculated descriptors were Z-normalized using an in-house SVL script.
The resulting database was split into training (TR) and test (TS) sets using an in-house procedure. TR and TS files were saved as .csv.
TR.csv was used as input in WEKA 3.6.5 to select the most relevant features. To do this, I used the CfsSubsetEval + BestFirst option. A total of 11 descriptors were selected. The resulting file, containing these 11 selected descriptors, was saved as TR.arff.
TS.csv was opened in WEKA. The 11 previously selected descriptors were chosen manually. The resulting file was saved as TS.arff.
TR.arff and TS.arff were then loaded into WEKA again to build the model. To this end:

Preprocess -> Open file -> TR.arff

Classify -> Classifier -> Choose -> meta -> CostSensitiveClassifier with

Classifier -> Random Forest

costMatrix -> 2 x 2 cost matrix (cost FN = 122; cost FP = 1.7)
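As background on what such a cost matrix does: Weka's CostSensitiveClassifier can either reweight the training instances according to the costs or predict the class with the minimum expected cost. The post does not say which mode was used, so the following is only a minimal sketch of the second mode, using the two costs given above. The class labels `active`/`inactive` and the function names are assumptions made for illustration.

```python
# Costs taken from the post; everything else here is illustrative.
COST_FN = 122.0  # cost of predicting "inactive" for a truly active compound
COST_FP = 1.7    # cost of predicting "active" for a truly inactive compound


def min_expected_cost_class(p_active):
    """Return the label with the lower expected misclassification cost.

    p_active is the base classifier's estimated probability that the
    compound is active (hypothetical input, e.g. from a random forest).
    """
    cost_if_predict_inactive = p_active * COST_FN
    cost_if_predict_active = (1.0 - p_active) * COST_FP
    if cost_if_predict_active <= cost_if_predict_inactive:
        return "active"
    return "inactive"


def decision_threshold():
    """Probability above which predicting "active" is the cheaper choice."""
    return COST_FP / (COST_FP + COST_FN)
```

With these costs the threshold is about 0.014, so under this rule almost any non-negligible probability of activity leads to an "active" prediction. If the KNIME workflow applies the costs differently (or not at all), the confusion matrices would be expected to differ.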

The confusion matrix obtained using this protocol in WEKA was:

TN=167 TP=19 FN=1 FP=53

Using the KNIME workflow shown in the attached document, I obtained:

TN=189 TP=12 FN=8 FP=31
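To make the two results easier to compare, the sketch below derives a few summary numbers from the confusion matrices quoted above. The counts and the two costs come from the post; the function name and the choice of metrics are assumptions.

```python
def summarize(tn, tp, fn, fp, cost_fn=122.0, cost_fp=1.7):
    """Summarize a 2 x 2 confusion matrix with the costs from the post."""
    return {
        "n": tn + tp + fn + fp,              # number of test compounds
        "positives": tp + fn,                # truly active compounds
        "sensitivity": tp / (tp + fn),       # fraction of actives recovered
        "specificity": tn / (tn + fp),       # fraction of inactives recovered
        "total_cost": fn * cost_fn + fp * cost_fp,
    }


weka = summarize(tn=167, tp=19, fn=1, fp=53)
knime = summarize(tn=189, tp=12, fn=8, fp=31)

for name, result in (("WEKA", weka), ("KNIME", knime)):
    print(name, result)
```

Both matrices sum to 240 compounds with 20 actives, so the two runs appear to score the same test set. The KNIME run misses 8 actives instead of 1 while producing fewer false positives, which is the pattern one would expect if the false-negative cost had less influence there.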

Does anyone know what is happening? Any idea about possible problems?

Thank you in advance for your time and consideration!

marta

knime_doubt.jpg

gabriel · October 5, 2012, 2:33pm

Hi marta,

Thanks for the detailed description. Just to confirm: are you using KNIME 2.6, which comes with a fresh Weka 3.6 integration? If so, please check out the new nodes. I assume you use the same parameters as when you run this classifier in Weka directly. We also need to make sure the attribute types are the same in KNIME and Weka. KNIME handles certain values as missing, depending on the Reader node; usually '?' indicates a missing value. Please check the datasets provided by KNIME. It might be worth writing out the data from KNIME (ARFF format) and then running the Weka classifier on this data outside of KNIME. Let me know if you see any difference.

Best, Thomas

marta · October 8, 2012, 3:59pm

Hi Thomas,

First of all, thank you for your answer. I am a little bit stressed because I have tried a lot of things but none of them seems to work properly... And we have to send the KNIME workflows on Wednesday!

I compared the output of the first two nodes (2D MOE descriptor calculation and Z-normalization) with those obtained using MOE and the in-house programs directly. No differences were observed.

Then, I saved the training (TR) and test (TS) sets after their normalization (see the workflow attached previously) in ARFF format to use them directly in WEKA. TR.arff was read correctly, and the same set of descriptors was selected as when using WEKA directly.

The problem came with the TS. After manually selecting the same set of descriptors as those selected for the TR, it was not possible to run any prediction. The only difference I noticed when I checked the TR and TS is that the descriptors are not listed in the same order. For the TR, activity is placed in the last position, while for the TS it is placed in the first position. Then, when I tried to run the model, I got the error message "TS and TR set are not compatible".
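Weka generally expects the training and test sets to have the same attributes in the same order, so a class column that sits last in one file and first in the other is a plausible cause of this error. The following is a minimal sketch of reordering the test columns to match the training header before saving. The column names `desc_a` and `desc_b` and the function name are hypothetical; only `activity` is taken from the post.

```python
def align_to_training(train_header, test_header, test_rows):
    """Reorder test columns so they follow the training header order.

    Raises ValueError if the two headers do not contain the same columns,
    since reordering alone cannot fix that.
    """
    missing = [c for c in train_header if c not in test_header]
    extra = [c for c in test_header if c not in train_header]
    if missing or extra:
        raise ValueError(f"missing in test: {missing}, extra in test: {extra}")
    order = [test_header.index(c) for c in train_header]
    return [[row[i] for i in order] for row in test_rows]


# Illustrative data: class column last in TR, first in TS.
train_header = ["desc_a", "desc_b", "activity"]
test_header = ["activity", "desc_a", "desc_b"]
test_rows = [["active", "0.12", "-1.30"],
             ["inactive", "1.05", "0.44"]]

aligned = align_to_training(train_header, test_header, test_rows)
```

After this step each row in `aligned` lists the descriptors first and `activity` last, matching the training layout.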

Do you have any other idea? I would really appreciate it.

In any case, thank you in advance for your time and consideration.

With my best regards,

marta

gabriel · October 9, 2012, 10:23am

Hi marta,

I guess the problem is again in the domain values, which seem to be different. Sorry for bothering you, but can you please check the domain values in the out-port view of both tables (tab "TableSpec")? If the two lists are different, which they sometimes are, you need to re-calculate the domain and make sure the Learner and the Predictor see the same set of values. A workaround for this is proposed here. Hope this helps?
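For context, the "domain" of a nominal column is the list of distinct values it can take (for numeric columns it is the value range). A minimal sketch of the check described above, done outside KNIME on two tables held as lists of dictionaries, could look like this. The function name and the sample values are assumptions for illustration.

```python
def domain_differences(train_rows, test_rows, nominal_columns):
    """Report nominal columns whose sets of values differ between tables."""
    diffs = {}
    for col in nominal_columns:
        train_values = {row[col] for row in train_rows}
        test_values = {row[col] for row in test_rows}
        if train_values != test_values:
            diffs[col] = {
                "only_in_train": sorted(train_values - test_values),
                "only_in_test": sorted(test_values - train_values),
            }
    return diffs


# Illustrative data: the test table uses a label the learner never saw.
train_rows = [{"activity": "active"}, {"activity": "inactive"}]
test_rows = [{"activity": "active"}, {"activity": "Inactive"}]

report = domain_differences(train_rows, test_rows, ["activity"])
```

An empty result means both tables expose the same nominal values. Any entry points to a column where the Learner and the Predictor would see different value lists.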

Best regards, Thomas

InsilicoConsulting · October 9, 2012, 11:32am

Hi Marta,

1. The problem might be different domain values, which can be easily checked by removing both normalization nodes.

2. I normally use the reference column filter with check for "Ensure compatibility of column type" enabled. Join the training set descriptor table as the reference table and the test descriptor table as the input.

3. Sort class column of the training descriptor table using column resorter node prior to step 2.

InsilicoConsulting · October 9, 2012, 1:45pm

Point 3 means that the class column should be the last one in the table, not somewhere in the middle

marta · October 9, 2012, 5:15pm

Hi Thomas, Hi InsilicoConsulting,

Thank you again for your answers and sorry for so much trouble, but I have tried what you suggested and it did not work....

However, I would like to understand what you mean by "domain values". Are they the values associated with the descriptors? If so, what would the problem be?

I do not know what I can do. The result is always the same, and different from the one obtained using WEKA directly. Any suggestion you have will be welcome! I need to solve the problem as soon as possible.

Have a nice day and thank you again for your help,

marta

InsilicoConsulting · October 9, 2012, 5:45pm

Hmm,

A last try, why not remove missing values from the test matrix for all columns? Hope it works out

marta · October 10, 2012, 9:58am

Hmmmmmmmmmm,

what do you mean...?

InsilicoConsulting · October 10, 2012, 10:59am

Are there any descriptors that have missing values? Many Weka models mess up if values are missing.

Use the Missing Value node to remove all rows where any of the descriptor/Y axis values are missing/null, in both the training set and the testing set.
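The same filtering can be sketched outside KNIME as follows. This is only an illustration: treating the empty string and '?' as missing markers is an assumption (the '?' convention is the one mentioned earlier in this thread), and the function name is hypothetical.

```python
# Assumed markers for a missing cell; adjust to the actual export format.
MISSING_MARKERS = {"", "?"}


def drop_rows_with_missing(rows):
    """Keep only rows in which no cell is None or a missing marker."""
    def is_missing(value):
        return value is None or str(value).strip() in MISSING_MARKERS

    return [row for row in rows if not any(is_missing(v) for v in row)]


# Illustrative rows: descriptors followed by the class value.
rows = [["0.12", "-1.30", "active"],
        ["?", "0.44", "inactive"],
        ["1.05", None, "inactive"],
        ["0.70", "0.10", "active"]]

clean = drop_rows_with_missing(rows)
```

Applying the same function to both the training and the test table keeps the two consistent, which is the point of the suggestion above.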

marta · October 10, 2012, 12:34pm

Hi again InsilicoConsulting,

Thank you for your fast answer. I have a column with null values. I have filtered it in both TR and TS before applying the AttributeSelectedClassifier. But it still does not work... :(

Thank you a lot for your help. In any case, I am learning KNIME....

Have a nice day,

marta

InsilicoConsulting · October 10, 2012, 12:52pm

Perhaps the AttributeSelectedClassifier is eliminating more attributes using CfsSubsetEval, so the test set again has a different set of columns when going into the Weka prediction node?

Does changing the model work?

anyways my last post on this topic. i promise:-)

marta · October 10, 2012, 1:01pm

No, InsilicoConsulting! Feel free to make any suggestion you want! I am very thankful!

I already tested it, but it seems that the training and test sets do have the same number of columns prior to the AttributeSelectedClassifier (for the TR) and the WekaPredictor (for the TS) nodes.

Any other suggestion? :)

Have a nice day,

marta