LIBSVM node probability columns don't correspond to probability values

stoeter · September 27, 2016, 5:48pm

Hi,

It seems that the column names for the probability values do not match the values.

I trained the classifier on 5 classes (4 active but distinct, 1 inactive). The prediction is run on the complete dataset, which additionally contains 3 more classes (2 inactive and my unknown samples). I get very good results (~95% accuracy); however, the column names for the probability values appear to be mixed up (see below). The workflow is: first I filter the 5 classes, then I recalculate the domain values to avoid getting additional columns for classes without values, then I use the Learner, then the Predictor.

Explanation:

column 1: my target column (8 classes)

column 2-6: probability columns (from learning 5 classes)

last column: prediction

Judging by the data, the names of the probability columns should be in the order cARHGAP11Apool, cMock, cGFP_AM4626, cEct2pool and cEg5.

I am using KNIME 3.1.2 on Windows 7.

Martin

PS: I just noticed that running LIBSVM in a loop shows that the probability values are scattered more or less randomly among the probability columns. Nevertheless, the classification result is always the same and always correct (see second .png). This doesn't happen with, e.g., the Random Forest node (see third .png).
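
One way to quantify the mismatch described above is to export the predictor output and check, row by row, whether the probability column with the largest value carries the same class name as the prediction. The following is only a minimal sketch in plain Python: the column names (`P_cMock`, `P_cEg5`, `Prediction`) and the numbers are made up for illustration and are not the names or values the KNIME node produces. It also assumes that the predicted class is the one with the highest probability.

```python
def mismatched_rows(rows, prob_columns, prediction_column):
    """Return indices of rows whose highest-probability column
    disagrees with the predicted class.

    rows              -- list of dicts, one per table row
    prob_columns      -- dict mapping class label -> probability column name
    prediction_column -- name of the column holding the predicted class
    """
    bad = []
    for i, row in enumerate(rows):
        # Class label whose probability column holds the largest value.
        best = max(prob_columns, key=lambda label: row[prob_columns[label]])
        if best != row[prediction_column]:
            bad.append(i)
    return bad


# Hypothetical two-class example; values are invented for illustration.
rows = [
    {"P_cMock": 0.90, "P_cEg5": 0.10, "Prediction": "cMock"},
    {"P_cMock": 0.20, "P_cEg5": 0.80, "Prediction": "cMock"},  # columns look swapped
]
prob_columns = {"cMock": "P_cMock", "cEg5": "P_cEg5"}

print(mismatched_rows(rows, prob_columns, "Prediction"))
```

If the column names are correct, the returned list should be empty or nearly so. A large share of mismatching rows alongside a correct prediction column would point to mislabeled probability columns rather than to a bad classifier.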

wiswedel · October 13, 2016, 1:57pm

Hi Martin,

Weird! We'll look into it.

Thanks,
Bernd

hornm · October 19, 2016, 12:22pm

Hi Martin,

I am struggling to reproduce the problem. I also tested whether LIBSVMPredictor-node instances interfere with each other when run in parallel, and everything works as expected. Could you provide a minimal example workflow where the problem occurs? That would be fantastic!

Best, Martin

stoeter · October 20, 2016, 8:44pm

Hi Martin,

I looked at it again and can still reproduce this.

Here is a workflow with the data I used. It was too big to upload, so please download it from here (17MB):

https://cloud.mpi-cbg.de/index.php/s/qPDDug3VSAP7CVx

Just execute it and look at the last Sorter nodes. The samples Eg5, GFP, and Ect2 are supposed to be correctly classified for ~95% of the data points.

Best,

Martin

hornm · October 26, 2016, 5:23pm

Thanks, Martin, for the example workflow! We found the problem and it will be fixed with the next release.