Problem with SVMlight-style data with the File Reader

vince · October 14, 2009, 3:13pm

Hello, I have a file with data in SVMlight format: each feature value is preceded by a running index and a “:” sign, and the numerical label is at the start of the row (I have 2 classes, so the labels are “+1” and “-1”). It looks like this:

+1 1:0.34541 2:0.36456 3:0.53145 4:0 5:0.1784 6:0.0043
-1 1:0.21221 2:0.00434 3:0.54234 4:0.89894 5:0.12375 6:0.94756
+1 1:0 2:0 3:0.56572 4:0.74812 5:0.12764 6:0
+1 1:0.34581 2:0.98374 3:0.00456 4:0.03451 5:0.03211 6:0.56822
-1 1:0.33521 2:0.12125 3:0.67211 4:0.44699 5:0.83645 6:0.15363

etc.
If I import the file with the File Reader I still get the running index with the “:”, so that for example “1:0.34541”, “2:0.36456”, etc. are read as strings.
How can I convert them into useful data directly in KNIME, and get rid of the “1:”, “2:”, “3:”, etc.?

Peter · October 14, 2009, 4:24pm

The Cell Splitter node or the Java Snippet node could help you.
Unfortunately they both operate only on one column - so you would need to insert 6 nodes for your input file example.
With the Java Snippet node you could replace the column with a new column of type Double and the code would probably look like this:
String c = $Col6$; String val = c.substring(c.indexOf(':') + 1); return Double.parseDouble(val);
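
For anyone who wants to try the same logic outside KNIME first, here is a minimal standalone sketch. The class name `FeatureCell` and the method `parseFeature` are made up for illustration; inside the Java Snippet node the input would come from `$Col6$` instead of a method parameter.

```java
public class FeatureCell {

    // Hypothetical helper mirroring the snippet body: drop everything up to
    // and including the ':' and parse the rest as a double.
    static double parseFeature(String cell) {
        // indexOf returns -1 when there is no ':', so substring(0) keeps the whole string.
        String val = cell.substring(cell.indexOf(':') + 1);
        return Double.parseDouble(val);
    }

    public static void main(String[] args) {
        System.out.println(parseFeature("1:0.34541")); // prints 0.34541
        System.out.println(parseFeature("4:0"));       // prints 0.0
    }
}
```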

vince · October 14, 2009, 4:39pm

Thank you, it works
But the result just has 3 decimals, while the original is way longer, can it be fixed? (I’m not a Java expert)

And another question: I forgot to mention that I have 62 features plus the class column (the 6 features were just for example), so putting 62 nodes is not practicable. Do you have any advice on how to iterate on all the columns?

thor · October 14, 2009, 5:34pm

Ugly, but may work: You could read each line of your file into one string cell, use the Java Snippet node to get rid of the 1:, 2:, etc.:
return $Col1$.replaceAll(" \\d+:", " ");
and then use the Cell Splitter to get the columns.
The digits of the numbers are not lost; the renderer in the table view only displays three decimal places by default. This can be changed by right-clicking on the column header and changing the renderer to “full precision”.
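
The whole-line approach can be checked in plain Java with a rough sketch like the one below. The names `SvmLightLine` and `parseLine` are invented for the example, and the split on whitespace only stands in for what the Cell Splitter would do in the workflow. Note that the backslash has to be doubled inside a Java string literal.

```java
import java.util.Arrays;

public class SvmLightLine {

    // Hypothetical helper: strip the "<index>:" prefixes from one SVMlight-style
    // line, then split on whitespace and parse every token as a double.
    static double[] parseLine(String line) {
        // " \\d+:" matches a space, one or more digits and a colon, e.g. " 3:".
        String cleaned = line.trim().replaceAll(" \\d+:", " ");
        String[] tokens = cleaned.split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // Double.parseDouble accepts a leading sign, so "+1" and "-1" are fine.
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        double[] row = parseLine("+1 1:0 2:0 3:0.56572 4:0.74812 5:0.12764 6:0");
        // The first element is the class label, the rest are the features.
        System.out.println(Arrays.toString(row));
        // prints [1.0, 0.0, 0.0, 0.56572, 0.74812, 0.12764, 0.0]
    }
}
```

The printed values keep all their digits, which matches the point about the renderer: the three decimals are only a display setting.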

vince · October 15, 2009, 4:19pm

It works like this:
return $Col1$.replaceAll("[0-9]+:", " ");
Thank you.

Ok, so it’s just a visualization style. Ok, thanks again.

Anyway, do I have to put 62 instructions like this in the Java Snippet? Is there a method for cycling across all the columns?

vince · October 15, 2009, 4:22pm

No, putting more than one instruction doesn’t work…

wiswedel · November 1, 2009, 8:41am

As Thorsten said two comments above: You could try to convince the File Reader to read the entire line as a single cell, process this entire string with some regular expression and later on split up the elements into individual columns.

Alternatively, you can combine all your “5:0.12764” cells into a collection cell (there is a node “Create Collection Column”), which can then be processed with the Java Snippet node. The content of this cell is available as an array to the Java Snippet node. As for the return type, you would use “Double”, check the “Array Return” field down below, and later on split up the result collection using a “Split Collection Column” node.

Makes sense?
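
The array processing in the collection approach could look roughly like the standalone sketch below. The names `FeatureArray` and `parseAll` are hypothetical, and the `String[]` parameter is only an assumption about how the collection content arrives; how the snippet node actually exposes the collection column is not shown here.

```java
import java.util.Arrays;

public class FeatureArray {

    // Hypothetical helper: convert an array of "index:value" strings
    // into an array of Double values, one per input cell.
    static Double[] parseAll(String[] cells) {
        Double[] out = new Double[cells.length];
        for (int i = 0; i < cells.length; i++) {
            String c = cells[i];
            // Keep only the part after the ':' and parse it.
            out[i] = Double.valueOf(c.substring(c.indexOf(':') + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] cells = {"1:0.34541", "2:0.36456", "5:0.12764", "6:0"};
        System.out.println(Arrays.toString(parseAll(cells)));
    }
}
```

One loop handles all 62 features, so no per-column node is needed.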