Strange problem with row filter

jp · August 11, 2008, 5:33pm

The concept in short :
I have a MySQL database with - among others - 13 columns containing coded information that I have to select for. The codes are 1 letter + 3 digits, like this : A445, I125, X589 and so on.
I want to select any rows where any of the 13 columns match a given code, i.e code A256 might be in either or perhaps several of the 13 columns.
Obviously I could do this one column at a time by using a row filter. If I use a row filter (selecting "Column Value Matching" and "Include rows by attribute value") on e.g column 1 I get the expected result.
To avoid having to match 13 different columns, I used the Column Combiner node to combine the 13 columns (they are all strings, obviously) into one, the idea being that I could then search the resulting column.
The problem : I get an empty data table as output !
I have written the data to a CSV file and I can see the combined column, and some of the data does match, as far as I can see.
Using the same regular expression on one of the original columns does work.

What is going on ?

/jp

unknown_user · August 11, 2008, 6:00pm

Does your regular expression contain any wildcards? If not, it might be that the expression tries to match the entire (combined) string, for instance "A445, I125, X589". That's why it probably works on a single column but not on the combination of them. Could you try to use an expression such as ".*A445.*" (without quotes, it matches the sequence of none, one or many characters (.*) then A445 and then again a set of characters.)

jp wrote:

The problem : I get an empty data table as output !

To make sure: The row filter produces en empty output table, not the column combiner?

Bernd

unknown_user · August 11, 2008, 9:22pm

I tried both with and without wildcards - to select multiple codes I wanted to use something like "I25[0-9]", but again it works with the original columns but not with the combined columns.
I have just tried it with the following expression : .*I251*. , but again it returns an empty data table.
Yes, it is the row filter that gives the empty output table - I tried writing the table to a csv file after the column combiner and it shows the extra column as expected.

To make matters even more mysterious : I have found a workaround. By using a String Replacer node I can search for e.g I25* and replace the entire string with TRUE. Using the row filter with the expression "TRUE" now works as expected. (I realize it now matches the regular expression "TRUE" against the string "TRUE" so it´s not quite the same situation as before, so the problem could lie with the scope of the regular expression, at least this doesn't rule it out). What I do not understand, however is why the String Replacer node can match the codes whereas the Row Filter node can not.

/jp

unknown_user · August 11, 2008, 9:51pm

I tried to reproduce the problem with a small data set. The table is (includes row + col header):

Key  Col1 Col2
Row1 I123 J456
Row2 A789 I123
Row3 I124 C012

If I then merge the two columns using the default settings of the column combiner and filter for, e.g. ".*I123.*" I get what one would expect (row 1+2). It also works with ".*I12[34].*" (in which case it returns all rows).

Can you try it, too? If it works, can you send us a sub-set of your data set so that I can see what's different?

I'm sure you have noticed the shortcut for viewing the output of a node (no need to write it to csv first): Just right click the node and then hit "Data Output 0: ..."

Regards
Bernd

unknown_user · August 12, 2008, 1:17am

Success !
I have just tried it with the expression .*I25[0-9].* and it works ! Thank you, this has been really helpful to me - at least now I know that it does work and the mistake was all mine. Time to read up on regular expressions again :?
I think my mistake was in getting the scope of the regular expression wrong, although I did try to test the it using the excellent regexdesigner software
(http://www.radsoftware.com.au/regexdesigner/) if anybody is interested.
Am I correct in assuming that the reason that String Replacer worked while Row Filter didn't is that String Replacer doesn't use regular expressions to match ?

Now that we know this works, this could be included in a future manual or guide as a way of matching multiple columns against one expression without doing multiple Row Filters. Or perhaps I could make the suggestion to have a dedicated node - perhaps called Multirow filter - in a future version of Knime ?

Anyway, thank you once again for your help, it is greatly appreciated
/jp :D

unknown_user · August 12, 2008, 10:26am

Glad that it works now. The different behavior of the string replacer and the row filter node are probably due to the fact that one supports non-regexp wildcards whereas the other does not. That will change in 2.0 - then also the "Row Filter" node has a checkbox, which allows you to state whether the search string is a regular expression or a simple wildcard match (with support for '*' and '.').

Regarding the multi-column row filter matching: I'll bring it up for discussion within the group. Your solution with a separate column combination as preprocessing step works in many cases but probably not all (how would you do range checking?)

Thanks
Bernd

unknown_user · August 12, 2008, 3:43pm

Thanks again for the help.
I do realize that my solution only works with strings, that is partly why I suggested a multirow filter node might be a good idea. I am sure that I am not the only one who would need such a node on occasion.

Thanks,
jp