Filtering string on "begins with" alpha or numeric character

spett · September 7, 2022, 3:28pm

Newbie question on my first workflow! I’ve spent hours searching the forum and other online resources but to no avail!
I have a data column that is a string with items of varying lengths, some of which begin with numerical characters (which I want to keep) and some alpha characters (which I don’t) - see example below:

How do I do this? I’ve tried a Row Filter node with some regex, which I’m not that familiar with so am failing to get the syntax or the node settings right:

I’ve also tried with the Rule Based Filter node using the following (to select only records which start with a number) - $Activity code$ MATCHES “.([0-9]).” => TRUE but this returns an empty table!

Is there an easy way to do this or do I just need to learn regex (not ideal as the workflow needs to be shared with other non-IT types, so I want to keep it reasonably simple, even if more long-winded!)

*Bonus query - *
I then need to split the remaining number first cells into two using the ‘-’ delimiter, but some strings have a second ‘-’ delimiter, for instance: “12878-Inter-related mechanisms”. I want to split into “12878” and “Inter-related mechanisms”. I’m using a String Manipulation node to find the first delimiter location and split a substring as follows:
substr($Activity code$,indexOfChars($Activity code$,“-”)) which does seem to work but is there a better way?

Thanks in advance!!

ArjenEX · September 7, 2022, 3:43pm

Hi @spett

Welcome to the KNIME community!

A Regex that is able to accomondate this is ^[\d]+.* , which is somewhat to what you have but yours is checking for a digit in any position so it needs some boundaries to specify it should be at the beginning of the string (othewise it also finds Reasons to be cheerful 1,2,3).

You can see the Regex in action here with a detailed explanation on the right hand side:

Tip: try to use such a Regex engines like first to construct the proper code. You’ll see it change live and saves you a lot of time fooling around in KNIME. Be mindful that special characters need to be escaped in KNIME.

If you apply this in a regular Row Filter node and set the regex checkbox, KNIME filters the corresponding rows.

Input:

Output:

Bonus question:

An alternative is to use a Cell Splitter and set the array size to 2. This will split on the first dash and everything that follows after that will be part of the second element, including subsequent dashes.

Hope this helps!

izaychik63 · September 7, 2022, 4:36pm

For Regex solution you also may try this site
https://www.autoregex.xyz/

spett · September 8, 2022, 3:04pm

Thank you both for the feedback! I will delve into the regex stuff - seems to be pretty essential for getting the most out KNIME. I’m starting to love this tool already - nerdy data lego!!

bruno29a · September 8, 2022, 6:38pm

Hi @spett and welcome to the Knime Community.

This is a very good example for using Regex, but it definitely can be done without Regex, especially if you have trouble using Regex.

Of course, the best way to overcome this shortcoming is with help from the forum, and to use tools/sites such as those suggested in this thread.

Here’s how to do this without Regex. The trick is to evaluate if the first character is an int, which is basically what you’ve asked for, and you can do this with the function toInt() applied on the first character which you can get using the substr() function. So a nested function like this will work:
toInt(substr($Your Column$, 0, 1))

So if I have this as data:

I’d get this as results:

To make it even clearer, I’ll do this manipulation in 2 nodes instead of doing a nested function. It’ll be the same results, but with the functions separated, you can see the results at each stage.

Workflow would be like this:

The first manipulation is to get the first character:

I used this expression: substr($column1$, 0, 1), and I save the results to a new column called “first_char”.

Results:

In my second manipulation, I evaluate that first character using the toInt() function:

And this is what I get as results:

So, those that are not numeric will evaluate to empty as you can see.

I then exclude the rows that have missing value in the first_char column, using the Row Filter node:

And there you have it:

You can then do the Cell Splitter as @ArjenEX showed you to complete the 2nd part of your question.

Of course, as I have stated before, you can also do this in 1 manipulation, by using the following nested function:
toInt(substr($column1$, 0, 1))

system · September 15, 2022, 6:38pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.