How to extract a number from a text column

Dear all,

I would like to extract a number from a text-column which contains many other numbers.
Example => Date: 01.01.2024 / 11:55:22 submitted by example company XY 123/456/789 80 1.111.111,50 81 222.222,22
Number to extract => 1.111.111,50 & 222.222,22
I always need the number following ‘80’ and ‘81’. But not all columns contains that numbers.
Does it work with the ‘column expression’ node?

I have no idea how to do this.

Thank you in advance if you can help!

Hello @awanninger

Welcome to KNIME Community.
I’m not sure how your database and all expectations look like, thus I used ‘Date: 01.01.2024 / 11:55:22 submitted by example company XY 123/456/789 80 1.111.111,50 81 222.222,22’ string as the input data. Then, I have used Cell Spitter nodes twice with:

  1. '80 ’ (please mind the space following 80) as delimiter referrring to input data column;
  2. '81 ’ as delimiter referring to the very last column, created after cell split.

I was able to extract both expected values:

Please validate my approch on your dataset.
Happy KNIMEing,
Kaz

Hello @Kazimierz

Thank you for helping me.
I’ve already tried the cell splitter, but sometimes the string is longer or shorter that the example or there are more numbers before the first ‘80’.
At the end the columns are filled differently.

In the above proposal, lenght of string and lenght of sub-strings are not important, so '80 ’ and '81 ’ delimiters splits the string. Spaces at the end of delimiters ensure that just ‘80’ and ‘81’ numbers won’t split cells.
Well, small improvement: you could use ’ 80 ’ and ’ 81 ’ with leading and following spaces to be more precise.

Hi,
If you want to do it with regex, something like this should work in the Regex Split node:

.*?(?:80 (\d+(?:\.\d+)*(?:,\d+)?))?(?: ?81 (\d+(?:\.\d+)*(?:,\d+)?))?

It yields:

Kind regards,
Alexander

2 Likes

Hello @AlexanderFillbrunn

thank you for your help.
It works very well, but if there is more text after the numbers, it no longer extracts the numbers.
Example => Date: 01.01.2024 / 11:55:22 submitted by example company XY 123/456/789 80 1.111.111,50 81 222.222,22 xyz 12345 xyz

Does it also work in the case of a negative number? (Example 80 -12.345 → Result -12.345)

Thanks in advance.

Hi,
I think if you just add .*? at the end of the regex, it will catch your rest.
For negative numbers, you need this (the -?):

.*?(?:80 (-?\d+(?:\.\d+)*(?:,\d+)?))?(?: ?81 (-?\d+(?:\.\d+)*(?:,\d+)?))?.*?

Kind regards,
Alexander

1 Like

@AlexanderFillbrunn
I tried this solution, but with them it extracts not a single number anymore

Hi,
You are right and it is weird, because according to regex101, a very helpful online tool for regex, it should work: regex101: build, test, and debug regex.
Let me investigate and come back to you when I have more info.
Kind regards,
Alexander

@awanninger This is a perfect case for the Regex Extractor which I built for the Palladian Nodes. You can edit and preview your regex right in KNIME with your actual data and immediately see the results - no tedious trial and error and back and forth:

You have plenty of output options, ranging from rows, columns, collection columns, JSON, etc.

The regex which I used here is: (?:80|81)\s(?<extract>[0-9.,]+) which translates to:

  • start with the string 80 or 81
  • followed by a space
  • followed by a “numeric” expression to be put into the variable extract which can consist of the characters 0-9, ., and ,.

The Palladian extension is available here:

The sample workflow is available on my NodePit Space:

3 Likes

Hi,
A colleague gave me a hint. You can use .*?(?:80 (-?\d+(?:\.\d+)*(?:,\d+)?))?+(?: ?81 (-?\d+(?:\.\d+)*(?:,\d+)?))?+ as the regex and that works.
Kind regards,
Alexander

2 Likes

@AlexanderFillbrunn
That works!
Thanks a million for your help.

@qqilihq
I don’t have this node, but I’ll definitely take a look at it. It looks pretty simple.
Thank you.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.