RegEx extraction

Hello guys,
I need some help with a regular expression extraction (Regex Split node).
I have tried for days to figure it out but could not find the solution. Neither the KNIME examples nor the Java API documentation helped me.
I have a string (text) column from which I want to extract a number. Neither the column's structure nor the number's length (digit count) is homogeneous.

RTRT CVFD Expert SRLYO45656653CityBUBU
Nuai Cuyt Viter S.R.L.YO46756885CountryBGBG
Vujyyu France SRL45692787Street Huissa
Street UPHILL Bakery SRL85434556Hel British

I need to extract the bold string (the digits) by indicating in the RegEx that the digits come after SRL, S.R.L., or YO.
A multistep approach may be necessary, but I still need an idea of how to start the process.

Hello @tazar,

This should work for the given data in the Regex Split node:
.*[SRL|S\.R\.L\.|YO](\d+).*
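One caveat for anyone reusing this pattern: square brackets define a character class, so `[SRL|S\.R\.L\.|YO]` matches a single character from the set S, R, L, |, ., Y, O rather than one of the three prefixes. It still works on the sample rows because each number is directly preceded by one of those letters. Below is a minimal sketch of a stricter variant that uses a non-capturing alternation group. It is written with Python's `re` module, on the assumption that these constructs behave the same way in the Java regex flavor used by KNIME; `extract_number` is a hypothetical helper name.

```python
import re

# (?:...) is a non-capturing group listing the three allowed prefixes.
# The leading .*? is lazy, so the first prefix that is directly
# followed by digits wins.
PATTERN = re.compile(r".*?(?:SRL|S\.R\.L\.|YO)(\d+).*")


def extract_number(text):
    """Return the digits that directly follow SRL, S.R.L. or YO, else None."""
    match = PATTERN.fullmatch(text)
    return match.group(1) if match else None


rows = [
    "RTRT CVFD Expert SRLYO45656653CityBUBU",
    "Nuai Cuyt Viter S.R.L.YO46756885CountryBGBG",
    "Vujyyu France SRL45692787Street Huissa",
    "Street UPHILL Bakery SRL85434556Hel British",
]
numbers = [extract_number(row) for row in rows]
```

With the four sample rows, `numbers` should hold the four 8-digit values from the question.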

Br,
Ivan

Thanks a lot, @ipazin. It seems to be working.

However, there are cases where there are two "YO" markers and the regex returns the number after the last one, when it should return the one after the first. I don't know why it ignores the first YO.

Hello @tazar,

In that case, make the regex lazy by adding a question mark after the first asterisk. See here for explanation:
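As a rough illustration of the difference (a sketch only, separate from the linked explanation), the snippet below runs the greedy and the lazy form of the pattern against a made-up row that contains two "YO" markers. It uses Python's `re` module on the assumption that greedy and lazy quantifiers behave the same way in KNIME's Java regex; the row itself is hypothetical.

```python
import re

# Hypothetical row with two "YO" markers, each followed by digits.
row = "Nuai Cuyt SRLYO46756885CountryYO123BG"

# Greedy: the first .* consumes as much as possible, so the capture
# group ends up on the digits after the last marker.
greedy = re.fullmatch(r".*[SRL|S\.R\.L\.|YO](\d+).*", row)

# Lazy: .*? consumes as little as possible, so the capture group
# ends up on the digits after the first marker.
lazy = re.fullmatch(r".*?[SRL|S\.R\.L\.|YO](\d+).*", row)

greedy_number = greedy.group(1)
lazy_number = lazy.group(1)
```

Here the greedy pattern captures `123`, while the lazy one captures `46756885`.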

Br,
Ivan

Hey @tazar and @ipazin,

A simpler approach to this problem would be to match all number sequences of N digits or more. This may or may not be feasible depending on the rest of the data in the text blob.

For instance, the regular expression \d{5,8} matches any run of digits between 5 and 8 digits long.
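A minimal sketch of that idea in Python, using the sample rows from the question; `first_digit_run` is a hypothetical helper name, and the 5 to 8 range is simply the one from the example above.

```python
import re

# \d{5,8} matches a run of 5 to 8 consecutive digits anywhere in the text.
DIGIT_RUN = re.compile(r"\d{5,8}")


def first_digit_run(text):
    """Return the first run of 5 to 8 digits in text, or None if there is none."""
    match = DIGIT_RUN.search(text)
    return match.group(0) if match else None


rows = [
    "RTRT CVFD Expert SRLYO45656653CityBUBU",
    "Nuai Cuyt Viter S.R.L.YO46756885CountryBGBG",
    "Vujyyu France SRL45692787Street Huissa",
    "Street UPHILL Bakery SRL85434556Hel British",
]
runs = [first_digit_run(row) for row in rows]
```

Note that a run longer than 8 digits would still match, but only its first 8 digits would be returned, so the upper bound needs to fit the data.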


Screenshot source: https://regex101.com/

It’s difficult to handle all of the variations and possible typos in RegEx alone if you are dealing with unstructured (a.k.a., “free-form”) text blobs. You might want to preprocess the text before extraction. Preprocessing steps that apply here include (but are not limited to):

  • Normalizing all letters to lowercase
  • Removing all special characters
  • Normalizing N spaces (two or more spaces in a row) to single spaces (" ").

By applying simple transformations such as the ones above to your data, you can drastically reduce the number of possible scenarios you must handle with your RegEx. There are a couple of related components in my public space on KNIME Hub for text processing if you'd like to check them out.
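The three preprocessing steps listed above could be sketched in Python roughly as follows. This is not the code of the KNIME Hub components; `preprocess` is a hypothetical helper, and it assumes that "special characters" means anything other than ASCII letters, digits, and whitespace.

```python
import re


def preprocess(text):
    """Lowercase, drop special characters, and collapse repeated spaces."""
    # Step 1: normalize all letters to lowercase.
    text = text.lower()
    # Step 2: remove special characters. Assumption: keep only a-z,
    # 0-9 and whitespace, which also drops accented letters.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Step 3: collapse two or more spaces in a row into a single space.
    text = re.sub(r" {2,}", " ", text)
    return text


cleaned = preprocess("Nuai  Cuyt Viter S.R.L.YO46756885")
```

After this step, "S.R.L." and "SRL" both become "srl", so the extraction pattern only has to handle one spelling.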

Cheers,

@sjporter

Hello @sjporter,

Definitely a valid approach which could spare me hours trying to catch odd variations with regex :sweat_smile:

But the Regex Split and String Manipulation nodes require the pattern to match the entire string, so I'm not sure how this alone would work? :confused:
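To make the concern concrete, here is a small Python sketch in which `re.fullmatch` stands in for a node that requires the whole string to match (an assumption about the node behavior described above, not KNIME code). The bare pattern fails, while the same pattern wrapped in a lazy prefix, a capture group, and a trailing `.*` succeeds.

```python
import re

row = "Vujyyu France SRL45692787Street Huissa"

# The bare pattern cannot match the entire string, so this is None.
bare = re.fullmatch(r"\d{5,8}", row)

# Wrapping the digit run in a capture group, with a lazy prefix and a
# trailing .*, lets the pattern cover the whole string.
wrapped = re.fullmatch(r".*?(\d{5,8}).*", row)
number = wrapped.group(1) if wrapped else None
```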

Br,
Ivan

Hey @ipazin,

Here’s an example :slight_smile: I used Python for the RegEx extraction step, but there are a few other options. I also used two components which are available in my public space on KNIME Hub:

Cheers,

@sjporter

Hello @tazar,

Some time ago I created the following component to help me with similar tasks.

See if it helps you:

Since it is written in Java, I find that it performs much better than Python scripts.

Also worth a look is the "Regex Extractor" node, which makes regex extraction as easy as one of those fancy browser-based tools, but right inside the node. More information is available here:

Have fun!