RegEx extraction

tazar · October 13, 2021, 3:15pm

Hello guys,
I would need some help with a regular expression extraction (RegEx Split node).
I have tried for days to figure it out but could not find a solution. Neither the KNIME examples nor the Java API references helped me.
I have a string (text) column from which I want to extract a number. The column is not homogeneous in structure, and the number does not always have the same number of digits.

RTRT CVFD Expert SRLYO45656653CityBUBU
Nuai Cuyt Viter S.R.L.YO46756885CountryBGBG
Vujyyu France SRL45692787Street Huissa
Street UPHILL Bakery SRL85434556Hel British

I need to extract the number (the run of digits) by specifying in the RegEx that the digits come after SRL, S.R.L., or YO.
A multistep approach may be necessary, but I still need an idea of how to start.

ipazin · October 13, 2021, 3:21pm

Hello @tazar,

this should work for the given data in the Regex Split node:
.*(?:SRL|S\.R\.L\.|YO)(\d+).*
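
A pattern like this can be checked outside KNIME with a few lines of Python. This is only a sketch: it assumes that Python's re.fullmatch behaves like the node's whole-string matching, and it writes the three prefixes as a non-capturing alternation group, (?:SRL|S\.R\.L\.|YO), because square brackets would denote a character class rather than a choice between words. The sample rows are the ones from the question; the function name extract_number is made up for illustration.

```python
import re

# Alternation of the three prefixes, followed by the captured digit run.
PATTERN = re.compile(r".*(?:SRL|S\.R\.L\.|YO)(\d+).*")

rows = [
    "RTRT CVFD Expert SRLYO45656653CityBUBU",
    "Nuai Cuyt Viter S.R.L.YO46756885CountryBGBG",
    "Vujyyu France SRL45692787Street Huissa",
    "Street UPHILL Bakery SRL85434556Hel British",
]

def extract_number(text):
    """Return the digits after SRL, S.R.L. or YO, or None if nothing matches."""
    match = PATTERN.fullmatch(text)  # the whole string must match
    return match.group(1) if match else None

extracted = [extract_number(row) for row in rows]
print(extracted)  # ['45656653', '46756885', '45692787', '85434556']
```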

Br,
Ivan

tazar · October 13, 2021, 6:46pm

Thanks a lot @ipazin. It seems to be working.

However, there are cases where "YO" appears twice and the regex returns the number after the last one, when it should return the number after the first. I don't know why it ignores the first YO.

ipazin · October 14, 2021, 8:29pm

Hello @tazar,

in that case, make the regex lazy by adding a question mark after the first asterisk (.*? instead of .*). See here for explanation:
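
As a rough illustration of greedy versus lazy matching, the Python sketch below runs both variants on one row. The row is hypothetical (it is not from the original data), and re.fullmatch is assumed to stand in for the node's whole-string matching. A greedy .* consumes as much as it can, so the capture lands on the last prefix; a lazy .*? consumes as little as it can, so the capture lands on the first.

```python
import re

GREEDY = re.compile(r".*(?:SRL|S\.R\.L\.|YO)(\d+).*")
LAZY = re.compile(r".*?(?:SRL|S\.R\.L\.|YO)(\d+).*")

# Hypothetical row containing two YO numbers.
row = "Acme SRLYO11112222 Street YO33334444 Town"

greedy_result = GREEDY.fullmatch(row).group(1)
lazy_result = LAZY.fullmatch(row).group(1)

print(greedy_result)  # 33334444 (number after the last YO)
print(lazy_result)    # 11112222 (number after the first YO)
```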

Br,
Ivan

sjporter · October 14, 2021, 9:42pm

Hey @tazar and @ipazin,

A simpler approach to this problem would be to match all number sequences of N digits or more. This may or may not be feasible depending on the rest of the data in the text blob.

For instance, the regular expression \d{5,8} matches any run of 5 to 8 consecutive digits.
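
A minimal Python sketch of this idea, using re.search and re.findall rather than a KNIME node; the test strings are partly made up for illustration. Note that a run longer than 8 digits is matched only partially, which may or may not be acceptable for the data at hand.

```python
import re

DIGIT_RUN = re.compile(r"\d{5,8}")

# First run of 5 to 8 digits; the trailing "12" is too short to match.
first_run = DIGIT_RUN.search("Vujyyu France SRL45692787Street Huissa 12").group(0)

# All runs in a made-up string; "1234" has only 4 digits and is skipped.
all_runs = DIGIT_RUN.findall("RTRT 1234 SRLYO45656653 and 987654")

# A 9-digit run is cut off after 8 digits.
truncated = DIGIT_RUN.findall("123456789")

print(first_run)  # 45692787
print(all_runs)   # ['45656653', '987654']
print(truncated)  # ['12345678']
```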

Screenshot source: https://regex101.com/

It’s difficult to handle all of the variations and possible typos in RegEx alone if you are dealing with unstructured (a.k.a., “free-form”) text blobs. You might want to preprocess the text before extraction. Preprocessing steps that apply here include (but are not limited to):

Normalizing all letters to lowercase
Removing all special characters
Normalizing N spaces (two or more spaces in a row) to single spaces (" ").
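
The three steps above can be sketched in Python as follows. The function name normalize is hypothetical, and "special characters" is assumed here to mean anything that is neither a word character nor whitespace; the components mentioned in this post may define it differently.

```python
import re

def normalize(text):
    """Lowercase, drop special characters, and collapse repeated spaces."""
    text = text.lower()
    # Assumption: "special" means neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Two or more whitespace characters in a row become a single space.
    text = re.sub(r"\s{2,}", " ", text)
    # Extra step not listed above: trim leading and trailing whitespace.
    return text.strip()

cleaned = normalize("Nuai  Cuyt Viter S.R.L.YO46756885")
print(cleaned)  # nuai cuyt viter srlyo46756885
```

After this step "S.R.L." and "SRL" both become "srl", so a later extraction pattern has one prefix variant fewer to handle.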

By applying simple transformations such as the ones above to your data, you can drastically reduce the number of possible scenarios you must handle with your RegEx. There are a couple of related components in my public space on KNIME Hub for text processing if you’d like to check them out.

Cheers,

@sjporter

ipazin · October 15, 2021, 10:39am

Hello @sjporter,

Definitely a valid approach, which could spare me hours of trying to catch odd variations with regex.

But you have to match the entire string in the Regex Split and String Manipulation nodes, so I am not sure how this alone would work.
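
One possible way around the whole-string requirement is to wrap the digit pattern in a capture group with wildcards on both sides. The Python sketch below uses re.fullmatch as a stand-in for whole-string matching; whether the KNIME nodes behave identically is an assumption.

```python
import re

row = "Vujyyu France SRL45692787Street Huissa"

# The bare pattern cannot cover the whole string, so there is no match.
bare = re.fullmatch(r"\d{5,8}", row)

# Lazy prefix, captured digit run, then anything up to the end.
wrapped = re.fullmatch(r".*?(\d{5,8}).*", row)
number = wrapped.group(1)

print(bare)    # None
print(number)  # 45692787
```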

Br,
Ivan

sjporter · October 15, 2021, 2:10pm

Hey @ipazin,

Here’s an example. I used Python for the RegEx extraction step, but there are a few other options. I also used two components, which are available in my public space on KNIME Hub:

Cheers,

@sjporter

natanaeldgsantos · October 15, 2021, 8:04pm

hello @tazar

some time ago I created the following component to help me with similar tasks.

See if it helps you:

As it is written in Java, I find that it performs much better than Python scripts.

qqilihq · October 17, 2021, 7:16am

Also worth a look is the “Regex Extractor” node, which makes regex extraction as easy as one of those fancy browser-based tools, but right inside the node. More information is available here:

Have fun!
