I need some help with a regular expression extraction (Regex Split node).
I have tried for days to figure it out but could not find a solution. Neither the KNIME examples nor the Java API documentation helped me.
I have a string (text) column from which I want to extract a number. The column is not homogeneous in structure, and the number does not have a fixed number of digits.
RTRT CVFD Expert SRLYO45656653CityBUBU
Nuai Cuyt Viter S.R.L.YO46756885CountryBGBG
Vujyyu France SRL45692787Street Huissa
Street UPHILL Bakery SRL85434556Hel British
I need to extract the digit string (for example, 45656653 in the first row) by indicating in the RegEx that the digits come after one of the markers SRL, S.R.L., or YO.
A multistep approach may be necessary, but I still need an idea of how to start.
It’s difficult to handle all of the variations and possible typos in RegEx alone if you are dealing with unstructured (a.k.a., “free-form”) text blobs. You might want to preprocess the text before extraction. Preprocessing steps that apply here include (but are not limited to):
Normalizing all letters to lowercase
Removing all special characters
Normalizing N spaces (two or more spaces in a row) to single spaces (" ").
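A minimal sketch of these three steps, written in Python for illustration rather than as a KNIME workflow. The function name normalize is hypothetical, and "special characters" is assumed here to mean anything that is not a letter, digit, underscore, or whitespace.

```python
import re

def normalize(text: str) -> str:
    """Apply the three preprocessing steps listed above (hypothetical helper)."""
    text = text.lower()                   # 1. lowercase all letters
    text = re.sub(r"[^\w\s]", "", text)   # 2. drop special characters (assumed: non-word, non-space)
    text = re.sub(r" {2,}", " ", text)    # 3. collapse two or more spaces into one
    return text.strip()

print(normalize("Nuai Cuyt Viter S.R.L.YO46756885CountryBGBG"))
# prints: nuai cuyt viter srlyo46756885countrybgbg
```

After this step, S.R.L. and SRL both become srl, so a later extraction pattern only has to handle one spelling.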
By applying simple transformations such as the ones above to your data, you can drastically reduce the number of scenarios you must handle with your RegEx. There are a couple of related components in my public space on KNIME Hub for text processing if you’d like to check them out.
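As a starting point for the extraction itself, here is a hedged sketch in Python that works directly on the raw sample rows from the question. It assumes the markers are exactly SRL, S.R.L., and YO (in any letter case) and that the digits follow the marker immediately or after optional whitespace. The names PATTERN and extract_number are hypothetical.

```python
import re

# Assumed marker list taken from the question: SRL, S.R.L., YO (case-insensitive).
# The capture group holds the digit run that directly follows a marker.
PATTERN = re.compile(r"(?:S\.?R\.?L\.?|YO)\s*(\d+)", re.IGNORECASE)

def extract_number(text: str):
    """Return the first digit run that follows a marker, or None if there is none."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

rows = [
    "RTRT CVFD Expert SRLYO45656653CityBUBU",
    "Nuai Cuyt Viter S.R.L.YO46756885CountryBGBG",
    "Vujyyu France SRL45692787Street Huissa",
    "Street UPHILL Bakery SRL85434556Hel British",
]
for row in rows:
    print(extract_number(row))
# prints 45656653, 46756885, 45692787, 85434556 (one per line)
```

If the text has already been lowercased and stripped of special characters, the pattern could shrink to (?:srl|yo)\s*(\d+). The syntax used here (non-capturing group, \d, \s) is also valid in Java regular expressions; how the Regex Split node applies a pattern, for example whether it has to cover the whole cell, is best checked in the node description.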