Regex to split a string into multiple parts

Hello,
I have some strings like the ones below:

EXAMPLE SPA (Tss-Ex Esa-Diretta Hub Emilia Romagna)1321GIURELLI STEFANO S.A.
EXAMPLE TEAM s.r.l.8060493PET EXAMPLE DI EXAMPLE EXAMPLE
SER.IN EXAMPLE INFORMATI115340812EXAMPLE

I would like to find a smart way within KNIME to split each string into multiple columns, like this:

column1 -> EXAMPLE SPA (Tss-Ex Esa-Diretta Hub Emilia Romagna)
column2 -> 1321
column3 -> GIURELLI STEFANO S.A.

column1 -> EXAMPLE TEAM s.r.l.
column2 -> 8060493
column3 -> EXAMPLE DI EXAMPLE EXAMPLE

column1 -> SER.IN EXAMPLE INFORMATI
column2 -> 115340812
column3 -> EXAMPLE

A regex would probably be the way to go?
Thanks in advance.

You could do that with a Regex Split node:
(^[^0-9]+)(\d{2,})(.*)

=> It looks like the number is always the point where your strings split, so I used a run of two or more digits as the ‘identifier’. You would have to test a lot of examples to see whether that always works; sometimes you have to add something to handle special cases, especially very short lines, or lines where nothing comes before the number but the number should still go to the second column. I have not tested all the possibilities.


Each pair of parentheses () defines a group, and each group becomes its own column.

first group:
(^[^0-9]+) => from the start of the string (^), one or more characters that are not digits 0-9

second group:
(\d{2,}) => a run of two or more digits

third group:
(.*) => anything after the second group
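
For anyone who wants to check the pattern outside KNIME, here is a minimal Python sketch that applies the same three groups to two of the sample strings from the question. It is only an illustration: the function name `split_row` is made up for this example and is not part of KNIME, and the sketch assumes Python's `re` module behaves like the Regex Split node for this pattern, which uses only basic syntax.

```python
import re

# Same three groups as above: text without digits, a run of 2+ digits, the rest.
PATTERN = re.compile(r"(^[^0-9]+)(\d{2,})(.*)")


def split_row(text):
    """Return (column1, column2, column3), or None when the pattern does not match."""
    match = PATTERN.match(text)
    return match.groups() if match else None


samples = [
    "EXAMPLE SPA (Tss-Ex Esa-Diretta Hub Emilia Romagna)1321GIURELLI STEFANO S.A.",
    "SER.IN EXAMPLE INFORMATI115340812EXAMPLE",
    "115340812EXAMPLE",  # nothing in front of the number
]

for sample in samples:
    print(split_row(sample))
```

The last sample has nothing in front of the number, so the first group cannot match and the function returns None. That is the kind of special case mentioned above that would need an adjusted pattern.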

You could test your Regex here

split_regex.knwf (11.2 KB)

Hello @mlauber71,
thank you for the great example and for the regex explanation.
Very much appreciated!

What if I also have stranger cases? For example:

EXAMPLE Spa (TS2)1568220022EXAMPLE SRL
4WD EXAMPLE SRL1568330227EXAMPLE S.R.L.
EXAMPLE & 2EXAMPLE1568330227EXAMPLE S.R.L.

In these cases the first part can itself contain a digit, although we assumed it would contain only non-digit characters. Is there a way, still with regex syntax, to tell it to skip over such a digit and not split at that point?

Maybe we should take the length into account. For example, if there is a stray number within the first part, we could tell the regex to skip it when the digit sequence is only 1 or 2 characters long. I’m pretty sure there are better solutions than my suggestion.

Let me know what you think and thanks again for the help.
~gujo

I thought about this and came up with a solution: I changed the first group to the pattern below. You might have to toy around with it to see whether it matches all your cases. If you have more complicated addresses, it could be that there is no single or easy RegEx solution.

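first group:
(^\d{0,1}\D{1,}+\d{0,1}\D{0,}+) => at the start of the string: zero or one digit, then one or more non-digits, then zero or one digit, then zero or more non-digits. The second and third groups stay as before. One caveat: if the text part has no stray digit, the second \d{0,1} takes the first digit of the long number (e.g. "4WD EXAMPLE SRL1" and "568330227"), so check such rows.

^ => start of string

\d => digit [0-9]
\D => non-digit
{1,}+ => the extra + makes the quantifier possessive (it does not give characters back)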
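
To see what the changed first group does, the sketch below runs it in Python against the three "stranger" strings. It rests on two assumptions. First, the second and third groups are taken unchanged from the first answer. Second, the possessive `+` after `{1,}` and `{0,}` is dropped, because Python's `re` module only accepts possessive quantifiers from version 3.11 on; on these three inputs the plain greedy form gives the same result. The `ALTERNATIVE` pattern is a hypothetical variation, not taken from the attached workflow: it accepts a stray digit in the first column only when at least one non-digit follows it.

```python
import re

# First group from the reply, written with plain greedy quantifiers.
# Groups two and three are assumed unchanged from the first answer.
REPLY = re.compile(r"(^\d{0,1}\D{1,}\d{0,1}\D{0,})(\d{2,})(.*)")

# Hypothetical variation (not from the workflow): a stray digit only counts
# as part of column 1 when at least one non-digit follows it.
ALTERNATIVE = re.compile(r"(^\d?\D+(?:\d\D+)*)(\d{2,})(.*)")


def split_row(pattern, text):
    """Return the three captured groups, or None when there is no match."""
    match = pattern.match(text)
    return match.groups() if match else None


samples = [
    "EXAMPLE Spa (TS2)1568220022EXAMPLE SRL",
    "4WD EXAMPLE SRL1568330227EXAMPLE S.R.L.",
    "EXAMPLE & 2EXAMPLE1568330227EXAMPLE S.R.L.",
]

for sample in samples:
    print("reply:      ", split_row(REPLY, sample))
    print("alternative:", split_row(ALTERNATIVE, sample))
```

Both patterns split the first and third strings as intended. On the second string they differ: the variation keeps 1568330227 together in the second column. Neither pattern has been tested beyond these examples, so it is worth running real data through before relying on either.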

split_regex2.knwf (12.0 KB)

Hello @mlauber71,

thanks for the reply! Your second suggestion worked like a charm.

Cheers,
~gujo
