Regex challenge . . .

sanderlenselink · September 15, 2023, 2:26pm

Hi all,

I’m not that strong with Regex and therefore I beg your help.

I have a long string and want to split it. The lenght of te string varies day to day . . .

Start string e.g.
NOVC.DUNovo Nor - disk A/SSep 13, 2023N2.00 - 1.00XNONOFNovo Nordisk A/SSep 13, 2023N2.00 - 1.00KOVC.BENovo Nordisk A/SSep 13, 2023N2.00 - 1.00ZNNW.DUNaspers LtdSep 13, 2023N5000.00 - 1.00

End solution
NOVC.DUNovo Nor - disk A/SSep 13, 2023N2.00 - 1.00;
XNONOFNovo Nordisk A/SSep 13, 2023N2.00 - 1.00;
KOVC.BENovo Nordisk A/SSep 13, 2023N2.00 - 1.00;
ZNNW.DUNaspers LtdSep 13, 2023N5000.00 - 1.00;

My mental approach . . .
(1) find the position of " - " because this is unique if it is between digits
(2) after knowing the position of (1) I need the position of the first alpha numerical character (in the example “N” but and then "X’ , “K” cq “Z” . . . important to know that that first alpha character varies from A-Z))
(3) insert a seperator (e.g. "; ') just before the position of (2)

Hopefully someone has an idea how solve this?

-Sander

gonhaddock · September 15, 2023, 2:58pm

Hello @sanderlenselink
You can test the following code with a String Manipulation node to insert the ’ ; ’

Single line:

regexReplace($column1$, "(-\\s\\d\\S{3})([A-Z]+)", "$1;$2")

‘Cell Splitter’ and unpivot of arrays can be used afterwards, for multi-row split goal

Multiline:

regexReplace($column1$, "(-\\s\\d\\S{3})([A-Z]+)", "$1;\n$2")

BR

P.S. Sequence explanation:
1st capturing group $1: one dash -, one white space character \s, one digit \d ,three non white space characters \S{3}
2nd capturing group $2: any sequence of characters in range A-Z [A-Z]+

Replacement: $1;\n$2,

sanderlenselink · September 15, 2023, 4:22pm

Hi @gonhaddock

thnx for you supurb fast feedback . . .

It works in most cases but with “\S{3}” the script assumes that after " - " follows alway 3 non white space characters.

However, sometimes there follow 4, 5 or 6 non white space characters. The problem is that how many n-w-s characters follow.
(therefore I was thinking about rule (2) of my mental approch and counting the position of the respective characters)

So, could there be a general solution irrespective the number of n-w-s characters?
Or e.g. less than let’s say 10 characters

-Sander

= = =
I modified the example a bit with dynamic n-w-s characters

NOVC.DUNovo Nor - disk A/SSep 13, 2023N2.00 - 1.00XNONOFNovo Nordisk A/SSep 13, 2023N2.00 - 1.004KOVC.BENovo Nordisk A/SSep 13, 2023N2.00 - 1.0034ZNNW.DUNaspers LtdSep 13, 2023N5000.00 - 1.0320

gonhaddock · September 15, 2023, 4:54pm

Hello @sanderlenselink
I’ve modified the sequence by modifying the \S{3,}? quantifier adding a comma (3 or more of \S) and making it lazy with a ‘question mark’

The modified code looks like this:

Multiline:

regexReplace($column1$, "(-\\s\\d\\S{3,}?)([A-Z]+)", "$1;\n$2")

regex101.com

BR

sanderlenselink · September 16, 2023, 6:13pm

Hi @gonhaddock

This is really great !

It looks fine for me to continue with my flow. Regex is magic but requires a lot of practice.

However, your solution is a perfect learning stuff.

THNX

system · September 23, 2023, 6:13pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.