regex insert seperator if meets condition

Hi all,

can anybody give me some hints to solve my little (?) regex problem . . .

I have a string like this:
Net Debt;-;-;-;-;-;-;-;Net Cash position;280;-;336;-;-;-;-;Leverage (Debt / EBITDA);-3,09x;-;-4,79x;-;-;-;-;Free Cash Flow1;96,9;75,3;16,0;265;108;79,1;95,3;

I test the string with regex (e.g. String Manipulation node): “;[A-Z]”

So, if it matches the regex condition (e.g. “;N” or “;L” or “; F”) I want to insert a seperator e.g. “###” just before the capital N, L and F (the real dataset many capital characters)

Output must be:
Net Debt;-;-;-;-;-;-;-;###Net Cash position;280;-;336;-;-;-;-;###Leverage (Debt / EBITDA);-3,09x;-;-4,79xx;-;-;-;-;###Free Cash Flow1;96,9;75,3;16,0;265;108;79,1;95,3;

I tried regexReplace but (of course) it replaces e.g. “;N” by my seperator (I must keep “;N”).

With the Regex Extractor I get an extra column with the relevant character combinations. I can take that for further manipulation but it’s a bit cumbersome

So, is there a direct method to realised my wished output?

THNX in advance

Try string manipulation with

regexReplace($column1$, ";([N|L|F])","###$1" )
2 Likes

Hello @sanderlenselink
You can test the following code within a ‘String Manipulator’ node (one single node solution);

substr(
	regexReplace($text$, "(?:(.*?[\\S+]+[\\s+]?)[;])", ";###$1")
	, 4
	, length( regexReplace($text$, "(?:(.*?[\\S+]+[\\s+]?)[;])", ";###$1") ) - 4  
	)

The substr() ‘function’ has been allocated aiming to remove the separatot (4 characters) in the first position.

Otherwise you can use two consecutive nodes with the following codes:

regexReplace($text$, "(?:(.*?[\\S+]+[\\s+]?)[;])", ";###$1")
substr($regexresult$, 4, length($regexresult$) - 4 )

BR

1 Like

@Daniel_Weikert and @gonhaddock . . . you both thank you very much!!!

Regarding Daniels’s solution I modified it a bit regexReplace($column1$, “;([A|B|C|D|E etc …|Z])”,“###$1” ) A bit quick & dirty but as a non-programmer it covers all possible situations.

Gon’s solution is more sophisticated but the result is the same as my (quick & dirty) ABCDE . . . XYZ modification.

A little additional question . . .
For testing Regex I use https://regex101.com/ . . . see screenshot 1. The code \D{1}+\d; matches partly what I need. However in the String Manipulation node KNIME doesn’'t find “EPS2”.
Any idea why?

To be clear, what I need exactly is that if text precedes “2;” the “2” will be removed.
The difficulty is that the string contains also values like “2022”.
So, EPS, Share and XYZ will be produced instead of EPS2, Share2 and XYZ2

Because of my lack of Regex . . . knows someone a good website or book/handout to understand (learn) Regex much better. Or is it just practice, practice and practice :smile:

THNX in advance

Screenshot 1

Screenshot 2
image

Hello @sanderlenselink
The best solutions for me are the ones that i understand the better, to feel comfortable with it. So in your case it is logic your choice.

Say so and generally speaking, when growing into complexity (not the case), too much literality drives into a less robust solutions…

Moving into your last challenge, I’m not sure if I fully understand it. Would this coding match your expected output? You want to remove only the last digit at the end of the string. Even if there are more than one digit it has to work. Is it?

I’ve replaced EPS2 with EPS2021 as you see in the picture…

Sometimes It si more about isolate wat you want to keep $1 and $2 (middle group $2 is replaced with ‘;’); Then you need to make groups with the whole text. Once in String Manipulation, regexReplace works like intra-cell ‘split and merge’ loop function :thinking: Does it make sense?

Next question is regarding the translation from regex101 to KNIME. Aiming to make it work, as you are coding java into ‘String Manipulation’ node; you need to add scape bars for the reverse slashes, otherwise they are interpreted as scape bars. The code text turns into magenta once validated within String Manipulation node; as you see in your picture, text color is in grey.

You can test with:

regexReplace($text$, "(.*?\\D+)(\\d+;)(\\S+)", "$1;$3")

BR

1 Like

Regarding the translation, on the website you can also click on Code Generation, select Java and copy the content of final String regex. That is directly compatible with KNIME.

Example:

1 Like

@ArjenEX
Do you think ChatGPT can do it? We would need another AI to translate the logics to it :joy:

Thank you for the tip, I didn’t know about that functionality.

Very likely, GPT knows KNIME pretty well. If you ask for a particular use case it will draft a workflow in terms of certain KNIME nodes to use.

@gonhaddock . . . thnx for you explanation and example

the solution you suggest is not fully what I’ve in mind (my fault, I was not clear enough).

The actual and target situation is like this:

There is a kind of name followed by (in this example) 3 values. Sometimes (but not always) a digit is connected to the name and even a digit with parentheses. I want to get rid of those numbers and parentheses as you can see in the Target string.

Regarding . . .

I doubt whether I understand what you would like to say . . . I’m sorry :bowing_man:

Data:
Company ACTUAL String
Period;2021;2022;2023;EPS2;15;16;16;PE;2;3;4;EV(1);500;600;700
Period;2022;2023;2024;EPS;12;16:13;EV(2);1000;2000;3000
Period;2020;2021;2022;EPS1;9;8;7;EV:300;200;100

TARGET string
Period;2021;2022;2023;EPS;15;16;16;PE;2;3;4;EV;500;600;700
Period;2022;2023;2024;EPS;12;16:13;EV;1000;2000;3000
Period;2020;2021;2022;EPS;9;8;7;EV:300;200;100

@ArjenEX . . . thnx for your hint . . . little by little I’m growing :wink:

Hello @sanderlenselink

No matter. I was just typing wile thinking…

  1. As you see these regex codes are quite sensitive, so providing detailed examples input-output matters.
  2. Better to open a topic for each challenge, afterwards it would be a difficult problem for you to assign solution badges

Said that, this may work…

regexReplace($text$, "(EPS|EV)(.*?;)", "$1;")

BR

PS.- This one for the colons’ typo in the latest EV

(EPS|EV)(.*?[;:])
1 Like

Hi guys,

Some tips…

\d+ means digitals numbers with 1 or more numbers…
\w+ means words/letters -_ with 1 oir more chars…

in your case, you have both together… \w+\d will wait some letters and numbers combined…

BUT I go a step before…

If you have this information from a file, use de CSV to bring it as columns, it can solve a lot of problems about it and you can see a specific column from a value or test it better…

OR can use the Cell splitter node to break it with the “;” as separator too…

What do you think? better to solve it?

Seeya,

Denis

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.