Search with searchpattern and get postion and founded value

Hi Guys,

I need your help again! :slight_smile:

Now I have the case that I want to search with a search pattern and want to get the starting postion and the founded values in separate columns.

Example Search Pattern: <maybe more text><number><number><. or ,><number><maybe more text>

In Excel VBA I used to search for that with the search pattern: "*##[.,]#*"
In RegEx I used the search pattern: "(.*)[0-9]{2}.[0-9]{1}(.*)"

So my problem to find a complete best practice way to handle that topic in KNIME.

This is how it should look in theory:

Thank you guys in advance!

BR

Andre

Hello @AndreP

You can test the following codes within a String Manipulation node:

term:

regexReplace($value$, ".*?([.,\\d]+).*", "$1")

position:

length(regexReplace($value$, "(.*?)([.,\\d]+).*", "$1")) + 1

BR

3 Likes

Hi @gonhaddock,

than you. I train myself in regex a little bit more and modify your formular a little bit. Because if you have more then one “,”, i got wrong results. Now it’s working perfect.

thanks

But one think I don’t understand.
.* means zero to multiple of any characters
? means zero to one of any character.

Why I need to combined it .*? at the beginning?

BR

Andre

Hello @AndreP
I see in the las row of your example:

But yes, regex becomes a bit picky with casuistic, the code was developed for displayed cases; congrats for dealing with the necessary upgrades to make it work.

A valid alternative, adding a space character ‘\\s’ before the digits group…

term:

regexReplace($value$, ".*?\\s([.,\\d]+).*", "$1")

position:

length(regexReplace($value$, "(.*?\\s)([.,\\d]+).*", "$1")) + 1

‘.’ matches any character (except for line terminators)
‘*’ matches the previous token (‘.’) between zero and unlimited times, as many times as possible, giving back as needed (greedy)

At the beginning (for the 1st capturing group), we aggregate ‘?’ aiming to make it lazy instead of greedy (as by default); if greedy, it will take included until the last occurrence of next group. Because the ‘next capturing group’ has been defined with square brackets, is the occurrence of ‘any’, then because of greedy character, it will include all the characters of the next group (targeting last group if many) but the latest character.

In lazy mood it will respect the first occurrence of a character defined in next capturing group.

BR

2 Likes

Hi @AndreP, I thought I’d continue what @gonhaddock has correctly said about the lazy/greedy mode, and describe what it means using your example data. @gonhaddock is using the more correct terminology such as “gives back” which is more technically accurate than I shall write here in terms of how regex works, but I thought I’d write it in terms of what the effect is rather than how it achieves it.

You are correct that ? can mean “optional” (i.e. zero or one occurrences) but it can also have a different meaning depending on where it appears in the regex. When it follows + or * it is the “lazy quantifier”.

Ordinarily .* would match as many characters as possible (greedy) whilst still adhering to the full regex pattern that follows. Whereas .*? makes it “lazy” and means it tries to find the fewest matches possible whilst still adhering to the pattern that follows.

Taking your string “EU 44,5” as example. You want to collect the 44,5 which has been identified as a sequence of numbers, commas and potentially periods.

The stated regex pattern of “.*?([.,\\d]+).*” will match as follows:

.*? matches the minimum (it is “lazy”) number of characters that allows the successful completion of the rest of the regex expression.
=> so it matches "EU "
(i.e. stopping prior to the first character that can legitimately be matched to the remainder of the regex expression)

([.,\\d]+) matches and captures a sequence of one or more periods, commas or digits, matching as many as it can (it is greedy)
=> so it matches and captures “44,5”

Incidentally, if this had been also made “lazy”, by specifying ([.,\\d]+?) , it would have only captured the first “4”, because the remainder could have been successfully matched by the “greedy” .* that follows it.

.* matches 0 or more characters following what has already been matched, again matching as many as it can whilst allowing the rest of the remainder of regex to succeed (it is greedy). Of course there is no more regex following this, but there are no characters left anyway…
=> so it matches nothing

Conversely,

A regex pattern of “.*([.,\\d]+).*” will match as follows:

.* matches the maximum (it is “greedy”) number of characters that allows the successful completion of the rest of the regex expression.
=> so it matches “EU 44,” because this still allows the next piece of regex to successfully match (the next part of the regex only has to match 1 character to be considered successful!)

([.,\\d]+) matches and captures a sequence of one or more periods, commas or digits. It is greedy, so matches as many as possible.
=> so it matches and captures the one item that has been left available to it… “5”

.* matches 0 or more characters that follow what has already been matched. There is nothing left.
=> so it matches nothing

For more info search for “regex lazy and greedy quantifiers”, or see here

and you can also test this specific example at

(but remember that when typing regex outside of a KNIME string, you only have a single \ in front of the “d”. Alternatively you could use .*?([.,0-9]+).* since 0-9 here is synonymous with \d, and lose the need for having to put in the \ completely.)

2 Likes

Hi @gonhaddock and @takbb ,

thank you very much for this details! Very helpful.
I try to train myself more and more in regex. but sometimes i am still a little bit confusing for me.

So I will ask you some use cases and hopefully you can help me.

Case 1: I try to find numbers with an “c” at the end of the number. But only if after the c is NOT coming a other character.

regexMatcher($values$,“.?([0-9]{2,3}[c]{1})[^a-z].”)
regexReplace($values$,“.?([0-9]{2,3}[c]{1})[^a-z].”,“$1”)

case 2: I try to find numbers 1 to 4 digits long with optional up to 2 decimal place.

regexMatcher($values$,“.?([0-9]{1,4}[./]?[0-9]{0,2}).”)
regexReplace($values$,“.?([0-9]{1,4}[./]?[0-9]{0,2}).”,“$1”)

How would you handle it?

BR

Andre

Hi @AndreP , the problem I can see you have here is that on the face of it, in String Manipulation, the regexReplace and regexMatcher don’t appear to do quite what you want.

regexMatcher can return True or False indicating that the regexMatched, but regexReplace appears incapable of giving you back only those that do match.

regexReplacer will replace the part of the string that matches your regex pattern, or else will return the who original string if it doesn’t match, whereas you want it to return missing if it doesn’t match

So one way to make it behave how you want is to nest the regexReplace inside a “replace()” The inner regexReplace can replace a string which matches your regex with an empty string, and then the outer replace can replace any string which matches what was returned by the inner regexReplace,
with an empty string. Finally an empty string can be returned as “missing” using the toNull function.

e.g. case 1:

toNull(
	replace($column1$, 
		regexReplace($column1$,"^([0-9]+c{1})$","" )
		,"")
	)

image

case 2:

toNull(
	replace($column2$, 
		regexReplace($column2$,"^[0-9]{1,4}(?:\\.[0-9]{1,2})?$","")
		,"")
	)

the above for my decimals… just spotted that you have “,” as decimal, so you’d need to use this…

toNull(
	replace($column2$, 
		regexReplace($column2$,"^[0-9]{1,4}(?:,[0-9]{1,2})?$","")
		,"")
	)

image

How does it work?

Well… before I get into that, I should state for the record that I cheated and I asked chatGPT for the regex that met your requirement :wink:

Ok…quick explanation of the functions…

the inner regexReplace will return “” if it matches your regex, but will return the entire string if it doesn’t.

This means that the outer replace will be executing one of the following:

  1. If a string matches, e.g. “110c”
    replace(“110c”,“”,“”)
    which will return 110c

  2. If a string doesn’t match e.g. “110cm”:
    replace(“110cm”,“110cm”,“”)
    which returns “”

2 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.