String Manipulation Regex?

Hello,
I'm trying to find a way to split a column so that I can extract the municipality:
Example:
Sint-Willibrordusplein 4 3550 Heusden-Zolder
Henri Horriestraat 31 8800 Roeselare
Dreef 1 3220 De Holsbeek
Arthur Dezangrélaan 17 1950 Kraainem

The municipality is the last part of each line (e.g. Heusden-Zolder or De Holsbeek). It always comes after the 4-digit Belgian postal code.
Do you know how this can be solved? Using Regex?

Thank you!

By the way, KNIME is fantastic. Keep up the fantastic work!


Hi bluena,

I recommend having a look at the Palladian toolkit plugin. It offers a user-friendly Regex Extractor node with an instantaneous preview, which lets you build regular expressions for such cases very easily (see here for more details about the recent release).

Here’s an example in action:

The regular expression which I used here is:

.*\d{4}\s(?<city>.*)

This means:

  1. match arbitrary characters (as many as possible),
  2. followed by exactly four digits,
  3. followed by a whitespace character,
  4. followed by the city – this part is in a named capturing group, (?<city>...), so that the group name city is used as the output column name
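
For readers who want to try the pattern outside KNIME, here is a minimal Python sketch of the same idea. It is only an illustration, not the Regex Extractor node itself: Python writes a named group as (?P<city>...) rather than (?<city>...), and the helper name extract_city is made up for this example.

```python
import re

# Same pattern as in the post; only the named-group spelling differs in Python.
CITY_PATTERN = re.compile(r".*\d{4}\s(?P<city>.*)")

addresses = [
    "Sint-Willibrordusplein 4 3550 Heusden-Zolder",
    "Henri Horriestraat 31 8800 Roeselare",
    "Dreef 1 3220 De Holsbeek",
    "Arthur Dezangrélaan 17 1950 Kraainem",
]


def extract_city(address):
    """Return the text after the last 4-digit block, or None if nothing matches."""
    match = CITY_PATTERN.match(address)
    return match.group("city") if match else None


cities = [extract_city(address) for address in addresses]
print(cities)
```

Because the leading .* is greedy, the four digits that get matched are the last such block in the line, so a house number in front of the postal code does not get in the way.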

You can of course set additional capturing groups to split the street, zip code, …
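
As a rough sketch of what such additional groups could look like, here is a Python version with three named groups. The group names street and zip and the helper split_address are assumptions for illustration (only city appears in the post), and Python's (?P<name>...) spelling is used instead of (?<name>...).

```python
import re

# Hypothetical extension of the pattern: street, 4-digit postal code, city.
ADDRESS_PATTERN = re.compile(r"(?P<street>.*)\s(?P<zip>\d{4})\s(?P<city>.*)")


def split_address(address):
    """Return a dict with street, zip and city, or None if the line does not match."""
    match = ADDRESS_PATTERN.match(address)
    return match.groupdict() if match else None


parts = split_address("Dreef 1 3220 De Holsbeek")
print(parts)
```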

You can find the workflow on my NodePit Space:

Hope this helps!

Philipp


Thank you for the clear and useful answer! This Regex Extractor node will for sure be useful in the future. Out of curiosity, once the regex is identified (in our case ".*\d{4}\s(?<city>.*)"), how can we use it in the “Regex Split” node? I tried with no luck. Thanks again Philipp!

Hey Bluena,

You can use the following line with the String Manipulation node:
regexReplace($column1$, ".*[0-9]{4}(.*)", "$1")

Make sure to replace $column1$ with your column name.
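
To see what this replacement does, here is a small Python sketch using re.sub. It is an analogy under assumptions, not KNIME code: Python writes the back-reference as \1 instead of $1, and the variable names are made up. Note that with this exact pattern the captured text starts with the space that follows the postal code; adding \s before the group, as in the Regex Extractor example, avoids that.

```python
import re

address = "Henri Horriestraat 31 8800 Roeselare"

# Pattern from the post: everything up to and including the last 4 digits
# is dropped, and the captured remainder is kept.
raw = re.sub(r".*[0-9]{4}(.*)", r"\1", address)

# Variant that also consumes the whitespace after the postal code.
trimmed = re.sub(r".*[0-9]{4}\s(.*)", r"\1", address)

print(repr(raw))
print(repr(trimmed))
```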


Thank you Nicks


Are you sure about the $1 at the end?

The $1 should be within quotes, "$1", as indicated above.


Yes, I tried that as well, but I also get an error.


Your quotes in the first expression are not correct. Try typing them yourself instead of copy-pasting them from the post.
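
If retyping is tedious, pasted text can also be cleaned up before it goes into the node. The following Python sketch shows one possible way to turn typographic quotes back into straight ones; the helper name straighten_quotes is hypothetical and this is not a KNIME feature.

```python
# Map typographic (curly) quotes to their plain ASCII counterparts.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(expression):
    """Return the expression with curly quotes replaced by straight quotes."""
    return expression.translate(CURLY_TO_STRAIGHT)


pasted = "regexReplace($column1$, \u201c.*[0-9]{4}(.*)\u201d, \u201c$1\u201d)"
fixed = straighten_quotes(pasted)
print(fixed)
```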


Indeed, with the correct quotes the regex runs smoothly :) Thank you!
However, the result is not as intended. I'm only interested in the municipality, i.e. what comes after the 4-digit postal code. For line 1, that would be only "Brugge". It feels like we're almost there!


(The screenshot above shows the result of the above Regex, not the desired result)

The expression is still not correct :P You are missing the two asterisks.
Check the screenshot below to make it clearer:
image
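
To illustrate why the asterisks matter, here is a Python sketch comparing the two patterns. It assumes the faulty expression was typed as .[0-9]{4}(.), i.e. the original with both asterisks dropped, and it uses Python's re.sub with \1 in place of regexReplace with $1.

```python
import re

address = "Sint-Willibrordusplein 4 3550 Heusden-Zolder"

# Without the asterisks, each "." matches exactly one character, so only
# " 3550 " is matched and replaced by the single captured character.
without_asterisks = re.sub(r".[0-9]{4}(.)", r"\1", address)

# With the asterisks, the whole line is matched and only the captured tail remains.
with_asterisks = re.sub(r".*[0-9]{4}(.*)", r"\1", address)

print(repr(without_asterisks))
print(repr(with_asterisks))
```

In the first case the street and house number stay in the output and only the postal code disappears, which fits the description of a result that is "not as intended".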


Ooops! Thank you Nicks for your patience!

