Regex Split

#1

Hey All! I am going crazy over a Regex issue (again). I generally have pretty good luck with Regex but sometimes getting it to work properly in KNIME presents difficulty.

I have a cell this this text on a single line:

[Smith, Jerome (Specialist)], [Smith, Bo (Manager)], [Jones, Joanna (Business Partner)], [Smith, Loretta (Talent Acquisition Partner)], [Jones, Lori (Supervisor)], [Jones, Kristen (Partner)], [Jones, Tina (Partner)], [Diaz, Teodora (Manager)], [Diaz, Kellie (Manager)], [Smith, Christopher (Supervisor)]

I am using this regex (which works fine in RegexPal and Regex 101) in the Regex Split node (Version 4.0)

([(?:\w*|\w*-\w*|\w*\s\w*)(?:,\s\w*\s(Manager)]))+

I just want the “Manager” entries to be split out of the cell into their own cells but for the life of me I cannot get it to work. It will sometimes work if the data is at the front of the line or if there are two entries next to each other. I just don’t know what I’m doing wrong.

0 Likes

#2

The problem with using the Regex Split node is that it creates N new columns where N is the number of groups defined by your regex; unfortunately, the regex satisfying a variable length match is by necessity a variable length regex (e.g (\[[^\]]*\(Manager\)\])+).

I’ve attached a Java solution, though perhaps someone can do a better, non-coding workflow.
tardis_regex.knwf (7.7 KB)

4 Likes

#3

@quaeler

Thank you! This is fantastic! It’s a shame the Regex Split node can’t do it.

0 Likes

#4

Hi @TardisPilot,

You have a string that may contain different formats (different number of managers in different orders).
To split a string using Regex Split node you need to know the format of the string and input the proper pattern. It is easy to use the example format and split the string using a Regex Split but I’m pretty sure that’s not what you are looking for.

What you are trying to do is what String Manipulation and Cell Splitter nodes can do.

Use this expression in a String Manipulation node:
regexReplace(regexReplace($column1$, ".*?(?<=\\[)([a-zA-Z, ]*)(?= \\(Manager\\))|.*", "$1_"), "_[^a-zA-Z]+", "")

And then use a Cell Splitter node with “_” (underscore) as the delimiter to split the names.

:blush:

2 Likes