Split by regex

kowisoft · December 6, 2021, 11:11am

Dear KNIMErs,

I have a table that looks like this:

I want to achieve, that the names in column A are “split up” in different rows.

Cell A2 should read: Smith, John
Cell A3 should read: Miller, Peter
Cell A5 should read: Doe, Frank
Cell A6 should read: Davies, Rhonda

Like this:

Some things to note here:

the differentiation between name pairs seems is a comma (",")
the differentiation between first name and last name is a comma and a space (", ")
I already was able to create additional row copies by counting the number of occurrences of the sequence (", ") → comma and space and by using the One to Many Rows Node

Any idea is highly welcome, thanks a lot in advance

Here’s the Excel list:
list-of-names.xlsx (10.6 KB)

gonhaddock · December 6, 2021, 11:20am

Hi @kowisoft
I haven’t got access right now to my PC to test it but, two ‘Cell Splitter’ in a row could do the job.

the first splitting by ", " and replacing by “”
the second one splitting the remaining column with the “,” …

BR

kowisoft · December 6, 2021, 11:26am

Thanks a lot @gonhaddock

The problem is, this splits the cells into new columns but I need separate rows, basically one per split.

ipazin · December 6, 2021, 12:09pm

Hello @kowisoft,

idea is to use Rank node (Ordinal mode) followed by regexReplace() function in String Manipulation node which detects comma which is not preceded nor followed by space and takes first or second name based on previously calculated rank. Here is workflow example:
2021_12_06_Regex_Name_Extraction.knwf (18.6 KB)

In case you can have more than two names in one row regex should be modified.

Br,
Ivan

Kathrin · December 6, 2021, 12:16pm

Hi @kowisoft,

and here a workflow with a different approach using the unpivoting node

Cheers
Kathrin

kowisoft · December 6, 2021, 1:29pm

Thank you @Kathrin

Works wonderful. I used your approach but received some strange duplicates (my “real list” is a little bit more complicated than the example).

But by using a clever combination of String Manipulation and the join function, I could get rid of those.

I guess @ipazin 's solution works as well but I had a hard time to understand the regex.

I try to learn it better so any recommendation of “how to learn regex” is highly appreciated. Took some Udemy courses but they were “meh” (or I was not clever enough, who knows… )

ipazin · December 9, 2021, 11:15am

Hello @kowisoft,

take a look here:

And for testing regex I use this page:

Br,
Ivan

kowisoft · December 13, 2021, 11:56am

Hi @ipazin

I just looked at your provided example and although it also works perfectly (of course )

I still wonder what the second part does.

I read Armin’s article about using Regex in KNIME but I do not understand the second part, the one that goes like

join("$",string($rank$))

My understanding is that this very part creates a “unique ID” created by a $ sign and the rank number. However, looking at the result, this is nowhere to be seen, instead, it outputs the (wanted) name.

While it delivers the desired result…

… I would love to understand how it works

Thank you!

takbb · December 13, 2021, 2:35pm

Hi @kowisoft, the value $1 or $2 in regex translates to “the contents of the numbered ‘capture group’”.

So if the regex has divided the value into separate capture groups,denoted in regex by enclosing each group in round brackets ( ), then replacing the captured groups with $1 will return the first captured group (e.g. “Smith, John”) and $2 will tell it to return the second captured group (e.g. “Miller , Peter”)

You can experiment for yourself…
If for example, instead of using join("$",string($rank$))
the code had said:
regexReplace($Name$,"(.*[^\\s]),([^\\s].*)" ,"$2 : $1" )

The effect of this would be to return the second capture group first, followed by a colon followed by the first capture group:

e.g.

I hope this helps with your $n question

kowisoft · December 13, 2021, 2:37pm

Awesome @takbb it indeed helps to solve the question.

So it basically works like a variable holding the content of the split.

takbb · December 13, 2021, 2:40pm

Yes that’s exactly it. They are like variables, or placeholders…

In some regex implementations I think I’ve seen \1, \2 taking the place of $1, $2, so you may come across variations such as that, but it works the same way

kowisoft · December 13, 2021, 3:34pm

Ok, now I am facing another problem (and start to feel really stupid, because it was going so well in the beginning )

The Rank Node / String Manipulation Node does this only twice, although I have up to 5 ranks.

I initially thought, that the number of how often the regexReplace is executed depends on the number of “dividers” that are in the name cell.

takbb · December 13, 2021, 4:37pm

Hi @kowisoft, I wouldn’t feel stupid about it… the problem is that the data here and the ability to use the rank node is slightly contrived as the number of rows that an occurrence of “Name” appears in, in your initial data set is directly related to the number of individual names within the data.

So the Rank node is then able to provide an ordinal ranking grouped by Name, so assigns 1 for first occurrence, 2 for second and so on. This allows it to work in this situation and provide $1, $2 as required to retrieve each of the individual names from the data.

However, it only works because of the repetition already provided in the supplied data.

If your data had started as simply:

A different approach would be necessary, as rank would just assign 1 to each row, and you’d never get the second name returned.

Ideally, we would want to be able to easily split up any number of delimited names into individual rows, but there are some limitations here with doing this with regexreplace within String Manipulation. As it stands in the above workflow, it only handles at most a pair of names. The presence of some commas in the names which are not delimiters of individual names makes it additionally difficult to split them up using the cell splitter, as it is difficult to differentate “different types” of commas.

It can be done with a java snippet though, and for convenience, I have wrapped this into a component which I have just uploaded to the hub (I based it on a slightly different component I uploaded back in June). I have used it in the workflow below.

The basis of this is replacing commas that have no white space around them with a different delimiter the pipe “|” character, Then a cell splitter can split the names on | and return them as a List data type, which can then be ungrouped into your individual names. As you can see, this then works without the data having to be initially repeated, and with more than two pairs of names in the sample data:

Sample Name splitting.knwf (25.1 KB)

bruno29a · December 14, 2021, 12:48am

Hi @kowisoft , I’m not an expert with Regex myself, but I’ve used this page to tutor myself about it, and I have to say, I’ve become much better since. It has helped me built some Regex for some of the solutions I have provided in the Forum. I’ve only started looking into it a couple of months ago.

@Kathrin , your approach is great, but I would not temporarily replace the commas (,) with a dash (-) since they are names, and names can have dash in them . For example, there are a lot of cases where children adopt both of their parents’ last names with a hyphen (dash) in between. Some have the hyphen for other reasons. It is quite common have a hyphen in the names. Some celebrities with hyphenated names:
Catherine Zeta-Jones
Julia Louis-Dreyfus
Joseph Leonard Gordon-Levitt
Rosie Huntington-Whiteley

These names would not work properly with your workflow. I would probably choose a more safe character instead, such as the tilde (~) or the pipe (|) like @takbb used.

EDIT: Sorry, did not provide the link for the Regex tutorial:
https://dl.icewarp.com/online_help/203030104.htm

Kathrin · December 14, 2021, 7:06am

@bruno29a good point! Thanks for pointing this out!

kowisoft · December 14, 2021, 7:38am

Thanks again to @takbb @bruno29a and @Kathrin

I finally got this working by using the Cell Splitter / Ungroup Node combination described by @takbb above. The key here was to split into a list.

I basically did the same as a preparation step as suggested by @takbb and @bruno29a - I just used String Manipulation Nodes to first replace the “, " (comma with space” with a “#” and the “,” (comma without a space" with a “§”.

Important side node: the sequence of the String Manipulation Nodes was important for me, as I first had to transform the ", " (w/ space) otherwise the second String Manipulation would have caught those as well.

This then made the split easier. A little bit of cleanup (column resorter, rename, filter) at the end and voilà, my list is split automatically.

Saves me another 30 minutes every month.

Happy KNIMEr here!!!

Thank you all guys!!!

ipazin · December 14, 2021, 12:05pm

Hello @kowisoft,

have seen your another (related) topic and question(s) here. Was planning to answer but Brian jumped in. (Probably better for you as I think he explains things very well). Anyways glad you made it work and tnx to @takbb for nice explanation and solution

Br,
Ivan

system · December 21, 2021, 12:05pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.