Split the Cells Contains Name and Title (String Value)

chococierry · March 25, 2023, 7:51pm

Hello, maybe it’s a easy question. I’ve already search in the forum but I still cannot solve the problem. I want to split the cell, but I have difficulty to do that. Here is my data (example):

Source:
|No|Class|Teacher|
|1|AB|Alexander, Ph.D. Dr.Ivan.|
|2|CD|Pita, M.B.A. Rosevelt, MSc. Prof. Donald, M.T.I.|
|3|EF|Prof.Dr.Sanny. Evelyn, M.B.A. Corazon, Ph.D. Soraya, M.Ak.|

Output
|No|Class|Teacher|
|1|AB|Alexander, Ph.D.|
|2|AB|Dr.Ivan.|
|3|CD|Pita, M.B.A.|
|4|CD|Rosevelt, MSc.|
|5|CD|Prof. Donald, M.T.I.|
|6|EF|Prof.Dr.Sanny.|
|7|EF|Evelyn, M.B.A.|
|8|EF|Corazon, Ph.D.|
|9|EF|Soraya, M.Ak.|

I can’t do it with Cell Splitter and I still confuse using Regex Split. Can anyone help me, please. Thank you.

ArjenEX · March 25, 2023, 11:49pm

Hi @chococierry

The names are written in many different ways which makes consistent cell splitting difficult. One way to approach it is through regex extraction by using the Regex Extractor node (Regex Extractor — NodePit ).

With this given small dataset, the following Regex generates the desired output:
(?:[A-Z][a-z].|Prof.)+[A-Z][a-z.]+\b|(?:[A-Za-z]+|Prof. )+, (?:[A-Za-z.]+)

Make sure to set the output to rows.

The expression might need further work to capture more nomenclatures depending on your full dataset.

Hope this helps!

chococierry · March 26, 2023, 12:17am

Hi @ArjenEX thank you so much.

My question is, what did you choose ‘Prof’ instead of other title?
And what if the name have more than one ‘word’, for example Alexander Jovanka, Ph.D. My mistake that in my previous example, the name only one single name.

ArjenEX · March 26, 2023, 12:21am

The ones that start with Prof were the only ones not captured by the main “rule” so I added that as a standalone OR statement. Hence also my comment about expanding the rule so it capture all cases. You can just keep stacking rules until you have captured all known nomenclatures.

chococierry · March 26, 2023, 12:49am

Thank you again @ArjenEX

But what if there are row that only contain one name? Let say the data gonna be like this.

Source:
|No|Class|Teacher|
|1|AB|Alexander Jovanka, Ph.D. Dr.Ivan Anderson.|
|2|CD|Pita, M.B.A. Rosevelt, MSc. Prof. Donald, M.T.I.|
|3|EF|Prof.Dr.Sanny. Evelyn, M.B.A. Corazon, Ph.D. Soraya, M.Ak.|
|4|GH|Mike Andrew, M.Ed.|

Output
|No|Class|Teacher|
|1|AB|Alexander, Ph.D.|
|2|AB|Dr.Ivan.|
|3|CD|Pita, M.B.A.|
|4|CD|Rosevelt, MSc.|
|5|CD|Prof. Donald, M.T.I.|
|6|EF|Prof.Dr.Sanny.|
|7|EF|Evelyn, M.B.A.|
|8|EF|Corazon, Ph.D.|
|9|EF|Soraya, M.Ak.|
|10|GH|Mike Andrew, M.Ed.|

Thank yoou…

ArjenEX · March 26, 2023, 11:37am

Hi @chococierry

That does not matter for the regex. Your latest output is a bit confusing though. You changed Alexander, Ph.D. to Alexander Jovanka, Ph.D. but the desired output stays Alexander, Ph.D.. is that correct?

If you want to capture all names in full, which I tihnk your objective is, you can use:
(?:[A-Za-z]+|Prof. |[A-Z][a-z]+ [A-Za-z]+)+, (?:[A-Za-z.]+)|(?:[A-Z][a-z]+[.][A-z]+[ |.][A-Z][a-z]+)

chococierry · March 28, 2023, 5:19am

Sorry, the desired output is extract the full name with its title.

chococierry · March 28, 2023, 5:21am

Thank @ArjenEX

If there any dictionary or library where I can learn about Regex Syntax? Sorry if I’m asking too much question. I am newbie with this thing. Anyway, thank you for your kind reply. It’s really help me.

ArjenEX · March 28, 2023, 7:04am

If it helped you please mark it as solution so that others can also benefit from this more easily

I can recommend https://regex101.com/. It also has a code generator feature that will create the correct syntax to use the expression in KNIME.

gonhaddock · March 28, 2023, 1:38pm

Hello @chococierry
The current @ArjenEX proposed solution, requires some more regex, aiming to tie back the Teacher’s list back to number and class columns. Mi final solution isn’t either simple, i had to delete previous post because it didn’t cover all occurrences in the expected output.

As the source text is not ordered (name-title order, punctuations…), it implies some literally in regex coding. Some more causalities can happen in your real data.

Alternatively you can test the following code in a ‘String Manipulation’ node:

https://regex101.com/

regexReplace(
	regexReplace(
		regexReplace(
			regexReplace($Teacher$, "(?:(Ph.Dr.\|Ph.D.\|Dr.\|MSc.\|M.B.A.\|M.Ed.\|Prof.Dr.\|Prof.\|M.Ak.)(\\S+)|(.*?$)\\r\\n?|((.+?)(Ph.Dr.\|Ph.D.\|Dr.\|MSc.\|M.B.A.\|M.Ed.\|Prof.Dr.\|Prof.\|M.Ak.)))|(.*?$)", "$1$6 $2$5$7;")
			, "; ;", "")
		, " ; ", "")	
	, ",", "")

After this regex manipulation, you’ll have to split it (Cell Splitter, ‘;’), and unpivot resulting arrays. Because of the workflow complexity, I’m uploading it, so you can inspect and edit it.

20230328_split_name_and_title_v0.knwf (36.5 KB)

BR

system · June 26, 2023, 1:39pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.