How to split a string into rows using regex

Esterojaz · August 25, 2019, 4:52pm

Hello Community, I am relatively new to the KNIME environment, so I would like your help.
I have multiple cells with long text strings made up of comments, and I need to put each comment in its own row. If one record has 5 comments, I need 5 rows, one per comment. Fortunately, each new comment begins with a pattern, which is:
MM-dd-yyyy HH:mm:ss - Name (Work Notes)
08-05-2019 08:33:09 - Esteban Zeledon (Work Notes)

I have tried different options, but I still haven't found a solution. I would really appreciate your help.

One example is below:

08-27-2018 07:24:19 - Esteban Zeledon (Work Notes)
An email has been sent to the user with the answer.

Closing the case.

08-25-2018 06:59:21 - Jose Rojas (Work Notes)
The information has been updated

08-23-2018 09:43:10 - Jose Rojas (Work Notes)
Please refresh the price list

armingrudd · August 25, 2019, 6:19pm

Hi @Esterojaz and welcome to the KNIME community forum,

Just to make sure we can provide you with a precise solution, would you please upload an example dataset here? And please create and send an example of what you expect as the output as well.

The example that you have provided is fine, but I need to know exactly what the input looks like.

Thanks.

Esterojaz · August 25, 2019, 7:03pm

Hello @armingrudd, thank you so much for your help. Attached you can find an example of the input and the desired output. Please let me know if it works for you.
Thanks!

Example.knwf (14.0 KB)

armingrudd · August 25, 2019, 8:12pm

Here is the solution:
Example.knwf (29.8 KB)

To do that, first, I put a delimiter “___” (three underscores) between each comment using this expression in a String Manipulation node:
regexReplace($Comment$, "(?i)\\n([\\d :-]*\\s-\\s[a-z ]*\\s\\(work notes\\))", "___$1")

Then I used a Cell Splitter node to create a list containing the split comments. Finally, an Ungroup node puts each comment in a new row.
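
For readers who want to try the idea outside KNIME, the three steps above (insert a delimiter, split on it, one row per piece) can be sketched in Python. This is only an approximation under stated assumptions: Python's `re` module stands in for the Java regex engine behind the String Manipulation node, `\1` replaces `$1` in the replacement, and the function name `split_comments` is made up for illustration.

```python
import re

# Same pattern as in the String Manipulation node, written as a Python raw string.
# re.IGNORECASE plays the role of the (?i) flag.
HEADER = re.compile(r"\n([\d :-]*\s-\s[a-z ]*\s\(work notes\))", re.IGNORECASE)
DELIMITER = "___"


def split_comments(cell):
    """Hypothetical helper: return one string per comment found in `cell`."""
    # Step 1 (String Manipulation): put the delimiter in front of every header
    # that follows a line break. Python writes the captured group as \1.
    marked = HEADER.sub(DELIMITER + r"\1", cell)
    # Step 2 (Cell Splitter) and step 3 (Ungroup): split on the delimiter
    # and keep each non-empty piece as its own row.
    return [part.strip() for part in marked.split(DELIMITER) if part.strip()]


# The example cell from the first post.
cell = (
    "08-27-2018 07:24:19 - Esteban Zeledon (Work Notes)\n"
    "An email has been sent to the user with the answer.\n"
    "\n"
    "Closing the case.\n"
    "\n"
    "08-25-2018 06:59:21 - Jose Rojas (Work Notes)\n"
    "The information has been updated\n"
    "\n"
    "08-23-2018 09:43:10 - Jose Rojas (Work Notes)\n"
    "Please refresh the price list"
)

rows = split_comments(cell)
for row in rows:
    print(repr(row))
```

The first comment has no line break in front of it, so it gets no delimiter, but it still ends up as the first piece after the split.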

Feel free to ask further questions.

Edit: I modified the regex in my post a bit. The regex pattern inside the workflow works fine as well.

Esterojaz · August 25, 2019, 8:59pm

It's incredible. It works as expected. Thank you so much! I really appreciate your kindness and knowledge, thanks for sharing!!!

Esterojaz · August 28, 2019, 10:47pm

Hello again Armingrudd, I have run some tests and found a limitation in the regex logic: it also splits when a string of digits follows a line break “\n”. I think it treats the digits as part of a date, although they are not.
If you could help me with that, it would be great, to make the logic perfect.
Also, so that I can learn how regex works, I would appreciate it if you could explain the syntax used inside each pair of parentheses.

Attached you can find the workflow with the cell that produces the error.

Thanks in advance!

Example.knwf (25.2 KB)

armingrudd · August 29, 2019, 3:58am

Dear @Esterojaz,

You just have to replace the \\s inside square brackets with a space character:

regexReplace($Comment$, "(?i)\\n([\\d :-]*\\s-\\s[a-z ]*\\s\\(work notes\\))", "___$1")

or

regexReplace($Comment$, "\n([\\d :-]*\\s-\\s[a-zA-Z ]*\\s\\(Work Notes\\))", "___$1")

In the first regex pattern, (?i) makes the pattern case-insensitive. The part inside the outer parentheses is captured so that it can be reused in the replacement as $1. The first pair of square brackets matches any digits, spaces, colons, or dashes, while the second pair matches any letters and spaces.
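
To see why the space character matters, here is a small Python sketch comparing the two variants. Several things here are assumptions: the "loose" pattern with `\s` inside the square brackets is reconstructed from the advice above and may not be the exact earlier pattern, the test cell is hypothetical (the real failing cell is only in the attached workflow), and Python's `re` stands in for KNIME's Java regex engine.

```python
import re

# Assumed earlier variant: \s inside the brackets also matches line breaks,
# so the first bracket pair can run across several lines.
LOOSE = re.compile(r"\n([\d\s:-]*\s-\s[a-z\s]*\s\(work notes\))", re.IGNORECASE)

# Variant from the reply: a literal space inside the brackets,
# so the match has to stay on the header line.
STRICT = re.compile(r"\n([\d :-]*\s-\s[a-z ]*\s\(work notes\))", re.IGNORECASE)

# Hypothetical cell: the first comment ends with a line that holds only digits.
cell = (
    "08-27-2018 07:24:19 - Esteban Zeledon (Work Notes)\n"
    "Ticket reference:\n"
    "123456\n"
    "\n"
    "08-25-2018 06:59:21 - Jose Rojas (Work Notes)\n"
    "The information has been updated"
)

# Insert the delimiter (Python uses \1 where KNIME uses $1), then split on it.
loose_rows = LOOSE.sub(r"___\1", cell).split("___")
strict_rows = STRICT.sub(r"___\1", cell).split("___")

print(repr(loose_rows[1][:20]))   # where the loose variant starts the second row
print(repr(strict_rows[1][:20]))  # where the strict variant starts the second row
```

With the loose variant, the digits line and the blank line after it are swallowed by the first bracket pair, so the delimiter lands in front of "123456" and the number moves into the next comment. With the literal space, the bracket pair cannot cross a line break, so the split happens right before the date.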

Esterojaz · August 29, 2019, 3:22pm

Thank you so much @armingrudd. The Regex Logic is perfect!

I wish you all the best!
