Tricky Loop

Hi,

I would like to know if KNIME is able to manage this kind of tricky loop.

The goal is to extract every occurrence of the word THE, together with the word that follows it, from a free text. I'm able to do this using several combinations of RegexSplit and StringManipulation (please see the attached workflow).

I would like to get the same result using a loop, without repeating the combination of RegexSplit and StringManipulation nodes N times (with N = the max number of 'THE' in the sentence).
Could you please help me?

Thanks in advance.

This is IMO not a case for Loop but is best tackled with a Bag of words type approach because it is fastest and easiest to maintain. I've checked your workflow, let me just detail my solution again (step by step this time):

- String Manipulation: regexReplace($Text$, "( )+(the|The|THE)+( )+", "_the_") -> if you want only the following word together with "the" (and not the preceding word), then drop the ( )+ before (the|...);

- Cell Splitter: delimiter = <space character> , output as new columns ;

- Unpivoting: Value Columns, use wildcard *Arr*  and for Retained Columns, keep Text ;

- String Manipulation: toBoolean(indexOf($ColumnValues$, "_the_")!=-1) -> identify the rows containing "the", append as new variable, e.g. contains_the ;

- Row Filter: keep the rows with true for the previously created variable ;

- If it is really important to you, use String Manipulation to replace _ by a space again (it probably looks nicer): replaceChars($ColumnValues$, "_", " ") ;

- If you want to go back to a structure where each text represents a single row, you can use Pivot thereafter: Groups = RowIDs and Text, Pivots = ColumnNames, Manual Aggregation = ColumnValues, keep original names;

P.S.: it looks like you may want to clean the text from punctuation and alike before the whole process, i.e. if you want to have a clean output.
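For readers who want to see the logic of the steps above outside KNIME, here is a minimal Python sketch. It is not the workflow itself, only an illustration: the function name `the_phrases` is made up, the list comprehension stands in for Unpivoting plus Row Filter, and it assumes the text contains no underscores of its own. As in the KNIME version, the original casing of "the" is lost because every match is replaced by "_the_".

```python
import re

def the_phrases(text):
    # Step 1 (String Manipulation): glue "the" to its neighbours with "_"
    marked = re.sub(r"( )+(the|The|THE)+( )+", "_the_", text)
    # Step 2 (Cell Splitter): split on the space character
    tokens = marked.split(" ")
    # Steps 3-5 (Unpivoting, contains check, Row Filter): keep marked tokens
    hits = [t for t in tokens if "_the_" in t]
    # Step 6 (optional): turn the underscores back into spaces
    return [t.replace("_", " ") for t in hits]

print(the_phrases("I saw the cat and THE dog"))
```

With the leading `( )+` kept in the pattern, each hit also carries the word before "the", which matches the remark in the first step.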

Many thanks Geo!

I've opened this new post to understand how to perform this kind of manipulation with a loop, because I have other kinds of data manipulation that need a loop and I would like to learn how to do these things that way. Moreover, the content of my Text can be very long (I have only posted some short fake data examples). I think your approach of splitting each word, then pivoting and unpivoting, works well with my fake example but could be very memory/CPU expensive on a large dataset with long texts. But I appreciate your effort a lot!

Do you (or someone else) know how to use a loop to avoid chaining several RegexSplit and StringManipulation nodes?

Thanks in advance!

Use the recursive loops

Hi aborg, thanks.

As you can see in the workflow attached I tried to use the recursive loops but I didn't get the expected result.

Please may I ask you to verify what I'm doing wrong?

Thank you in advance

Hi iiiaaa (I always wondered what this stands for :))

You need to be careful about what you send back. As you are sending back the split_0 columns, in the second iteration the Regex Split renames the split_0 column to split_0(#1), and the String Manipulation node replaces the text with the split_0 value (which was already replaced in the first iteration).

Also you are using a Quickform Column Filter instead of the real one.

My suggestion: connect the first inport of the loop end to the String Manipulation output, and connect the second one to a Column Filter that passes on only the text. You could even filter out all empty strings and finish the loop when the table is empty.
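To make the data flow of that suggestion concrete, here is a rough Python sketch of the same idea. It only mimics the logic and is not how the KNIME nodes are implemented: `collected` plays the role of the first (collecting) port, `remaining` is what the second port feeds back into the next iteration, and the loop stops once nothing is left. The names `PATTERN` and `extract_all` and the regex itself are assumptions made for the illustration.

```python
import re

# Assumed pattern: the word "the" (any casing) plus the following word
PATTERN = re.compile(r"\b(the)\s+(\w+)", re.IGNORECASE)

def extract_all(texts):
    collected = []            # rows gathered at the first port
    remaining = list(texts)   # rows sent back through the second port
    while remaining:          # finish when the fed-back table is empty
        next_round = []
        for text in remaining:
            match = PATTERN.search(text)
            if match is None:
                continue      # no further "the": drop the row
            collected.append(match.group(0))
            rest = text[match.end():]
            if rest.strip():  # filter out empty strings
                next_round.append(rest)
        remaining = next_round
    return collected

print(extract_all(["The cat saw the dog", "no match"]))
```

The important point is the same as in the workflow: only the not yet processed remainder of the text goes back into the loop, never the already extracted value.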

Cheers, Iris

I couldn't resist, the corrected workflow is attached.

Hi Iris,

Do you have something against my very beautiful name? I'm joking, it's only a "quick to type" nickname ;)

Thank you very much. As usual you are very effective!

Cheers

Now I'd be interested in a performance comparison (cf. Pivot - Unpivot vs Loop), just out of curiosity :-)

@iiiaaa, nothing against!!! It's just my initials, but repeated, so ;)

@Geo: and did you make a performance comparison?

@iris: No, I haven't unfortunately, absent the large dataset to which iiiaaa referred. Would you happen to have a guess about the expected outcome, Iris ?

P.S.: I've attached the workflow for the alternative approach.

Thank you very much Geo. Your approach is really interesting. The only reason I didn't use it is that the original punctuation (comma, dot, etc.) is replaced by a space. But it's a great approach!

I agree, once you want to keep all those punctuation characters, the first regex becomes pretty complex. The best remedy I can come up with is regexReplace($Text$, "(the|THE|The){1}([\\p{P}])*([ \\$]|$)+", " $1$2_") for the first String Manipulation and then regexMatcher($ColumnValues$, "(the|THE|The)+(\\p{P})*(_)+(.)*") in the second String Manipulation.
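As a rough illustration of what these two expressions do, here is a Python approximation. Several things are assumptions: Python's `re` module has no `\p{P}`, so ASCII punctuation from `string.punctuation` stands in for it (this also covers symbols such as `$` and `+`, which `\p{P}` does not); regexMatcher is assumed to test the whole cell, hence `fullmatch`; and the names `MARK`, `KEEP` and `the_phrases_keep_punct` are made up.

```python
import re
import string

# ASSUMPTION: ASCII punctuation as a stand-in for Java's \p{P}
P = re.escape(string.punctuation)

# First String Manipulation: "the" + optional punctuation + spaces
MARK = re.compile(rf"(the|THE|The)([{P}])*([ $]|$)+")
# Second String Manipulation: cells that carry the "_" marker after "the"
KEEP = re.compile(rf"(the|THE|The)+([{P}])*(_)+(.)*")

def the_phrases_keep_punct(text):
    # Equivalent of the replacement " $1$2_"
    marked = MARK.sub(r" \g<1>\g<2>_", text)
    tokens = marked.split(" ")
    # Keep matching cells and turn the marker back into a space
    return [t.replace("_", " ") for t in tokens if KEEP.fullmatch(t)]

print(the_phrases_keep_punct("I saw the, cat and THE dog."))
```

Note that the first pattern has no word boundary, so it would also fire on words that merely end in "the"; that behaviour is carried over from the expression as posted.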

Here attached you'll find yet another alternative (KNIME's text processing in action). It basically protects one word including punctuation and space characters following the word "the" and then filters out the rest of the text. The regex in the beginning is the only complex item.