Pattern sequence string recognition and extraction

KNIMEadventurer77 · May 30, 2020, 3:33am

Hi everyone!

I am curious if anyone is aware of a method to do the following:

Analyze a column with rows filled with strings such as:

Row 1: 77XYZ21
Row 2: XYZ277
Row 3: 244XYZ
Row 4: 7889922

“Learn” any time a sequence of characters within a string is repeated in more than one row. In this case it would be XYZ, which appears in rows 1, 2, and 3.

Extract these sequences into a new column.
Then I would like to run an analysis of the prevalence of each extracted sequence in the original column. In the simple four row example I provided, XYZ would be prevalent in 75% (3/4) of records.
Finally, I would like to identify all rows whose strings share no sequence with any other row (i.e. completely unique).
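
To make the steps above concrete, here is a minimal sketch in plain Python rather than KNIME nodes. It is a brute-force illustration, not a KNIME feature: the function name `shared_substrings` and the `min_len=3` cutoff are assumptions made for this example. The cutoff matters, because with a minimum length of 2 the digits "77" would also count as a shared sequence between rows 1 and 2.

```python
from collections import defaultdict

def shared_substrings(rows, min_len=3):
    """Map each substring of at least min_len characters to the set of
    row indices containing it, keeping only substrings found in more
    than one row. min_len is an assumed cutoff, not part of the question."""
    seen = defaultdict(set)
    for idx, text in enumerate(rows):
        # Enumerate every substring of text with length >= min_len.
        for start in range(len(text)):
            for end in range(start + min_len, len(text) + 1):
                seen[text[start:end]].add(idx)
    return {sub: idxs for sub, idxs in seen.items() if len(idxs) > 1}

rows = ["77XYZ21", "XYZ277", "244XYZ", "7889922"]

# Step 1 and 2: "learn" and extract the sequences shared between rows.
shared = shared_substrings(rows)

# Step 3: prevalence of each shared sequence as a fraction of all rows.
prevalence = {sub: len(idxs) / len(rows) for sub, idxs in shared.items()}

# Step 4: rows that share no sequence with any other row.
covered = set().union(*shared.values()) if shared else set()
unique_rows = [text for idx, text in enumerate(rows) if idx not in covered]

print(prevalence)
print(unique_rows)
```

Note that overlapping sequences such as XYZ2 are reported alongside XYZ, so a real analysis would probably need a rule for picking the most useful one. The approach is also quadratic in string length, so it is only practical for short strings.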

Thanks!

armingrudd · May 30, 2020, 5:06am

Hi KNIMEadventurer77 and welcome to the KNIME forum,

You can use the Regex Extractor node to extract the sequence and then the GroupBy node to calculate the count and percentage for each extracted sequence (a count of 1 means the sequence is unique).
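
For readers who want to see the extract-then-group logic spelled out, here is a rough Python equivalent. It is only a sketch and not taken from the attached workflow: the regex `[A-Z]+` (a run of uppercase letters) is an assumption that happens to fit the sample data, and treating rows with no match as unique is likewise an assumption.

```python
import re
from collections import Counter

rows = ["77XYZ21", "XYZ277", "244XYZ", "7889922"]

# Assumed pattern: the sequence of interest is a run of uppercase letters.
pattern = re.compile(r"[A-Z]+")

# Extraction step: first match per row, or None when nothing matches.
extracted = []
for text in rows:
    match = pattern.search(text)
    extracted.append(match.group(0) if match else None)

# Grouping step: count per extracted sequence, and percent of all rows.
counts = Counter(seq for seq in extracted if seq is not None)
percent = {seq: 100 * count / len(rows) for seq, count in counts.items()}

# Rows whose sequence occurs only once, or that have no sequence at all
# (the latter is an assumption about how missing values should be treated).
unique_rows = [
    text
    for text, seq in zip(rows, extracted)
    if seq is None or counts[seq] == 1
]

print(counts)
print(percent)
print(unique_rows)
```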

Here is an example workflow:

24245-1-1.knwf (28.0 KB)

Let me know if you have any other questions.

FYI: the Regex Extractor node requires the Palladian for KNIME extension, which you can install from this update site (for KNIME 4.1.x):
https://download.nodepit.com/palladian/4.1
