RegEx Splitting to Form Lists - Repeating Groups

CobusSmit · August 31, 2018, 7:41am

Hi,

I’ve been following the RegEx discussions on the forum, and I am a bit challenged by the repeating group approach. I have a basic problem where I have a list of items that I need to split up.

Example of code list: SVV11283687 , ICC472987789, EVT3276428373

A simple Cell Splitter could do the job, using “,” as the delimiter to split out a list. The problem is that the above example is the best-case scenario. The majority of cases I deal with have “junk text” thrown in around the codes, but the code patterns (e.g. SVV00000000) remain constant. I have tried focusing on the separators and cleaning the text around the codes, but the variation is huge, and it creates problems where the order of the cleaning steps also needs to be considered. My “gut feel” tells me it would be better to just extract the codes.

RegEx has the notion of repeating groups, and I imagine I need to identify groups to create columns that I can later concatenate to form a list that I can split out. The problem is that RegEx Splitter doesn’t seem to provide a nice way to get the following result:

Example
<Group 1> SVV11283687 <Group 2> ICC472987789 <Group 3> EVT3276428373

Does anyone have a view on how to get this broken up into groups so I can make a list? (Note: Some entries have more than 3 codes - could go up to 9 codes with varying order).
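
A note on why a single repeated group does not produce one column per code: in Java's java.util.regex, a capturing group that sits inside a repetition keeps only the text of its last iteration. The sketch below illustrates that behaviour in plain Java. The class and method names are made up for illustration, and it is an assumption (not confirmed in this thread) that RegEx Splitter follows the same Java semantics.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of KNIME.
class RepeatedGroupDemo {
    // One capturing group wrapped in a repetition: (?: (code) separators )+
    private static final Pattern REPEATED =
            Pattern.compile("(?:([A-Z]{3}\\d{8,})\\W*)+");

    // Returns what group 1 holds after the whole list has been matched.
    static String lastCaptured(String text) {
        Matcher m = REPEATED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // The pattern declares exactly one group, however often it repeats.
    static int declaredGroups() {
        return REPEATED.matcher("").groupCount();
    }

    public static void main(String[] args) {
        String list = "SVV11283687 , ICC472987789, EVT3276428373";
        System.out.println(declaredGroups());   // 1
        System.out.println(lastCaptured(list)); // EVT3276428373
    }
}
```

Only the final code survives in group 1, so getting every code requires either one explicit group per expected code or repeated matching of a single-code pattern.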

MH · August 31, 2018, 2:09pm

I think you generally have two options.
You can use a Java Snippet and extract your groups with the Pattern class:

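// Needs imports: java.util.ArrayList, java.util.List,
// java.util.regex.Matcher, java.util.regex.Pattern
// Three capital letters followed by at least eight digits
Pattern p = Pattern.compile("[A-Z]{3}\\d{8,}");
Matcher m = p.matcher(YOUR_TEXT_COLUMN);
List<String> results = new ArrayList<>();
// Each call to find() moves on to the next match in the text
while (m.find()) {
    results.add(m.group());
}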
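
The loop above can also be tried outside KNIME. Here is a minimal self-contained sketch of the same find() loop; the class name, the method name, and the messy sample string are made up for illustration, and only the three codes come from the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not a KNIME API.
class CodeExtractor {
    // Same pattern as in the snippet: 3 capital letters, then 8 or more digits.
    private static final Pattern CODE = Pattern.compile("[A-Z]{3}\\d{8,}");

    // Collects every code found in the text, in order of appearance.
    static List<String> extractCodes(String text) {
        List<String> results = new ArrayList<>();
        if (text == null) {
            return results; // treat a missing cell as "no codes"
        }
        Matcher m = CODE.matcher(text);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        // Invented "junk text" around the codes from the question.
        String messy = "ref: SVV11283687 / see also ICC472987789; old=EVT3276428373 (tbc)";
        // Join with a comma so a Cell Splitter could split it afterwards.
        System.out.println(String.join(",", extractCodes(messy)));
        // prints SVV11283687,ICC472987789,EVT3276428373
    }
}
```

Because the pattern looks only for the codes themselves, the surrounding separators and junk text never need to be cleaned, and the number of codes per entry does not have to be known in advance.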

Or, if you prefer to do it without "cheating" with a script, you can use the Text Processing nodes:
1.) Convert your text to the document format (Strings To Document).
2.) Use the Wildcard Tagger to tag your groups with the regular expression pattern.
3.) Filter the groups with the Tag Filter node by selecting the tags you assigned in the Wildcard Tagger node.
4.) Use Bag of Words to get your individual groups (use the Term To String node to convert them back to normal strings).

Hope this helps.

CobusSmit · September 1, 2018, 7:20pm

Thanks MH!!
I went with the scripting option you suggested. It works really well. It seems like it could even become a useful standard KNIME node in future. RegEx Splitter should also be able to do this (as a hypothetical configurable option :) ), but scripting will suffice. I will give the Text Processing approach a bash as well.
