Normalizing Incomplete Text Fields and gathering counts


I'm having a challenge to process and perform counts on text fields. I have about ~100k rows, Below is a snapshot of the data set

DocNo          Section Reference                              Description

111                                                          This is case1

124                Chap 4 Section                      This is case2

100                Chapter 2 Section 2                            This is case3

1985              S2.7.2.2                                                 This is case4

The objective is the find the count of Section References for each document, Presently I'm using Regex Expression to capture valid refereces (like ) and using cell splitter to put them in seperate columns and doing the counts with drill down (using Tableau Hierarchy feature), but the I'm not sure how do I handle case 2,3 and 4 with all the text fields in it.

Is there a good(or naive) solution in KNIME I can implement, like fuzzy match etc. and get counts for all section references, it doesnt have to be efficient or optimized.

Appreciate any leads, Thanks!

Hi Mohammed,

you could use the Regex Split node to extract groups defined via regex e.g. ".*(\d\.\d\.\d\.\d).*". This allows to extract more then one valid references in one string, and also to handle references like in document 1985 However, it will not fix your problem for document 100. Therefore you would need another regex.

Kind regards,


Thanks Kilian. Partially solves my problem. Let me give this a try.