Normalizing Incomplete Text Fields and gathering counts

mohammedayub · September 29, 2016, 4:58pm

Hi,

I'm having a challenge to process and perform counts on text fields. I have about ~100k rows, Below is a snapshot of the data set

DocNo Section Reference Description

111 2.2.3.4 This is case1

124 Chap 4 Section 4.4.4.2 This is case2

100 Chapter 2 Section 2 This is case3

1985 S2.7.2.2 This is case4

The objective is the find the count of Section References for each document, Presently I'm using Regex Expression to capture valid refereces (like 2.2.3.4 ) and using cell splitter to put them in seperate columns and doing the counts with drill down (using Tableau Hierarchy feature), but the I'm not sure how do I handle case 2,3 and 4 with all the text fields in it.

Is there a good(or naive) solution in KNIME I can implement, like fuzzy match etc. and get counts for all section references, it doesnt have to be efficient or optimized.

Appreciate any leads, Thanks!

kilian.thiel · October 4, 2016, 11:14am

Hi Mohammed,

you could use the Regex Split node to extract groups defined via regex e.g. ".*(\d\.\d\.\d\.\d).*". This allows to extract more then one valid references in one string, and also to handle references like in document 1985 However, it will not fix your problem for document 100. Therefore you would need another regex.

Kind regards,

Kilian

mohammedayub · May 15, 2017, 4:31am

Thanks Kilian. Partially solves my problem. Let me give this a try.

-Mohammed