Text Mining

Hi Knimers,
I am looking for a way to extract the GCV value and its tolerance from a sentence.
Background:
The data I am using has a free-text field in which the GCV value and tolerance limit are entered as part of the text. However, the text can be anything that users enter (examples listed below). I need to extract the GCV value and the tolerance limit from it into two columns. Any help would be greatly appreciated.

Example 1: Type: Mustard Husk Briquettes GCV: 3600 KCAL /KG. Ash: 12% (+/-2%) Moisture: Upto 10% Loading & unloading scope of vendor.
==> We need 3600 under GCV column and 0 under Tolerance Column

Example 2: Mustard Husk Briquettes GCV: 3600 KCAL (+/-200) Ash: 12% (+/-2%) Moisture: Upto 10% (+/-2%) Packing : 30-50 Kg. Loading & unl
==> Here, we need 3600 under GCV column and 200 under Tolerance Column.

Example 3: SUPPLY OF BIOMASS BRIQUETTE OF 100% SAWDUST DUST TO PROVIDE MINIMUM CALORIFIC VALUE OF 4200 (UOM - KGS) ( HSN CODE : 4401 )
==> 4200 under GCV and 0 under Tolerance
Example 4: TRIAL ORDER. 100% SAWDUST BRIQUETTES. QUALITY PARAMETERS: MIN.GROSS CALORIFIC VALUE:4000 +/-200 KCAL/KG MOISTURE CONTENT: 7-10 % AS
==> 4000 under GCV and 200 under Tolerance.
Example 5: Special Instructions : - 1. The GCV of Briquettes should be Not less than 3500 Kcal/Kg, if it is more than specification no extra
==> 3500 under GCV and 0 under Tolerance

As the examples above show, there are a lot of such variants. I was thinking of using NLP (something similar to named entity recognition) to get the GCV and tolerance values into two columns. I would like to understand the best way to do this in KNIME.

Thanks and regards
Deepan

Hi @Deepan,

Sorry for the late response. I will have a look.

At first glance, I would say we should focus on preprocessing here. Try to remove most of the words that are unrelated. For example, tag the words you want to keep, such as "gcv" or "gross calorific value", using the Dictionary Tagger or the Wildcard Tagger. Use the POS Tagger to identify numbers in the text, and afterwards filter out everything that was not tagged by either of the two taggers (Tag Filter). You might also need to filter out numbers followed by a percentage symbol beforehand, e.g. using regex.
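
To make the regex part of this more concrete, here is a minimal Python sketch that could, for instance, be tried in a Python Script node. It is not a Text Processing workflow and not a general solution: the function name `extract_gcv` and all three patterns are assumptions tuned only to the five examples in this thread. It looks for a "GCV" or "calorific value" keyword, takes the first 3 to 5 digit number after it that is not a percentage, and then checks whether a "+/-" figure directly follows that number (optionally after the unit). If no such figure follows, the tolerance is 0.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns, tuned only to the examples in this thread.
KEYWORD = re.compile(r"\b(?:GCV|calorific\s+value)\b", re.IGNORECASE)
# A standalone 3-5 digit number that is not followed by a percent sign.
VALUE = re.compile(r"\b(\d{3,5})\b(?!\s*%)")
# Optional unit and "(", then "+/-" and a number that is not a percentage.
TOLERANCE = re.compile(
    r"\s*(?:KCAL(?:\s*/\s*KG)?)?\.?\s*\(?\s*\+\s*/\s*-\s*(\d+)\b(?!\s*%)",
    re.IGNORECASE,
)


def extract_gcv(text: str) -> Optional[Tuple[int, int]]:
    """Return (gcv, tolerance), or None if no GCV value is found."""
    keyword = KEYWORD.search(text)
    if keyword is None:
        return None
    after_keyword = text[keyword.end():]
    value = VALUE.search(after_keyword)
    if value is None:
        return None
    # The tolerance must start right after the value (and optional unit).
    after_value = after_keyword[value.end():]
    tolerance = TOLERANCE.match(after_value)
    tolerance_value = int(tolerance.group(1)) if tolerance else 0
    return int(value.group(1)), tolerance_value


examples = [
    "GCV: 3600 KCAL (+/-200) Ash: 12% (+/-2%)",
    "MINIMUM CALORIFIC VALUE OF 4200 (UOM - KGS)",
]
for text in examples:
    print(extract_gcv(text))  # prints (3600, 200), then (4200, 0)
```

Anchoring the tolerance to the position right after the value is what keeps the "(+/-2%)" of the ash and moisture fields from being picked up. Texts that phrase things differently (other keywords, other units, a tolerance written before the value) would need additional patterns.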

I will have a deeper look, but I wanted to give some pointers first. :slight_smile:

Best,
Julian
