Cleaning text with lots of unwanted elements/noise

holbue · November 7, 2015, 12:07pm

Hi,

I have txt files extracted from heterogeneous PDF documents, which I want to analyse for topic extraction. But the texts contain a lot of noise, which noticeably degrades the TF-IDF results.

That's why I want to get rid of the noise and analyse only the (relatively small) portions of relevant text. The noise mainly consists of:

Repeating page headers (usually 1-6 lines)
Repeating page footers, sometimes with slight differences (e.g. "Filename - Date - Author - Pagenumber", where only Pagenumber changes)
Text from forms and tables, lists of measurement results etc., which I don't want to analyse.

I can think of two main approaches:

Trying to get rid of the noise
(e.g. removing duplicate lines to get rid of the header, and very similar lines to get rid of the footer with the page number)
Trying to extract only relevant data
(like identifying complete sentences & headings)
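
To make the two approaches concrete, here is a minimal Python sketch outside of KNIME. It rests on assumptions that are not part of my setup: the text is already split into pages (e.g. on a form feed character, if the PDF extractor emits one), and the names `normalize`, `drop_repeated_lines`, `looks_like_sentence` as well as the thresholds are purely illustrative. Replacing digits with a placeholder makes footers that differ only in the page number compare as equal.

```python
import re
from collections import Counter
from typing import List


def normalize(line: str) -> str:
    """Replace digit runs with '#' and collapse whitespace, so that
    'Page 3' and 'Page 12' map to the same key."""
    line = re.sub(r"\d+", "#", line.strip())
    return re.sub(r"\s+", " ", line)


def drop_repeated_lines(pages: List[List[str]], min_share: float = 0.5) -> List[List[str]]:
    """Approach 1: remove lines whose normalized form appears on at least
    min_share of all pages (and on at least two pages)."""
    counts = Counter()
    for page in pages:
        # Count each normalized, non-empty line once per page.
        counts.update({normalize(line) for line in page if line.strip()})
    threshold = max(2, min_share * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [[line for line in page if normalize(line) not in repeated] for page in pages]


def looks_like_sentence(line: str, min_words: int = 5, min_alpha: float = 0.7) -> bool:
    """Approach 2: crude heuristic that keeps lines with enough words,
    closing punctuation, and mostly letters (table rows tend to fail this)."""
    text = line.strip()
    if len(text.split()) < min_words or not text.endswith((".", "!", "?", ":")):
        return False
    letters = sum(ch.isalpha() or ch.isspace() for ch in text)
    return letters / len(text) >= min_alpha


# Hypothetical example data: three pages with the same header and a numbered footer.
pages = [
    ["Report X", "Some body text here.", "report.pdf - 2015 - Holger - 1"],
    ["Report X", "Other content.", "report.pdf - 2015 - Holger - 2"],
    ["Report X", "More text", "report.pdf - 2015 - Holger - 3"],
]
cleaned = drop_repeated_lines(pages)
```

Both ideas could be combined: first drop the repeated lines, then keep only what passes the sentence check. The thresholds would need tuning per document type, and headings would need their own rule since they usually lack closing punctuation.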

Which do you think is the better approach? How could I implement this in KNIME? Any other tips or tricks for working with very noisy text files?

Cheers & thanks,

Holger

PS: Keygraph gives much better results, but they are still not perfect, and I don't know how to improve them further...

PPS: Here's a screenshot of my current workflow. Comments are welcome, but please be aware that I'm pretty new to KNIME... ;-)

kilian.thiel · November 16, 2015, 3:28pm

Hi Holger,

Are the PDF files (and also the converted txt files) structured in a similar way? If so, you could read the files with e.g. the Line Reader node and clean the data line by line, based on e.g. header and footer keywords. For that you could use e.g. Row Filtering, String Manipulation, or rule-based filtering.

Finally, you could append a column with a constant value (Constant Value node) and use the GroupBy node to concatenate all lines (rows) into one string cell, which is then used as the text.
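
For illustration only, the same read, filter and concatenate pipeline can be sketched in plain Python. This is not a KNIME workflow; the function name `clean_document` and the example keywords are hypothetical, and a simple case-insensitive substring match stands in for whatever filter rules fit your documents.

```python
from typing import Iterable, List


def clean_document(lines: Iterable[str], noise_keywords: List[str]) -> str:
    """Drop empty lines and lines containing any noise keyword
    (case-insensitive), then join the remaining lines into one string."""
    lowered = [kw.lower() for kw in noise_keywords]
    kept = [
        line.strip()
        for line in lines
        if line.strip() and not any(kw in line.lower() for kw in lowered)
    ]
    # Equivalent in spirit to grouping all rows and concatenating them.
    return " ".join(kept)


# Hypothetical input: one line per row, as a line-by-line reader would deliver it.
lines = [
    "ACME Lab Report",
    "The sample was stable.",
    "",
    "Confidential - Page 3",
    "No drift was observed.",
]
text = clean_document(lines, ["acme lab report", "confidential"])
```

The keyword list is the part that depends on how similar the documents really are; with very heterogeneous files it may need to be maintained per document type.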

Cheers, Kilian
