I have txt files extracted from heterogeneous pdf documents, which I want to analyse for topic extraction. But the texts contain a lot of noise, which significantly degrades the TF-IDF results.
That's why I want to get rid of the noise and only analyse the (relatively small) portions of relevant text. The noise mainly consists of:
- Repeating page header (usually between 1-6 lines)
- Repeating page footer, sometimes with slight differences (e.g. "Filename - Date - Author - Pagenumber" where only the page number changes)
- Text from forms and tables, lists of measurement results, etc., which I don't want to analyse.
I can think of two main approaches:
- Trying to get rid of the noise
(like removing duplicate lines to get rid of the header, and very similar lines to get rid of the footer with the page number)
- Trying to extract only relevant data
(like identifying complete sentences & headings)
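In case it helps to compare: outside of dedicated KNIME nodes, both approaches can be sketched in a few lines of Python (which could also run inside a KNIME Python Script node). This is only a rough sketch under some assumptions: pages are assumed to be separated by a form feed (`\f`) or already split into a list, and the thresholds (`min_ratio`, the 4-word minimum) are hypothetical starting values, not tuned recommendations.

```python
import re
from collections import Counter

def strip_repeated_lines(pages, min_ratio=0.5):
    """Approach 1: drop lines that repeat (near-)identically on many pages.

    Digits are normalised to '#' so footers like 'Page 3' and 'Page 17'
    count as the same line, which catches page-number footers.
    """
    def key(line):
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # set() so a line repeated within one page counts only once
        counts.update({key(l) for l in page.splitlines() if l.strip()})

    # A line is "noise" if it appears on at least half the pages (tunable).
    threshold = max(2, int(len(pages) * min_ratio))
    noise = {k for k, c in counts.items() if c >= threshold}

    return ["\n".join(l for l in page.splitlines() if key(l) not in noise)
            for page in pages]

def keep_sentences(text):
    """Approach 2: crude heuristic that keeps only prose-like lines.

    A line counts as a sentence if it has several words and ends with
    sentence punctuation; table rows and form fields usually fail both.
    """
    keep = []
    for line in text.splitlines():
        s = line.strip()
        if len(s.split()) >= 4 and s[-1] in ".!?:":
            keep.append(s)
    return " ".join(keep)
```

In my experience the frequency-based filter (approach 1) is the safer first step, because it only removes text you can prove is repeated; the sentence heuristic (approach 2) risks throwing away short but relevant headings, so the thresholds need tuning per document collection.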
What do you think might be the better approach? How could I implement this in KNIME? Any other tips or tricks for working with very noisy text files?
Cheers & thanks,
PS: Keygraph gets much better results, but still not perfect, and I don't know how to improve it further...
PPS: Here's a screenshot of my current workflow. Comments welcome, but please be aware that I'm pretty new to KNIME... ;-)