Extract text from PDFs

belgarath801 · November 13, 2017, 2:52pm

Hello,

I am new to KNIME, so I'm not sure whether this question belongs here. I am trying to extract some phrases from PDF files. I use PDF Parser, then Dictionary Tagger + Table Creator, and I get a Documents output table. Then I try to extract the tagged words using Row Filter or Modifiable Term Filter, but I think these nodes exclude the tagged words. Am I right? If so, how should I proceed? I also read something about a "General Tag Filter", but it does not appear in my list of installed nodes and I cannot find it under that name in "Install KNIME Extensions".

Thanks for your help

daria.goldmann · November 21, 2017, 12:10pm

Hi Belgarath,

When you use Row Filter to extract tags, you are filtering the rows of your table (whole documents) rather than the content inside your documents, and I don't think that is what you want.

Have you tried using the Tika Parser node to read the PDF files, followed by the Strings to Document node, the Dictionary Tagger node, and Tag Filter? To view the documents and the changes you make along the way, e.g. to examine which phrases were tagged or filtered, you could use the Document Viewer node.

The "General Tag Filter" node you are looking for is the Tag Filter node, which is available through the Text Mining extensions in KNIME Labs (you probably have those installed already).
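
To make the difference between filtering rows and filtering tagged terms more concrete, here is a rough plain-Python sketch. It is only an analogy and not KNIME code: the `Term` class, the helper functions, the tag name "PART", and the sample documents are all made up for illustration, and the real nodes have more options than this.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A word plus the tags attached to it (hypothetical stand-in for a tagged term)."""
    text: str
    tags: set = field(default_factory=set)


def tag_terms(words, dictionary, tag):
    """Attach `tag` to every word that appears in `dictionary` (case-insensitive)."""
    lookup = {entry.lower() for entry in dictionary}
    return [Term(w, {tag} if w.lower() in lookup else set()) for w in words]


def filter_by_tag(terms, tag):
    """Keep only the terms that carry `tag`; this works inside one document."""
    return [t for t in terms if tag in t.tags]


# Made-up sample data: file name -> already extracted and tokenized text.
documents = {
    "report.pdf": "The pump failed after the valve was replaced".split(),
    "memo.pdf": "No issues were reported this week".split(),
}
dictionary = ["pump", "valve"]

tagged = {
    name: tag_terms(words, dictionary, "PART")
    for name, words in documents.items()
}

# Row-level filtering: keeps or drops whole documents (table rows).
rows_with_hits = [
    name for name, terms in tagged.items() if any(t.tags for t in terms)
]

# Term-level filtering: keeps only the tagged terms inside each document.
kept_terms = {
    name: [t.text for t in filter_by_tag(terms, "PART")]
    for name, terms in tagged.items()
}

print(rows_with_hits)
print(kept_terms)
```

The row-level step can only tell you which documents contain a hit, while the term-level step returns the tagged phrases themselves, which is what you are after.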

Hope it helps!

Daria

rsalois86 · May 31, 2018, 9:22am

To be frank, I’d never considered parsing the convenient way to extract text from PDF files. With multiple documents I guess it is, but I usually work with only one, so it’s enough to open it in an editor like this one https://edit-pdf.pdffiller.com/ and copy and paste. At least the result seems more organized and structured when done that way, but who knows (lots of people, just not me).