Extract & Count Keywords from a Set of PDFs

zimpstar · December 16, 2020, 2:41pm

Hi!

I’m trying to create a flow that will do a few things:

1. Parse some PDFs (about 280 of them).
2. Extract keywords that I choose.
3. Count the keywords and match them with the file name, one column per keyword.

The output should look something like this:

[Screenshot: recipes]

As you can see, I want to know how many times a word, say “chocolate”, appears in each PDF, so I can later rank the files by, for example, “the most chocolate-y”.
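For readers who would rather see the target table as a script than as a workflow, here is a minimal Python sketch of the idea. It is only an illustration: it assumes the PDF text has already been extracted, and the file names, texts, keyword list and the `count_keywords` helper are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical, already-extracted PDF texts (file name -> text).
docs = {
    "brownies.pdf": "Melt the chocolate. Add more chocolate and sugar.",
    "pancakes.pdf": "Mix flour, sugar and milk.",
    "mousse.pdf": "Chocolate, chocolate, chocolate and cream.",
}
keywords = ["chocolate", "sugar", "cream"]


def count_keywords(text, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {kw: words[kw] for kw in keywords}


# One row per file, one column per keyword.
table = {name: count_keywords(text, keywords) for name, text in docs.items()}

# Rank the files by how often "chocolate" appears, highest first.
ranking = sorted(table, key=lambda name: table[name]["chocolate"], reverse=True)

for name in ranking:
    print(name, table[name])
```

Sorting on a different keyword only means changing the key used in `sorted`.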

I’ve gone through most knowledge bases here and even got somewhere on my own, but I guess I’m just too old/slow to make this work by myself … Anyone care to help me out? I would be so thankful!

Thank you!

Kathrin · December 16, 2020, 4:28pm

Hi @zimpstar,

welcome to the KNIME Forum.

For your use case I would use the text processing extension of KNIME Analytics Platform. Are you familiar with text processing?

My idea would be to

1. Use the PDF Parser node or the Tika Parser node to read your PDFs (if you use the Tika Parser node, you also need the Strings To Document node).
2. Use the Dictionary Tagger node to tag all your keywords.
3. Use the Tag Filter node to remove all words that you didn’t tag.
4. Create a bag of words using the Bag Of Words Creator node.
5. Use the TF node (TF stands for term frequency) to add a column showing how often each keyword occurs in a document.
6. Create a document vector with the Document Vector node (here it is important to uncheck the checkboxes “Bitvector” and “As collection cell”).

The result should be similar to your table
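To make the data flow through these steps easier to picture, here is a rough Python analogue. It is a sketch of the concepts only, not of how the KNIME nodes are implemented: the documents, the dictionary and the `tokenize` helper are hypothetical, and absolute counts are used as the term frequency.

```python
import re
from collections import Counter

# Hypothetical keyword dictionary and already-parsed documents.
dictionary = {"chocolate", "vanilla"}
documents = {
    "a.pdf": "Chocolate cake with chocolate icing and vanilla.",
    "b.pdf": "Plain vanilla pudding.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Tagging and filtering: keep only the tokens found in the dictionary.
tagged = {
    name: [tok for tok in tokenize(text) if tok in dictionary]
    for name, text in documents.items()
}

# Bag of words plus term frequency: per document, term -> absolute count.
term_freq = {name: Counter(tokens) for name, tokens in tagged.items()}

# Document vector: one numeric column per keyword, in a fixed order.
columns = sorted(dictionary)
vectors = {name: [tf[term] for term in columns] for name, tf in term_freq.items()}

print(columns)
for name, vector in vectors.items():
    print(name, vector)
```

Because the filtering happens before the vector is built, the number of columns equals the number of keywords, not the number of distinct words in the PDFs.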

Please let me know if you need more information on any of the steps!

Cheers
Kathrin

PS: Between Step 1 and 2 you might want to clean up your documents, e.g. by lowercasing everything.
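A small Python illustration of why the lowercasing matters (the sample text and the `count_word` helper are invented for this example): without it, “Chocolate” and “CHOCOLATE” are not counted as “chocolate”.

```python
import re

text = "Chocolate chips. Dark CHOCOLATE. Hot chocolate."
keyword = "chocolate"


def count_word(text, word):
    """Count exact, case-sensitive whole-word matches of `word` in `text`."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text))


# Without clean-up only the last occurrence matches the lowercase keyword.
case_sensitive = count_word(text, keyword)

# After lowercasing the document all three spellings are counted.
case_insensitive = count_word(text.lower(), keyword)

print(case_sensitive, case_insensitive)
```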

zimpstar · December 16, 2020, 4:41pm

Hi Kathrin,

Thank you kindly for the guidance. I’m still having some issues - not sure why.

Here is the error I’m getting:

[Screenshot: error]

Also, won’t the Tag Filter node take a really long time/stress out my CPU? It’s about 30k words total. Thank you for your patience!

Cheers

zimpstar · December 16, 2020, 4:54pm

I did some more tests and this is what I have now:

The error:

ERROR Excel Writer 3:8 Execute failed: The input table at port 0 contains exeeds the column limit (16384) for XLSX.

Any ideas why this might be happening?

EDIT: Looks like it’s turning every word in the documents into a column … I think I messed up the Table Creator -> Dictionary Tagger setup. How should I structure the keywords?
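The effect described in the edit can be reproduced outside KNIME. The Python sketch below is an assumption-laden illustration: `vector_columns` and `fits_xlsx` are hypothetical helpers, the document is synthetic, and the only number taken from the thread is the 16384-column limit from the error message. It shows how an unfiltered vocabulary blows past the limit while a keyword-filtered one does not.

```python
XLSX_MAX_COLUMNS = 16384  # limit quoted in the Excel Writer error message


def vector_columns(documents, keywords=None):
    """Return the term columns a document vector would get.

    With keywords=None every distinct word becomes a column;
    otherwise only words from the keyword set are kept.
    """
    vocabulary = set()
    for text in documents.values():
        for token in text.lower().split():
            if keywords is None or token in keywords:
                vocabulary.add(token)
    return sorted(vocabulary)


def fits_xlsx(term_columns):
    """Check the column count, allowing one extra column for the file name."""
    return len(term_columns) + 1 <= XLSX_MAX_COLUMNS


# Synthetic document with 20,000 distinct filler words plus two keywords.
filler = " ".join(f"word{i}" for i in range(20000))
documents = {"big.pdf": filler + " chocolate sugar chocolate"}
keywords = {"chocolate", "sugar", "cream"}

all_terms = vector_columns(documents)
kept_terms = vector_columns(documents, keywords)

print(len(all_terms), fits_xlsx(all_terms))
print(kept_terms, fits_xlsx(kept_terms))
```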

izaychik63 · December 16, 2020, 6:37pm

Instead of the Document Vector node, you need to use the Tag to String and Pivot nodes. You get the error because you included the document rather than the document name. The TF node has to keep the Filepath column; use it as the document name.
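As a rough picture of what the pivot step does, here is a Python sketch that turns long rows of (file path, term, count) into one row per file with one column per term. The rows and the `pivot` helper are hypothetical; in the workflow the same reshaping is done by the Pivot node.

```python
from collections import defaultdict

# Hypothetical long-format rows: (file path, term, count).
rows = [
    ("recipes/brownies.pdf", "chocolate", 2),
    ("recipes/brownies.pdf", "sugar", 1),
    ("recipes/mousse.pdf", "chocolate", 3),
]


def pivot(rows):
    """Group by file path and spread the terms into columns (missing = 0)."""
    terms = sorted({term for _, term, _ in rows})
    wide = defaultdict(lambda: dict.fromkeys(terms, 0))
    for filepath, term, count in rows:
        wide[filepath][term] += count
    return terms, dict(wide)


terms, wide = pivot(rows)

print("file", *terms)
for filepath, counts in wide.items():
    print(filepath, *(counts[term] for term in terms))
```

Grouping on the file path rather than on the document itself keeps the row key short and readable.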

iperez · December 16, 2020, 7:47pm

@zimpstar I don’t know if this is as simple as the approaches given, but I think this works:

Text Counting.knwf (249.5 KB)

zimpstar · December 17, 2020, 4:32pm

This worked beautifully. Thank you!
