Extract & Count Keywords from a Set of PDFs

Hi!

I’m trying to create a flow that will do a few things:

Parse some pdfs (about 280 of them)
Extract keywords that I choose
Count the keywords and match them with the file name, per column.

The output should look something like this:

recipes

As you can see, I want to know how many times a word, say “chocolate”, appears in each PDF, so I can later rank the files by, for example, “the most chocolate-y”.

I’ve gone through most knowledge bases here and even got somewhere on my own, but I guess I’m just too old/slow to make this work by myself … Anyone care to help me out? I would be so thankful!

Thank you!

Hi @zimpstar,

welcome to the KNIME Forum.

For your use case I would use the text processing extension of KNIME Analytics Platform. Are you familiar with text processing?

My idea would be to

  1. Use the PDF Parser node or Tika Parser node to read your PDFs (if you use the Tika Parser node, you also need the Strings To Document node)
  2. Use the Dictionary Tagger node to tag all your keywords
  3. Use the Tag Filter node to remove all words which you didn’t tag
  4. Create a Bag of Words using the Bag of Words Creator node
  5. Use the TF node (stands for Term Frequency) to add a column with the information of how often a keyword occurs in a document
  6. Create a Document Vector with the Document Vector node (here it is important to uncheck the checkboxes “Bitvector” and “As collection cell”).

The result should be similar to your table :slight_smile:
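For anyone who wants to check the numbers outside KNIME, the counting logic behind steps 2 to 6 can be sketched in plain Python. This is only a rough illustration, not what the KNIME nodes do internally: it assumes the text has already been extracted from each PDF, it only handles single-word keywords, and the names `count_keywords`, `keyword_table` and the sample file names are made up for the example.

```python
import re
from collections import Counter


def count_keywords(text, keywords):
    """Return {keyword: occurrences} for the text of one document."""
    # Lowercase first so "Chocolate" and "chocolate" are counted together.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for missing keys, so absent keywords get a 0.
    return {kw: counts[kw.lower()] for kw in keywords}


def keyword_table(docs, keywords):
    """docs maps file name -> extracted text; result maps file name -> counts."""
    return {name: count_keywords(text, keywords) for name, text in docs.items()}


# Hypothetical sample data standing in for the parsed PDFs.
docs = {
    "brownies.pdf": "Melt the chocolate. Add sugar, then more Chocolate.",
    "pancakes.pdf": "Mix flour, sugar and milk.",
}
keywords = ["chocolate", "sugar", "vanilla"]

table = keyword_table(docs, keywords)
for name, row in table.items():
    print(name, row)
```

Each row of `table` corresponds to one file and each key to one keyword column, which is the same shape as the table in the first post.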

Please let me know if you need more information on one of the steps!

Cheers
Kathrin

PS: Between Step 1 and 2 you might want to clean up your documents, e.g. by lowercasing everything.


Hi Kathrin,

Thank you kindly for the guidance. I’m still having some issues, not sure why.

Here is the error I’m getting:

error

Also, won’t the Tag Filter node take a really long time/stress out my CPU? It’s about 30k words total. Thank you for your patience!

Cheers

I did some more tests and this is what I have now:

The error:

ERROR Excel Writer 3:8 Execute failed: The input table at port 0 contains exeeds the column limit (16384) for XLSX.

Any ideas why this might be happening?

EDIT: Looks like it’s making each keyword in the entire document into a column … I think I messed up in the Table Creator -> Dictionary Tagger setup. How should I structure the keywords?

Instead of the Document Vector node, you need to use the Tag to String and Pivot nodes. You get the error because you included the document itself, not the document name. The TF node has to keep the Filepath column; use it as the document name.
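To illustrate the idea of pivoting only the tagged terms, here is a small Python sketch. It is an assumption about the shape of the data rather than the actual KNIME output: it pretends the TF step yields rows of (file name, term, count), and `pivot_counts` and the sample rows are hypothetical. Because only the chosen keywords become columns, the table stays far below the 16384-column XLSX limit.

```python
def pivot_counts(rows, keywords):
    """Pivot (file_name, term, count) rows into file_name -> {keyword: count}."""
    wanted = set(keywords)
    table = {}
    for file_name, term, count in rows:
        # Every file gets a row, even if none of its terms is a keyword.
        row = table.setdefault(file_name, {kw: 0 for kw in keywords})
        # Terms that are not keywords are dropped instead of becoming columns.
        if term in wanted:
            row[term] += count
    return table


# Hypothetical long-format rows, one per (document, term) pair.
rows = [
    ("brownies.pdf", "chocolate", 2),
    ("brownies.pdf", "sugar", 1),
    ("brownies.pdf", "melt", 1),
    ("pancakes.pdf", "sugar", 1),
]

wide = pivot_counts(rows, ["chocolate", "sugar"])

# Rank the files by one keyword, e.g. "the most chocolate-y" first.
ranked = sorted(wide, key=lambda f: wide[f]["chocolate"], reverse=True)
print(wide)
print(ranked)
```

The row key here is the file name, not the document content, which mirrors the advice to use the Filepath column as the document name.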

@zimpstar I don’t know whether this is as simple as the approaches given, but I think this works:

Text Counting.knwf (249.5 KB)


This worked beautifully. Thank you!

