Search and count words in PDF files

Hello everybody,

I'm new to KNIME and I would like to search for keywords in PDF files and count them. I started with the "PDF Parser" node, followed by "BoW Creator". Is there a node where I can enter the different keywords I want to find in the texts and count them?

Thanks for your help

Try using the TF node to calculate the absolute frequency of each word or term.

Now use the Term to String node to convert the bag of words term column into a string column.

You now have a few options for the actual searching:

Connect a Row Filter node to this string column. You can enter a word there, and the output will contain that word and its number of occurrences.

Or

Connect a Reference Row Filter node to this string column, and connect a Table Creator node to the Reference Row Filter as well. You can then enter multiple words in the Table Creator and get those words and their frequencies at the output of the Reference Row Filter.

Finally, you could instead use the Nominal Value Row Filter, in which you select your desired words from the list.
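If it helps to see the idea outside KNIME, here is a rough sketch in plain Python of the same two steps: count absolute term frequencies, then keep only the keywords you care about. This is not how the KNIME nodes are implemented; the function names, the simple regex tokenizer, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Absolute frequency of each lowercased word in one document."""
    # Very simple tokenizer (an assumption): letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def filter_keywords(frequencies, keywords):
    """Look up only the requested keywords in the frequency table.

    Unlike a row filter, keywords that never occur are kept with count 0.
    """
    return {kw: frequencies[kw.lower()] for kw in keywords}


# Hypothetical sample text standing in for the parsed PDF content.
doc = "KNIME reads the PDF. The PDF Parser feeds the BoW Creator."
counts = term_frequencies(doc)
result = filter_keywords(counts, ["pdf", "knime", "weka"])
print(result)
```

The keyword list plays the role of the Table Creator input, and the lookup plays the role of the Reference Row Filter.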

 

 

Hope this helps.

simon

Hello Simon,

First, thank you very much for your reply. It helped me a lot. I have two follow-up questions:

- Do you know how I can enter words with whitespace in between?

- Can I get a view in which the words and their occurrences appear in total, rather than separately for every PDF file?

Many thanks for your help

As I understand it, the PDF Parser node is pointed at a directory containing multiple PDF files, and the BoW Creator node pulls out the words from all the PDFs together, so aren't the words and occurrences already being displayed as a whole?
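In case the counts do turn out to be listed once per document, the fix is just to sum them per term. Here is a minimal sketch of that in plain Python. The file names and numbers are invented for illustration. Inside KNIME my guess would be a GroupBy node grouped on the term column with a sum aggregation, but treat that as an assumption.

```python
from collections import Counter

# Hypothetical per-document counts, one Counter per PDF file.
per_document = {
    "a.pdf": Counter({"data": 3, "mining": 1}),
    "b.pdf": Counter({"data": 2, "workflow": 4}),
}

# Sum the counts of every term over all documents.
total = Counter()
for counts in per_document.values():
    total.update(counts)  # update() adds counts, it does not overwrite

for term, count in sorted(total.items()):
    print(term, count)
```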

In terms of whitespace, I think the key is getting the BoW Creator node to recognise phrases, rather than only the single words it picks up by default. To do this you will need to apply some of the enrichment nodes before the BoW Creator node, so that phrases and multi-word terms can be identified by it. Some useful enrichment nodes would be the POS Tagger and OpenNLP. You could even have specific phrases tagged by using the Dictionary Tagger or Wildcard Tagger.
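To show why the phrases have to be recognised before the text is split into single words, here is a small Python sketch that counts multi-word keywords directly in the raw text. It only imitates the idea of tagging phrases from a dictionary and is not the KNIME implementation; `count_phrases` and the sample text are hypothetical.

```python
import re


def count_phrases(text, phrases):
    """Count case-insensitive occurrences of each phrase as whole words."""
    # Collapse line breaks and repeated spaces so "text   mining" still matches.
    normalized = re.sub(r"\s+", " ", text.lower())
    counts = {}
    for phrase in phrases:
        # \b keeps "text mining" from matching inside "context mining".
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        counts[phrase] = len(re.findall(pattern, normalized))
    return counts


# Hypothetical sample text with irregular whitespace, as PDF extraction often gives.
text = "Text mining is fun. Text   mining\nwith KNIME makes text mining easier."
phrase_counts = count_phrases(text, ["text mining", "knime"])
print(phrase_counts)
```

If the text were tokenized first, "text" and "mining" would end up as separate terms and the phrase could no longer be counted, which is the gap the tagger nodes are meant to close.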

simon.