Search for specific terms in PDF files

Dear KNIME community,

I've been desperately searching the forum for days to find a very easy solution for searching for specific terms in different PDF files. I want to search for the terms and then display them in a table and in a term cloud.

What I've done so far is add a PDF Parser and a BoW Creator at the start of my workflow. The descriptions of text processing etc. haven't helped me out so far.

I'm very new to KNIME and hope to find some easy help here.

Thanks in advance

Hope you can help me out.

Hi reclam,

the PDF Parser was the right node to start with. Then use the Dictionary Tagger node to find terms in the parsed documents. You also need to provide a list of the terms that you want to search for; this list is your dictionary. Then filter out all terms except the tagged ones, e.g. with the General Tag Set Filter. Finally, create a bag of words.
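KNIME workflows are built graphically, so there is no code behind these nodes to show, but the logic of the dictionary-based search (parse PDFs, keep only dictionary terms, count them) can be sketched in plain Python. The file names, document contents, and dictionary below are made-up placeholders; in KNIME the text extraction itself is done by the PDF Parser node.

```python
# Sketch (not KNIME code): dictionary-based term search over already-extracted
# PDF text, mirroring PDF Parser -> Dictionary Tagger -> tag filter -> BoW.
import re
from collections import Counter

# Placeholder documents standing in for the output of the PDF Parser node
documents = {
    "report_a.pdf": "The bonus scheme and the bonus pool were revised.",
    "report_b.pdf": "No incentive changes this quarter.",
}
dictionary = {"bonus", "incentive"}  # the terms you want to search for

def term_counts(text, terms):
    """Count only the words that appear in the dictionary."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return Counter(w for w in words if w in terms)

for name, text in documents.items():
    print(name, dict(term_counts(text, dictionary)))
```

The resulting per-document counts are what the bag of words plus a frequency node would give you for the tagged terms.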

Attached is an example workflow.

Cheers, Kilian

Hi Kilian,

thanks for your help, which worked out perfectly.

Still, I have two more questions:

  1. I download my PDF files directly from the internet (blogs, journals, etc.) and convert them to PDF with Chrome's "save to PDF". Can I use/pull the corresponding files from the web directly, without converting/downloading them to a PDF file (e.g. with the HttpRetriever)?
  2. In addition to counting every term within one PDF file, I want to count the number of PDF files in which a term occurs at least once. Let me give an example: I'm searching for the term "bonus" and I have 50 PDF files as a data basis. What I'm trying to find out is: in how many of the 50 PDF files does the word "bonus" occur at least once? Is there a method to do this with KNIME?

Thanks in advance for your help!

Best regards

Sebastian

Hi Sebastian,

1.) The Http Retriever allows you to download webpages. If you have e.g. http://www.knime.org/blog as the input URL, the node will download the complete HTML code of this specific webpage. You can convert this result into an XML cell with the Html Parser node. Then use e.g. the XPath node to extract specific fields of the XML (XHTML), e.g. textual fields.

Http Retriever->Html Parser->XPath->...

This combination of Palladian nodes and XML nodes allows you to extract web data easily.
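The download-then-extract idea behind that node chain can be sketched in plain Python. To keep the sketch self-contained, the "downloaded" page is an inline placeholder string; with a real URL you would fetch it first, e.g. with `urllib.request.urlopen(url).read().decode()`. The extractor pulls the text of every `<p>` element, which is the kind of field selection the XPath node does.

```python
# Sketch (not KNIME code) of Http Retriever -> Html Parser -> XPath:
# download a page's HTML, then extract specific textual fields from it.
from html.parser import HTMLParser

# Placeholder standing in for the HTML the Http Retriever would download
html_page = """<html><body>
<h1>KNIME Blog</h1>
<p>First paragraph of the post.</p>
<p>Second paragraph of the post.</p>
</body></html>"""

class ParagraphExtractor(HTMLParser):
    """Collects the text content of every <p> element."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")
    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

extractor = ParagraphExtractor()
extractor.feed(html_page)
print(extractor.paragraphs)
```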

2.) Yes, there is. Build a bag of words, then use the GroupBy node: group over the terms and aggregate over the document column (COUNT). This will result in one row per unique term and the number of documents the term occurs in.

...-->Bag of Words->[Term to String->]GroupBy

You can optionally use the Term to String node to strip the tags.
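The GroupBy step above amounts to a document-frequency count: for each term, count the documents in which it occurs at least once. A minimal Python sketch (not KNIME code), where the bag of words is a list of (term, document) rows as the BoW Creator would produce, with made-up file names:

```python
# Sketch of Bag of Words -> GroupBy (group on term, COUNT unique documents):
# document frequency = in how many documents a term occurs at least once.
from collections import defaultdict

# Placeholder (term, document) rows standing in for the bag-of-words table
bag_of_words = [
    ("bonus", "file1.pdf"), ("bonus", "file1.pdf"),  # "bonus" twice in file1
    ("bonus", "file2.pdf"),
    ("salary", "file1.pdf"),
]

docs_per_term = defaultdict(set)
for term, doc in bag_of_words:
    docs_per_term[term].add(doc)  # set membership means "at least once"

document_frequency = {t: len(d) for t, d in docs_per_term.items()}
print(document_frequency)
```

Note that "bonus" counts as 2 even though it appears three times in total, because it occurs in only two distinct files.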

Cheers, Kilian

Hi Kilian,

thanks for your help.

At least I'm starting to understand KNIME's logic :) Sorry, this is really not my world ;)

I've tried to follow your advice but was not quite successful.

What have I done wrong? Or did I miss anything after the XPath?

Thanks in advance


Best regards

Sebastian

Sorry, I forgot to attach the workflow I'm stuck in.


Best regards
Sebastian

Hi Sebastian,

attached is an example workflow that shows how to extract data with XPath. To learn more about XPath, see http://www.w3schools.com/xsl/default.asp; for the XPath node, see https://www.knime.org/files/nodedetails/_xml_XPath.html.

After data extraction, the strings have to be converted into documents, and the documents into a bag of words. Finally, the terms can be counted. All of this is shown in the example workflow as well.

Cheers, Kilian

Hi Kilian,

a big thanks for creating the workflow for me. I came back from vacation yesterday and was just trying to use it for my data analysis.

When I try to execute the workflow with approx. 30 web pages, two errors occur.

WARN      HttpRetrieverNodeModel             error retrieving https://www.accenture.com/t..... (ETC.)

-> Can this be because the web pages are PDF files?

ERROR     HtmlParser                         Execute failed: Java heap space

-> It seems that my system/Java does not have enough memory. How can I fix this?

The corresponding workflow is attached.

Thanks for your help in advance!
Best

Sebastian

Hi Sebastian,

if you download PDF content from a URL with the HTTP Retriever node, this content can of course not be parsed reasonably with the HTML Parser node. You can use the HTTPResultDataExtractor to extract the content type of the URL from the HTTP header. Filter out all rows with "application/pdf" to get rid of PDF content.
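That filtering step can be sketched in plain Python (not KNIME code): given the Content-Type each URL reported in its HTTP response header, as the HTTPResultDataExtractor node would expose it, drop the PDF responses before HTML parsing. The URLs and headers below are made-up placeholders.

```python
# Sketch of content-type filtering: keep only rows whose HTTP Content-Type
# is not application/pdf, so the HTML parsing step never sees PDF bytes.
results = [
    ("http://example.com/post",  "text/html; charset=utf-8"),
    ("http://example.com/doc",   "application/pdf"),
    ("http://example.com/about", "text/html"),
]

html_only = [(url, ctype) for url, ctype in results
             if not ctype.startswith("application/pdf")]
print([url for url, _ in html_only])
```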

Attached is your workflow with the filtering based on content type.

Additionally, you can put the HTTP Retriever node between a Try node and a Catch node so that the whole workflow does not stop when the HTTP Retriever throws an error. Maybe sometimes the connection is bad or the server is down ... you never know.

Cheers, Kilian

Hi Kilian,

first of all, thank you very much for helping me in such detail. Without your help, I wouldn't have gotten the workflow working correctly at all. I'm quite lost with this :(

When I open the workflow, I now get a quite long list of errors.

Log file is located at: C:\Users\Sebastian\Documents\Knime Projects\.metadata\knime\knime.log
WARN      FileNodePersistor                  Unable to load port content for node "MISSING HttpRetriever": Invalid outport index in settings: 2
WARN      MISSING HttpRetriever              Node can't be executed - Node "HttpRetriever" not available (provided by "palladian.ws; Philipp Katz, Klemens Muthmann, David Urbansky."; plugin "ws.palladian.nodes" is installed)
ERROR     LoadWorkflowRunnable               Errors during load: Status: DataLoadError: KNIME_project21 0 loaded with error during data load
ERROR     LoadWorkflowRunnable                 Status: DataLoadError: KNIME_project21 0
ERROR     LoadWorkflowRunnable                   Status: Error: Node "HttpRetriever" not available (provided by "palladian.ws; Philipp Katz, Klemens Muthmann, David Urbansky."; plugin "ws.palladian.nodes" is installed)
ERROR     LoadWorkflowRunnable                   Status: DataLoadError: MISSING HttpRetriever 0:19
ERROR     LoadWorkflowRunnable                     Status: DataLoadError: Unable to load port content for node "MISSING HttpRetriever": Invalid outport index in settings: 2
ERROR     LoadWorkflowRunnable                     Status: DataLoadError: State has changed from CONFIGURED to IDLE

This is where the workflow stops right away. One question concerning parsing the PDFs from the internet: can I connect the HTML retriever workflow with the earlier PDF parsing workflow? I don't have too many PDFs, but some are not available online and are instead stored on my hard drive.

Best

Sebastian

Hi Sebastian,

what version of KNIME are you using? I created the workflow with the latest KNIME version, 2.12, and the latest Palladian version. To be able to load the workflow, you need 2.12.

Cheers, Kilian

Hi Kilian,

I've updated to the newest KNIME version, but the errors still occur.

I've attached a screenshot of what pops up when I load the workflow.

What am I doing wrong?

Best

Sebastian

Hi Kilian,

I managed to fix the error. I uninstalled KNIME and installed it completely again.

Back to the actual workflow: I still get the Java heap space error at the HTML Parser node. Can you help me with this? (screenshot attached)

Best

Sebastian

Hi Kilian,

I've managed to eliminate the errors.
Still, I have one question.

At the moment I don't use any dictionary (do I?), but the workflow does not count every word (for example, the word "and" is not counted at all). What I want to do is use two testing groups: with the first I identify my variables, and with the second I do the actual testing.

How do I make the workflow count EVERY word and output the frequencies?

Best
Sebastian

Hi Sebastian,

if you haven't filtered out the word "and", e.g. with a Stop Word Filter node, then the word is counted as well by the TF node. A workflow like:

Parser->Bag of words creator->TF

would count every word in the parsed documents.
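The unfiltered counting that this node chain performs can be sketched in plain Python (not KNIME code): with no stop word filter, every word, including "and", gets a frequency. The document text below is a made-up placeholder.

```python
# Sketch of Parser -> Bag of Words Creator -> TF with no filtering:
# every word in the parsed document is counted, stop words included.
import re
from collections import Counter

text = "the bonus and the salary and the bonus"  # placeholder document
words = re.findall(r"[a-z]+", text.lower())
frequencies = Counter(words)
print(frequencies.most_common())
```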

Cheers, Kilian

Hi Kilian,

I can't find a Stop Word Filter node in the workflow you prepared for me, but the workflow still does not extract every word. It still counts only approx. 20 words. I've added the Tag Cloud node and only a few words are shown.
Do you have a clue for me?

Secondly, I'm trying to include my PDF Parser node in the latest workflow. Where do I have to bring the PDF Parser in? After the Strings to Document node?

Best

Sebastian

Hi Sebastian,

I cannot see that words such as "und" and "and" are filtered out. Attached is the example workflow with a Tag Cloud showing all words of the document set, and a part of the tag cloud as an image with the words "and" and "und" highlighted.

The Parser nodes have to be applied at the beginning of the workflow, usually in place of the Strings to Document node, since the Parser nodes create an output table consisting of one document column.

Cheers, Kilian