Line breaks as white space for PDF Parser

Hi everyone. I have a question regarding how to make the PDF parser convert line breaks into white space. I have a folder containing hundreds of PDF files of state legislative bills that look like this: 

https://dl.dropboxusercontent.com/u/107417586/pdf.JPG (the images don't appear, so I am linking to the screenshots instead)

The thing with these PDF documents is that at the end of each line there is a line break. So for example, for the first two lines in the body of the document above,

  1. The Federal American Recovery and Reinvestment Act of 2009
  2. provides for a competitive education grant program that is known as

There is a line break after "2009" in line 1 and after "as" in line 2.

I used the PDF Parser to load all the documents into KNIME, and then used the Document Data Extractor and got a resulting table that looks like this: 

https://dl.dropboxusercontent.com/u/107417586/document2.JPG

The highlighted row is the row corresponding to the PDF document above. The problem is that the words at the end of each line and the beginning of the next are attached together. As an example (encircled in red), you see that "2009" and "provides" are encoded as one word: 2009provides.

My end goal is to tag specific terms of interest in the body of the text documents and compute term frequencies from them. It is therefore important that words separated by a line break are also separated by a space; otherwise some terms will not be recognized and the TFs will be wrong. Is there a way to get around this bug?

Thanks in advance, and I appreciate the help.

Sincerely, 

Vigile

Hi Vigile,

converting documents back to simple strings with the Document Data Extractor in order to inspect the tokenization can be misleading. Two or more tokens can appear "tied together" in the string, without separating white space, even though they are separate tokens in the document. This happens when the tokens are separated by characters that are not displayed as white space, such as line breaks. The tokens are nevertheless detected as separate words. To see exactly how the document has been tokenized, use the Bag of Words creator node. This node lists all unique words (terms) of the input documents. You should see that "2009" and "provides" appear in different rows of the table, indicating that they are two different words in the document. The TF node works on the tokens of the document, meaning that these two words will be counted separately.
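
To illustrate the idea outside of KNIME, here is a minimal Python sketch. It is not how the Text Processing nodes are implemented; the sample string and the whitespace-based splitting are assumptions made for the example. It shows how a string view that drops line breaks glues two words together, while tokenizing on any white space keeps them apart.

```python
from collections import Counter

# Hypothetical text as it might come out of a PDF:
# the first line ends in a line break, not in a space.
raw = "Reinvestment Act of 2009\nprovides for a competitive education grant program"

# A string view that drops line breaks makes the words look attached.
displayed = raw.replace("\n", "")

# Splitting on any white space (spaces and line breaks) keeps them separate.
tokens = raw.split()

# Term frequencies are computed on the tokens, not on the displayed string.
term_counts = Counter(tokens)

print("2009provides" in displayed)
print(term_counts["2009"], term_counts["provides"])
```

In this sketch "2009provides" exists only in the displayed string; the token list still holds "2009" and "provides" as two separate entries.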

Cheers, Kilian

Hi Kilian, 

Thanks so much for your answer! :-) It is useful to know that words in a document are tokenized separately even when they look "tied together." This resolves my issue.

Sincerely, 

Vigile

Hi, sorry, I have a related question. If I wanted to use a Wildcard Tagger for regular expression matching at the multi-term level, I could not specify literal white spaces between terms in my regex column, because some of the words in the PDF have a line break instead of a white space between them. Is this correct?

Thanks again, 

Vigile

Hi Vigile,

the best way to see how words have been tokenized is usually to create a bag of words.

About your question regarding the Wildcard Tagger: yes, you are right. Use the whitespace pattern "\s*" between the terms in your regular expression instead of a literal space.
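
As a rough illustration, here is a small Python sketch using the standard `re` module. It is not the Wildcard Tagger itself, and the sample text is made up for the example; it only shows that a literal space fails across a line break while "\s*" matches.

```python
import re

# Hypothetical snippet in which two terms are separated by a line break.
text = "a competitive education\ngrant program that is known as"

# A literal space does not match the line break between the two terms.
literal_space = re.compile(r"education grant")

# \s matches any whitespace character, including line breaks.
any_whitespace = re.compile(r"education\s*grant")

literal_match = literal_space.search(text)
whitespace_match = any_whitespace.search(text)

print(literal_match)
print(repr(whitespace_match.group()))
```

Note that "\s*" also matches when there is no white space at all between the terms; "\s+" would be the stricter choice if at least one whitespace character should be required.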

Cheers, Kilian
