.pdf to document || string to document for OpenNLP NE Tagger

Hello

Can someone please help me or give some advice on an issue I cannot seem to get past? My overall goal is to read in a .pdf document and then use OpenNLP NE Tagger to extract all the names. I am basing this loosely on the KNIME article "Named Entity Recognizer and Tag Cloud Example".

I have tried two methods: 1) using PDF Parser and 2) using Tika Parser with a Strings To Document node. Both methods have the same issue (I know I am the common denominator!).

I am testing with 3 different .pdf files from 3 different sources. As the content of these .pdfs is confidential, I am only able to show a restricted sample of my results and what I have done.

The issue is: methods 1 and 2 both produce a document. It appears that when KNIME constructs the document it doesn’t take into account different lines, and the result is


I have blacked out confidential stuff and I have highlighted in yellow the start of each new line. As you can see, these lines are joined without spaces. This means that if line 44 ends in “… beam me up Scotty” and line 45 starts with “James Tiberius Kirk …”, the document reads “beam me up ScottyJames Tiberius Kirk”.

So, my first question is: how can I stop this? Second question: even if I can get round this, is my planned goal achievable?

Many Thanks

Frank

Can you please try this workflow first with your idea:

I suspect that what you are showing is the document column, which is actually just how KNIME has decided to show you the title. You can turn off this misleading preview by making “Title” an empty string.
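If the joined lines turn out to be in the text itself and not only in the preview, one option is to replace line breaks with a space before the string is turned into a document. Below is a minimal Python sketch of that idea, outside KNIME. The function name `join_lines_with_spaces` is hypothetical, and it assumes the parser hands you the raw text with its line breaks still present; inside a workflow you would need an equivalent string-replacement step before Strings To Document.

```python
import re


def join_lines_with_spaces(raw_text: str) -> str:
    """Replace each run of line breaks (plus surrounding whitespace) with one space.

    Hypothetical helper for illustration only; it is not part of KNIME,
    Tika, or OpenNLP.
    """
    return re.sub(r"\s*[\r\n]+\s*", " ", raw_text).strip()


# Sample text taken from the example in the question.
raw = "... beam me up Scotty\nJames Tiberius Kirk ..."

# Dropping the line break without a separator reproduces the fused token.
naive = "".join(raw.splitlines())

# Replacing the line break with a space keeps the two words apart.
fixed = join_lines_with_spaces(raw)

print(naive)
print(fixed)
```

With the words kept apart, a name tagger at least sees “Scotty” and “James” as separate tokens; whether it then tags them correctly depends on the model.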


