Relationship between the PDF Parser and POS Tagger nodes

Hello!
I am currently exploring KNIME and I have a question about the relationship between the PDF Parser node and the POS Tagger node. I noticed that both nodes have a “Word Tokenizer” setting, and I selected “OpenNLP English WordTokenizer” in both.

For my work, I need to determine the number of tokens identified after running the PDF Parser node alone, and then the number of tokens after the PDF Parser - POS Tagger sequence.

I count the tokens with the workflow Bag of Words Creator - CSV Writer. As mentioned, I do this first directly after the PDF Parser node and then a second time after the PDF Parser - POS Tagger sequence has run.

After carrying out this process and comparing the number of tokens that each variant gives me, I find that the counts differ (the PDF Parser run was used only to get the number of tokens). I do not understand why, since both nodes (PDF Parser and POS Tagger) were configured to tokenize using the same scheme.

I would appreciate an answer and look forward to your reply.

Manuela

Hey Manuela,

I’m sorry for the late answer.
The ‘Bag of Words’ node creates a bag of words for each document in the table. Each term that occurs in a document is listed only once in its bag of words, even if it occurs several times. The number of terms in the BoW differs between tagged and untagged documents because the same term can receive different tags, and each term/tag combination is then listed as a separate entry in the table. However, the tokenization of the documents should be the same in both cases. (To check that the tokens are the same, you could apply the POS Tagger, then strip the tags again with the Tag Stripper node, and then apply the BoW node. Afterwards you can compare the number of terms in each table.)

For example: The untagged document produces a BoW like this:

  • term1[.]
  • term2[.]
  • term3[.]

and the BoW for the tagged document could look something like this:

  • term1[tag1]
  • term2[tag2]
  • term2[tag3]
  • term3[tag4]
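
To illustrate the same effect outside KNIME, here is a minimal Python sketch. It is not how the KNIME nodes are implemented; the terms, the tags, and the helper `bag_of_words` are hypothetical and only mimic the idea that a bag of words keeps each distinct entry once.

```python
# Hypothetical (term, tag) pairs, as a POS tagger might emit them.
tagged_terms = [
    ("book", "NN"),
    ("book", "VB"),    # same term, different tag: a separate entry
    ("flight", "NN"),
    ("the", "DT"),
    ("book", "NN"),    # repeated occurrence: listed only once
]


def bag_of_words(entries):
    """Return the distinct entries, preserving first-seen order."""
    return list(dict.fromkeys(entries))


# Bag of words over term/tag combinations (tagged documents).
tagged_bow = bag_of_words(tagged_terms)

# Bag of words over the bare terms (tags stripped or never assigned).
stripped_bow = bag_of_words(term for term, _tag in tagged_terms)

print(len(tagged_bow))    # number of distinct term/tag combinations
print(len(stripped_bow))  # number of distinct terms without tags
```

The tagged bag is larger only because "book" appears with two tags; once the tags are ignored, both bags contain the same terms.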

I hope this answers your question.

Cheers,
Julian

Hi!
Thanks for the answer. I have followed your instructions.
I have another question: where can I find the stop word lists for English and Spanish? Thank you.

Manuela

Hey Manuela,

Do you mean the built-in stop word lists that can be used by the Stop Word Filter node? As far as I know, there isn’t any public resource where you can look at them right now.
If you want to use your own stop word list (to have a better overview of which words are filtered), you can use the Dictionary Filter node. There are several stop word lists available on the internet (1, 2 etc.).
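
Conceptually, filtering with your own list works like the following minimal Python sketch. This is only an illustration, not the Dictionary Filter node itself; the word list, the sample terms, and the helper `filter_stop_words` are hypothetical.

```python
# Hypothetical custom stop word list; in practice it would be loaded from a file.
custom_stop_words = {"the", "a", "of", "and"}


def filter_stop_words(terms, stop_words):
    """Keep only the terms that are not in the stop word list (case-insensitive)."""
    return [term for term in terms if term.lower() not in stop_words]


terms = ["The", "relation", "of", "parser", "and", "tagger"]
filtered = filter_stop_words(terms, custom_stop_words)

print(filtered)
```

Because the list is your own, you can see exactly which words are removed and adjust it for English or Spanish as needed.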

Cheers,

Julian

Hi!
Thank you for the clarification and for giving me the attached lists.
You have helped me a lot, thank you very much.
Manuela