Hello!
I am currently exploring KNIME and I have a question about the relationship between the PDF Parser node and the POS Tagger node. I have noticed that both nodes have a "Word Tokenizer" setting, and in both nodes I have selected the "OpenNLP English WordTokenizer".
For my work I need to count the number of tokens identified after running the PDF Parser node alone, and then the number of tokens after the PDF Parser - POS Tagger sequence.
I count the tokens using the workflow Bag of Words Creator - CSV Writer. As mentioned, I do this first after only the PDF Parser node, and then a second time after the PDF Parser - POS Tagger routine runs.
After carrying out this process and counting the tokens each run produces, I get different numbers of tokens, but I do not understand why, since both nodes (PDF Parser and POS Tagger) were configured to perform the tokenization using the same scheme.
I’m sorry for the late answer.
The ‘Bag of Words’ node creates a bag of words for each document in the table. Each term that occurs in a document is listed only once in the bag of words, even if it occurs more often. The difference between the number of terms in the BoW built from tagged documents and the BoW built from untagged documents is that some terms can carry different tags, and each term/tag combination is then listed as a separate row in the table. However, the tokenization of the documents should be the same in both cases. (To check that the tokens are the same, you could apply the POS Tagger, then strip the tags again with the Tag Stripper node, and then apply the BoW node. Afterwards you can compare the number of terms in each table.)
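The effect can be sketched in a few lines of plain Python (illustrative only, not KNIME's implementation; the tokens and tags are made-up examples):

```python
# Why a BoW built from tagged documents can contain more rows than one
# built from the untagged text, even though the tokenization is identical.

# Hypothetical tagged tokens: (term, POS tag) pairs for one document.
tagged_tokens = [
    ("run", "NN"),   # "run" used as a noun
    ("run", "VB"),   # "run" used as a verb -> a second row for the same term
    ("fast", "RB"),
    ("fast", "RB"),  # repeated occurrences collapse to a single row
]

# Untagged BoW: each distinct term appears once.
untagged_bow = {term for term, tag in tagged_tokens}

# Tagged BoW: each distinct term/tag combination appears once.
tagged_bow = set(tagged_tokens)

print(len(untagged_bow))  # 2 -> {"run", "fast"}
print(len(tagged_bow))    # 3 -> "run" is listed twice, once per tag
```

So the token stream is the same in both runs; only the grouping into BoW rows differs.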
For example: The untagged document produces a BoW like this:
term1[.]
term2[.]
term3[.]
and the BoW for the tagged document could be something like this:
term1[NN]
term2[NN]
term2[VB]
term3[JJ]
Here term2 appears twice because it was tagged differently in different places, which is why the tagged BoW can contain more rows.
Hi!
Thanks for the answer, I have followed your instructions.
I have another question … where can I find the stop word lists for English and Spanish? Thank you
Do you mean the built-in stop word lists that can be used by the Stop Word Filter node? As far as I know, there isn’t any public resource where you can view them right now.
If you want to use your own stop word list (to have a better overview of which words are filtered), you can use the Dictionary Filter node. Several stop word lists are available on the internet (1, 2 etc.).
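Conceptually, dictionary-based stop word filtering is just a membership test against your list. A minimal Python sketch (not the KNIME Dictionary Filter implementation; the stop word sets here are tiny made-up samples, while real lists typically contain a few hundred entries per language):

```python
# Minimal sketch of dictionary-based stop word filtering.
english_stop_words = {"the", "a", "an", "and", "of", "to", "in"}
spanish_stop_words = {"el", "la", "los", "las", "y", "de", "en"}

def filter_stop_words(tokens, stop_words):
    """Keep only tokens that are not in the stop word set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "parser", "extracts", "the", "text", "of", "a", "document"]
print(filter_stop_words(tokens, english_stop_words))
# -> ['parser', 'extracts', 'text', 'document']
```

Keeping the list in a plain text file (one word per line) makes it easy to see and edit exactly which words get filtered, which is the overview advantage mentioned above.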