BoW - removes tokens?


The BoW node seems to delete some tokens from the text when creating the tokens. I wonder what is happening.

The chain:  PDF Parser -> BoW creator.   In parallel, PDF Parser -> Document Viewer.

For example, a text snippet from the PDF from the Document Viewer:

Fehlerbeschreibung Abhilfe A 00 100 Software-Version passt nicht zur Hardware-Version A 00 101 Prüfsummenfehler Totalreset und Neuabgleich erforderlich A 00 102 Prüfsummenfehler Totalreset und Neuabgleich erforderlich W 00 103 Initialisierung - bitte warten Falls die Meldung nicht nach einigen Sekunden verschwin- det , Elektronik tauschen . A 00 106 Download läuft - bitte warten Beendigung des Download abwarten A 00 110 Prüfsummenfehler Totalreset und Neuabgleich erforderlich A 00 111 A 00 112 A 00 114 A 00 115 Elektronik defekt Gerät aus-/einschalten



After the BoW node, a good number of the words and numbers are missing from the document output table:


Any idea what is happening here?

Hi s3ma,


I tried to reproduce the problem using Table Creator -> Strings to Document -> BoW node with the text quoted above, but the BoW is created correctly; no terms or numbers are missing. I wonder whether the PDF is parsed correctly (the Apache PDFBox library is used internally). Does the Document Viewer show the correct text? You can also place the Document Viewer after the BoW node (PDF Parser -> BoW -> Document Viewer) to check whether the document contains all terms and numbers. Which operating system are you using? Could you attach an example workflow including data to reproduce the problem? That would make it much easier for me to find the cause.


Thanks, Kilian

Hi Kilian,

thanks for the fast reply.

Here's a complete workflow group directory. The PDF is in the BA1 ZIP. The text is on page 104 of the PDF. With PDF Parser -> DocumentViewer, I see the correct text. With the "Documents Output Table" in the context menu of the BoW creator, the text for the table from page 104 starts at row 3870/3880.

Operating System is Windows 7 Enterprise 64bit Service Pack 1 - using KNIME 2.7.4 64bit.



Hi s3ma,


thanks for the data and the workflow. I think I get your point. On page 104 of the PDF there are terms (e.g. "103") that do not occur in the output table of the BoW node in rows 3870 or later. The reason is that the BoW contains a set of distinct terms only. This means that the term "103" is listed somewhere before row 3870 in the BoW output table. To search for words in the output table, open the output table view, press Ctrl + F, and enter the search term. Alternatively, you can click on the header of the Term column to sort the rows, which makes searching easier. The term "103", for instance, is listed in row "Row 249".
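To illustrate the effect, here is a minimal Python sketch of a distinct-term list. It is not the KNIME implementation: the whitespace tokenization, the first-occurrence row order, the function name distinct_terms, and the two sample "pages" are all assumptions made for the example. It only shows why a term that appears again on a later page gets no new row.

```python
def distinct_terms(pages):
    """Return each term once, in order of first occurrence across all pages.

    Assumption: simple whitespace tokenization, not the KNIME tokenizer.
    """
    seen = set()
    ordered = []
    for page in pages:
        for term in page.split():
            if term not in seen:
                seen.add(term)
                ordered.append(term)
    return ordered


# Hypothetical two-page document: "103" appears on both pages.
pages = [
    "Fehler 103 siehe Tabelle",
    "W 00 103 Initialisierung - bitte warten",
]

terms = distinct_terms(pages)

# "103" gets a single row, at the position of its first occurrence,
# so it does not show up again among the rows added for the second page.
print(terms)
print(terms.index("103"))
```

In this sketch the rows contributed by the second page are only the terms not seen before, which matches what you observed around row 3870.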


I hope this helps.

Cheers, Kilian

Ah, thanks, this explains the behavior.


