BUG in "Tag Stripper" node or in "GroupBy" node?

KNIME Version: KNIME 3.5.2

I have a PDF document that I read with a PDF Parser. Further along, I tag the document (for example with the Stanford Tagger). I create a bag of words from the document, then strip the tags from the document and apply a GroupBy to the stripped document column without any aggregation. I expect to get the same table as I had right after the parser. However, I do not: the grouping does not work. Why?




Can you attach an example workflow to reproduce the problem?

Cheers, Kilian

I already attached the file and an image - I am not sure why they are not displayed. What do I need to do?

You need to use the Tag Stripper node before the Bag of Words node. Or group only on the Document column.

The reason is that preprocessing nodes such as the Tag Stripper create a new preprocessed document instance for every document of the input table. If you run the preprocessing nodes on a bag of words (BoW), they will create many more documents than needed, since the BoW usually has more rows than there are documents in the corpus.

If you apply the preprocessing nodes directly to the list of documents, e.g. the output of the Tagger node, they will create the right number of new preprocessed documents.
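
To illustrate why the node order matters, here is a minimal Java sketch. It is a toy model only: `Doc`, `stripTags`, `bagOfWordsDocColumn` and `distinctInstances` are hypothetical names made up for this example and are not KNIME classes or APIs. It only assumes that a preprocessing step produces one new document object per input row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy model only: nothing here is part of the KNIME API.
final class PreprocessOrderSketch {

    /** Minimal "document"; no equals/hashCode, so every instance is distinct. */
    static final class Doc {
        final String text;
        Doc(String text) { this.text = text; }
    }

    /** Mimics a preprocessing node: creates one new Doc per input row. */
    static List<Doc> stripTags(List<Doc> column) {
        List<Doc> out = new ArrayList<>();
        for (Doc d : column) {
            // Removes toy tags of the form "/NNP" from the text.
            out.add(new Doc(d.text.replaceAll("/[A-Z]+", "")));
        }
        return out;
    }

    /** Mimics the document column of a bag of words: one row per term, same Doc in each row. */
    static List<Doc> bagOfWordsDocColumn(Doc d) {
        List<Doc> rows = new ArrayList<>();
        for (String term : d.text.trim().split("\\s+")) {
            rows.add(d);
        }
        return rows;
    }

    /** Counts how many different object instances a column holds. */
    static int distinctInstances(List<Doc> column) {
        Set<Doc> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        seen.addAll(column);
        return seen.size();
    }

    public static void main(String[] args) {
        Doc tagged = new Doc("KNIME/NNP reads/VBZ PDFs/NNS");

        // Strip first, then build the bag of words: one stripped document, shared by all rows.
        List<Doc> stripThenBow = bagOfWordsDocColumn(stripTags(List.of(tagged)).get(0));
        System.out.println(distinctInstances(stripThenBow));

        // Build the bag of words first, then strip: one new document per row.
        List<Doc> bowThenStrip = stripTags(bagOfWordsDocColumn(tagged));
        System.out.println(distinctInstances(bowThenStrip));
    }
}
```

In this toy model, stripping before the bag of words leaves a single document instance shared by all three rows, while stripping afterwards yields three separate instances, one per row.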

Cheers, Kilian

I think I got it:

The original document column contains, in each cell, only references to the same Java object (instance). Let us assume I originally have OD = 1 document. Now I use a Bag of Words Creator (BOW). The BOW does not create new Java objects (class instances) of this document, but instead puts a reference to this document into each cell. When I use a GroupBy node on this column, it checks for object identity, which matches, so I get only OD rows back.

If I now use a Stanford Tagger before the BOW, the BOW can again just copy the references to the tagged document, without building a bunch of actually new Java instances (instantiated with Java's "new"). So the GroupBy node, checking for object identity, also works on the tagged document column: I get only OD rows back.

However, if I use a node such as the Tag Stripper after the BOW, the Tag Stripper is not able to check which references in the respective input document column (here the tagged document) refer to the same document instance. So the Tag Stripper simply creates a new stripped document instance for each row. When the GroupBy node now checks for object identity in the stripped document column, it does not find any equal ones, and I get just as many rows as I had from the BOW.

What I implicitly and wrongly expected was that a) the BOW checks which of the references in the column of tagged documents actually refer to the same Java instance (which could actually be implemented this way, and would make the nodes computationally more efficient), or b) that the GroupBy node does not check for object instance identity but does a deep comparison of the object members based on the contained model (implementing this might be difficult, depending on the implementation of the objects).
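
To make the difference between the two kinds of comparison concrete, here is a small Java sketch of my mental model. It is only an illustration: `Doc`, `groupsByIdentity` and `groupsByContent` are made-up names, not KNIME code, and I am assuming (not claiming) that GroupBy behaves like the identity variant.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: Doc is a hypothetical stand-in, not a KNIME class.
final class GroupingSketch {

    static final class Doc {
        final String text;
        Doc(String text) { this.text = text; }
    }

    /** Groups by reference: rows share a group only if they hold the same instance. */
    static int groupsByIdentity(List<Doc> column) {
        Map<Doc, Integer> groups = new IdentityHashMap<>();
        for (Doc d : column) {
            groups.merge(d, 1, Integer::sum);
        }
        return groups.size();
    }

    /** Groups by content: rows share a group if their text is equal. */
    static int groupsByContent(List<Doc> column) {
        Map<String, Integer> groups = new HashMap<>();
        for (Doc d : column) {
            groups.merge(d.text, 1, Integer::sum);
        }
        return groups.size();
    }

    public static void main(String[] args) {
        Doc original = new Doc("some parsed text");
        List<Doc> shared = new ArrayList<>(); // every row references the same instance
        List<Doc> copies = new ArrayList<>(); // every row holds its own new instance
        for (int i = 0; i < 4; i++) {
            shared.add(original);
            copies.add(new Doc(original.text));
        }
        System.out.println(groupsByIdentity(shared)); // prints 1
        System.out.println(groupsByIdentity(copies)); // prints 4
        System.out.println(groupsByContent(copies));  // prints 1
    }
}
```

The shared references collapse into one group, the per-row copies do not, and only a content-based comparison would merge the copies again, which is option b) above.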


Did I get this right this time?