Bag of words: Same process - different results

Bag of Words is splitting text differently for no obvious reason. See the marked region in the attached image.

Background:
I am trying to match bank account data with a member list. To do so, I take some columns from both datasets and try to find similarities.

As part of this process I preprocess the data: I combine the columns, convert the result to a document, erase punctuation, and convert the case before I create the bag of words.
The processing is essentially identical for both datasets.
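For readers who do not have the workflow open, the preprocessing described above can be sketched roughly in Python. This is only an illustration under assumptions: the column names and sample rows are hypothetical, and the sketch splits on whitespace at the very end, whereas KNIME tokenizes inside the "String to Document" node.

```python
import string
from collections import Counter


def preprocess(row, columns):
    """Combine the chosen columns, erase punctuation, and lower-case the text."""
    combined = " ".join(str(row[c]) for c in columns)
    no_punct = combined.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower()


def bag_of_words(text):
    """Count terms, splitting on whitespace only (a simplification)."""
    return Counter(text.split())


# Hypothetical sample rows; the real column names are not given in the thread.
bank_row = {"purpose": "Membership fee, 2024", "payer": "Doe, Jane"}
member_row = {"last_name": "Doe", "first_name": "Jane"}

bank_terms = bag_of_words(preprocess(bank_row, ["purpose", "payer"]))
member_terms = bag_of_words(preprocess(member_row, ["last_name", "first_name"]))

# Terms that occur in both bags are candidates for a match.
shared = sorted(set(bank_terms) & set(member_terms))
```

If both branches really apply the same steps with the same settings, identical input text should always produce identical terms.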

Expected Behaviour
The document is split into terms at word boundaries in both cases.

Actual Behaviour
In one case the text is split at word boundaries: there the word came from its own field, where it was in the first position (a single word in the field).

In the other case the word is split into two terms: there it was part of a longer text and in the last position.

Image of the flows and results

Image of the documents

KNIME 5.2
Linux Mint 21.2

@kludikovsky maybe check whether every node in the sequence is based on the previously processed document and not on the original one.

Also, maybe provide a sample workflow.

Hi @mlauber71,
thanks for your feedback.
Attached the workflow.

Just to make sure that you can see what the process creates here, I am also posting the results I get.

KNIME_project_Test_BoW.knwf (49.0 KB)

Another one:
To exclude the Excel reader from the equation, I thought I would use .csv files instead. Same effect.

Next, let's take the same input file. Result: still different output!!!

Has anybody any idea?

KNIME_project_Test_BoW_V3.knwf (114.9 KB)

For everyone who might run into the same issue:

I finally found the solution:
It was the “String to Document” node that caused the differences in output.
The working part used the “OpenNLP English WordTokenizer”,
while the non-working part had the “OpenNLP SimpleTokenizer” set.
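To illustrate why the tokenizer choice can change the terms, here is a minimal Python sketch. It is not the OpenNLP code: `char_class_tokenize` is a hypothetical approximation of a character-class tokenizer, which is how I understand the SimpleTokenizer to behave (it starts a new token whenever the character class changes between letters, digits, and other symbols). The sample string is invented, since the affected word is only visible in the screenshots.

```python
def char_class(ch):
    """Classify a character as whitespace, letter, digit, or other."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def char_class_tokenize(text):
    """Start a new token whenever the character class changes (approximation)."""
    tokens = []
    start = None
    prev_class = "space"
    prev_char = ""
    for i, ch in enumerate(text):
        cls = char_class(ch)
        if prev_class == "space":
            if cls != "space":
                start = i  # first character of a new token
        elif cls != prev_class or (cls == "other" and ch != prev_char):
            tokens.append(text[start:i])  # close the token at the class change
            start = i
        prev_class, prev_char = cls, ch
    if prev_class != "space" and start is not None:
        tokens.append(text[start:])  # close the final token
    return tokens


# Hypothetical input: a word that mixes letters and digits.
sample = "transfer mueller2024"
by_whitespace = sample.split()
by_char_class = char_class_tokenize(sample)
```

Here `by_whitespace` keeps `mueller2024` as one term, while `by_char_class` splits it into `mueller` and `2024`. When two branches produce different terms from the same text, comparing the tokenizer setting of every "String to Document" node is a cheap first check.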
