Bag of Words is splitting text differently for no obvious reason. See the marked region in the attached image.
Background:
I am trying to match bank account data against a member list. To do this, I take some columns from both datasets and try to find similarities.
As part of this process I preprocess the data by combining the columns, converting them to documents, erasing punctuation, and converting the case before creating the bag of words.
The processing is basically identical for both datasets.
Expected Behaviour
The documents are split into terms at word boundaries in both datasets.
Actual Behaviour
One dataset splits at word boundaries, where the value came from its own field and stood at the first position (a single word in the field).
The other splits the same word into two terms, where it was part of a longer text and stood at the end position.
I finally found the solution:
It was the "String to Document" node that caused the difference in output.
The working branch used the "OpenNLP English WordTokenizer",
while the non-working branch had the "OpenNLP SimpleTokenizer" set.
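For anyone hitting the same symptom: the SimpleTokenizer splits a string whenever the character class changes (letters vs. digits vs. punctuation), whereas a word tokenizer keeps alphanumeric runs together. The following is a rough Python sketch of that difference, not the actual OpenNLP code; the sample string `"meier1"` is hypothetical.

```python
import re

def simple_tokenize(text):
    # Approximation of OpenNLP SimpleTokenizer behavior:
    # a new token starts whenever the character class changes
    # (letter run, digit run, or single punctuation mark).
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]", text)

def word_tokenize(text):
    # Rough stand-in for a word tokenizer that keeps
    # whitespace-delimited alphanumeric runs intact.
    return re.findall(r"\S+", text)

sample = "meier1"
print(simple_tokenize(sample))  # splits at the letter/digit boundary
print(word_tokenize(sample))    # keeps the value as one term
```

So a value like an account holder name with a trailing digit ends up as two terms under SimpleTokenizer but as one term under the word tokenizer, which is exactly the mismatch described above.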