Thanks for your answer, @ipazin
In the other thread they recommended the "String Similarity" node. Unfortunately, it did not work for me because it compares two cells in the same row of the same document.
What I need is to find a similar cell anywhere in the corresponding column of the other document.
It depends a little bit on the size of your dataset, but maybe you can first use the Cross Joiner node to create all possible combinations and then do a similarity search.
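For illustration only, here is a rough sketch of that idea in plain Python rather than KNIME nodes; the column contents are made up and the similarity measure from Python's standard library is just a stand-in for whatever metric you end up using:

```python
# Rough sketch of the cross-join idea: pair every value of one column with
# every value of the other and score each pair with a string similarity
# measure (difflib's ratio is used here purely as a stand-in).
from difflib import SequenceMatcher
from itertools import product

column_a = ["TRR6/7 Waterheater"]                          # illustrative values only
column_b = ["654864273_ Waterheater TRR6/7 - Sparepart",
            "Some completely unrelated entry"]

# All combinations (what the Cross Joiner node produces), each with a score.
scored = [(a, b, SequenceMatcher(None, a, b).ratio())
          for a, b in product(column_a, column_b)]

# Keep the best-scoring counterpart for each value of column_a.
for a in column_a:
    best = max((t for t in scored if t[0] == a), key=lambda t: t[2])
    print(best)
```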
gr. Hans
The suggestions by @ipazin and @HansS are perfectly fine and sensible for comparing strings with insertions, deletions and substitutions in general. However, I notice that your examples also involve permutations, and this may be the issue. For instance, you would like these two sentences to be considered the most similar among the others:
TRR6/7 Waterheater
654864273_ Waterheater TRR6/7 - Sparepart
when compared to each other, given that they share 2 out of 4 words, regardless of the word order. As shown in the example, "Waterheater" and "Sparepart" are permuted. In this case, the -String Similarity- node, or for that matter the -Similarity Search- node, will not perform as well with, for instance, a Levenshtein metric.
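To see concretely what an edit-distance metric does with these permuted strings, here is a minimal pure-Python check (a textbook Levenshtein implementation, not a KNIME node):

```python
# Standard dynamic-programming Levenshtein distance, for illustration only.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

s1 = "TRR6/7 Waterheater"
s2 = "654864273_ Waterheater TRR6/7 - Sparepart"
print(levenshtein(s1, s2))  # a large distance, despite the two shared words
```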
The way to deal with permutations in string similarity is to tokenize your sentences, i.e. to convert them into a number of binary-type (or frequency-type) columns, where every column tells whether a "token" (i.e. a word) is present or not in a sentence. In this case, what matters is the presence or absence of the same words, not their order. The similarity comparison between strings is thus done in terms of the proportion of words shared by two different sentences.
For simplicity, the set of binary columns can then be converted into a single column of type bit vector, which becomes a fingerprint. If all the words (despite permutation) are present in both sentences, the similarity between the fingerprints is maximal. A suitable metric here is, for instance, Tanimoto for binary fingerprints or cosine distance for frequency fingerprints. KNIME has all the nodes needed to achieve this type of string comparison.
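As a minimal sketch of the idea in plain Python (the tokenization here is just a lowercase whitespace split; in KNIME the same steps would be done with the text-processing and distance nodes):

```python
# Tokenize each sentence into a set of words (a binary "fingerprint" of which
# tokens are present), then compare the fingerprints with Tanimoto similarity.
def tokens(sentence: str) -> set:
    return set(sentence.lower().split())

def tanimoto(a: set, b: set) -> float:
    # |intersection| / |union| of the two token sets.
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = "TRR6/7 Waterheater"
s2 = "654864273_ Waterheater TRR6/7 - Sparepart"
print(tanimoto(tokens(s1), tokens(s2)))  # 2 shared tokens out of 5 distinct -> 0.4
```

Word order plays no role here: only the shared tokens count, which is exactly what is needed with permuted sentences.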
I would hence tokenize the sentences to convert them into binary fingerprints and then use the -Similarity Search- node, but this time with a Tanimoto metric.
Hope this helps. If this sounds too complicated to put in place but still interesting to you, please share a dozen rows or more of your data, with pairs of strings that should be considered similar, and I'll provide a workflow implementing this suggestion from there.
@aworker you described my problem very well. Permutations are a complicated part of the issue. I have prepared 2 Excel files with about 20 rows. The original document has more than 8k rows.
Extra information:
The column “Product” of the “Insurance” document at best contains the information found in the columns “Functional location” and “Technical Object” of the “Maintenance” document.
The workflow is self-commented but please do not hesitate to come back if you have questions.
Please be aware that even if the comparison was 100% successful here, sentence comparison with such complicated variations may not always be successful. Nevertheless, I believe this solution solves a large percentage of the automatic matching problem.
Did you have the opportunity to try the workflow on your 8k rows? If so, how well did it perform? I'd be grateful for your feedback, @D_Valle.
Hi @aworker!
Thank you very much for your answer!
Now I am testing it with the real files and checking the accuracy of the results.
At the moment it seems to be a very good solution!
Thanks again!
Thanks for your feedback; I'm really glad to read that this is working so well!
There is still room for improvement, though, and for new features beyond this preliminary solution, if needed.
Please let me know how things go and get back in touch if you have any further questions.
I have a similar problem and would really appreciate it if you could give me a hand. The issue is the following:
I am using 2 Excel files: Basketball_players and Baskeball_dataset.
The idea is to compare the records from the column “Title” (Basketball_dataset) and column “Player” (Basketball_players) in order to find similarities within.
When found, rename the record from basketball_dataset to match the name of the most similar record from basketball_players (maybe using the Rule Engine node or something similar).
Please see the example below:
e.g. 2020-21 PANINI ILLUSIONS RICKY RUBIO #18 TIMBERWOLVES will probably be the most similar to the Ricky Rubio record, so the final result should be Ricky Rubio.
The same goes for 2014-15 RICKY RUBIO PANINI ELITE BASE CARD! → Ricky Rubio.
Hi @DivineBlink and welcome to the KNIME forum community
Have you tried to adapt the workflow I posted in this thread to your data? The same workflow is also available as a solution in this other solved thread:
Both workflow versions show how you could adapt the approach to your data and needs. Please give it a try and let us know if you have any questions. I will be happy to help further from there if needed.