Text Similarity Search - error "Argument Contains Duplicates"

takeAfew · April 5, 2023, 3:44pm

Hello everyone, I may have a problem for you!

I am running a text similarity search based on the workflow from this post: https://forum.knime.com/t/compare-and-match-2-columns/36593/8?u=takeafew

My goal is almost the same as the original poster’s, but when I apply the workflow to my own files I get the error in the title, “Argument Contains Duplicates”. Does anyone know why it happens or how to fix it?
I’ve tried hard to find a solution but can’t (I should mention that I have little experience).

Thank you very much in advance to everyone for your help!

Similarity Test 2.3.knwf (77.0 KB)

aworker · April 5, 2023, 4:18pm

Hi @takeAfew

I have downloaded the workflow you posted (actually the same one I posted as the solution in the mentioned thread), but it doesn’t have your data attached. Could you please upload your Excel files here too so that we can check what is not working?

Thanks & regards,
Ael

takeAfew · April 6, 2023, 7:25am

Hi @aworker, I don’t think there could have been a better person to answer me.
I attached basically the same workflow as yours because I tested some variations without success, and yours is still, in my opinion, the closest to the result I need.

Here are the two example datasets:

Test1.xlsx (9.9 KB)
Test2.xlsx (10.7 KB)

They are very similar to each other. Let me try to explain the situation:

I have two retailers, each with several products that have different specs (brand, name, pack).
Some products are shared across the retailers, but possibly with slight differences.

My goal is to merge all the specs into one column (“unire”) and then run a Text Similarity to identify the “common” products. Ideally I could set a value x as a threshold and get a final table with the names of both “similar” products and their similarity score.

(Ex: | “Product Rtlr1” | “Product Rtlr2” | Similarity Score | )

I hope I have given you enough information. If not, I am always available to talk about it!
Thank you very much in advance to everyone for your help!

aworker · April 6, 2023, 9:47am

Hi @takeAfew

Thanks for your nice message and compliments !

Please find below the workflow adapted to your data:

I believe the difficulty was that the products didn’t have a column of unique identifiers, which the final -Joiner- node needs for matching.
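For readers who want to see why unique identifiers matter here, below is a minimal Python sketch, not the KNIME workflow itself. The product strings, the `R1_` prefix and the helper `add_row_ids` are all hypothetical. It only illustrates the general idea: if rows are keyed by a text value that can repeat, the repeats collide, whereas a generated identifier keeps every row distinct so matches can be joined back later.

```python
def add_row_ids(names, prefix):
    """Pair each product string with its own identifier, even if strings repeat."""
    return [(f"{prefix}{i}", name) for i, name in enumerate(names, start=1)]

# Hypothetical merged spec strings; note the repeated entry.
retailer1 = ["Acme Cola 6x330ml", "Acme Cola 6x330ml", "Bravo Chips 150g"]

# Using the product string itself as the key silently collapses the repeat.
by_name = {name: name for name in retailer1}

# Using a generated identifier keeps every row distinct.
rows = add_row_ids(retailer1, "R1_")
by_id = dict(rows)

print(len(by_name))  # 2
print(len(by_id))    # 3
print(rows[1])       # ('R1_2', 'Acme Cola 6x330ml')
```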

Hope it helps.

Best
Ael

takeAfew · April 6, 2023, 4:04pm

Hi @aworker, super, thank you so much!

However, I would like to ask you a question: do you know how this similarity works?

With about 10 elements per table I would have expected an exponential number of results, because every value in one table would be compared with every value in the other table. However, it seems to me that this is not the case…

Can you help me understand this, and if so, how would I go about getting “exponential similarity”?

I hope I have given you enough information. If not, I am always available to talk about it!
Thank you very much in advance to everyone for your help!

aworker · April 6, 2023, 5:00pm

Hi @takeAfew

I’m answering quickly and briefly because I’m already out of the office and replying from my mobile phone.

The -Similarity Search- node has two main options that control the number of matched pairs returned. The first is the maximum number of matches returned for each reference table row, and the second is the maximum allowed distance. When set, these two thresholds control the total number of matched pairs.
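To make the effect of these two settings concrete, here is a small Python sketch. It is an illustration under assumptions, not the node's actual implementation: the distance is a stand-in built on `difflib.SequenceMatcher`, and the names `similarity_search`, `max_neighbours` and `max_distance` are made up for this example.

```python
from difflib import SequenceMatcher

def distance(a, b):
    """Stand-in string distance in [0, 1]; 0 means identical (ignoring case)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity_search(queries, references, max_neighbours=1, max_distance=1.0):
    """For each query keep at most max_neighbours closest references
    whose distance does not exceed max_distance."""
    pairs = []
    for q in queries:
        scored = sorted((distance(q, r), r) for r in references)
        pairs.extend((q, r, d) for d, r in scored[:max_neighbours] if d <= max_distance)
    return pairs

# Hypothetical product strings for two retailers.
queries = ["Acme Cola 6x330ml", "Bravo Chips 150g", "Delta Soap 100g"]
references = ["ACME Cola 6 x 330 ml", "Bravo Chips 150 g", "Echo Tea 20 bags"]

print(len(similarity_search(queries, references, 1, 1.0)))  # 3: one match per query
print(len(similarity_search(queries, references, 3, 1.0)))  # 9: every pair is kept
print(len(similarity_search(queries, references, 3, 0.0)))  # 0: no identical strings
```

With one neighbour per row, 3 rows can yield at most 3 pairs. This is why a 10 x 10 comparison does not automatically return 100 rows unless both limits are relaxed.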

Hope it answers your question. Otherwise please reach out again.

Best
Ael

takeAfew · April 17, 2023, 7:05am

Hi @aworker ,
I am finally back in the office after vacation and first of all thank you very much for your reply.

I didn’t really understand the two options for controlling the number of pairs…

What I need is all the possible pairings. How would you recommend I do that?

In addition, since I see you know a lot about this: do you know whether the analysis can be run on more than two “datasets”? For example with three or four starting columns (the example flow above uses two).

How would you deal with this problem?

Thank you very much in advance to everyone for your help!

Daniel_Weikert · April 17, 2023, 4:19pm

“All options” depends on the distance threshold you set, I suppose. If you want everything compared with everything, do a cross join, but I do not see any value in that for similarity matching.
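As a rough illustration of the cross join idea, here is a Python sketch; it is not a KNIME node configuration. The similarity score is a stand-in based on `difflib.SequenceMatcher`, and the function `all_pairs`, the table names and the product strings are hypothetical. It scores every row of one table against every row of another, and repeats that for each pair of tables, so it also covers three or more datasets. Two tables of 10 rows give 10 x 10 = 100 pairs, so the growth is a product of the table sizes rather than exponential.

```python
from difflib import SequenceMatcher
from itertools import combinations, product

def similarity(a, b):
    """Stand-in similarity score in [0, 1]; 1 means identical (ignoring case)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def all_pairs(tables, threshold=0.0):
    """Cross join every pair of tables and keep pairs scoring >= threshold."""
    rows = []
    for (name_a, items_a), (name_b, items_b) in combinations(tables.items(), 2):
        for a, b in product(items_a, items_b):
            score = similarity(a, b)
            if score >= threshold:
                rows.append((name_a, a, name_b, b, round(score, 3)))
    # Best matches first.
    return sorted(rows, key=lambda row: row[-1], reverse=True)

# Hypothetical merged spec strings for three retailers.
tables = {
    "Rtlr1": ["Acme Cola 6x330ml", "Bravo Chips 150g", "Delta Soap 100g"],
    "Rtlr2": ["ACME Cola 6 x 330 ml", "Bravo Chips 150 g"],
    "Rtlr3": ["Acme Cola 330ml x6", "Echo Tea 20 bags"],
}

print(len(all_pairs(tables)))  # 16 = 3*2 + 3*2 + 2*2
for row in all_pairs(tables, threshold=0.8):
    print(row)
```

Raising `threshold` plays the same role as tightening the distance limit: the full cross join is computed, but only pairs above the chosen score are kept.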
br

system · April 24, 2023, 4:20pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.