Hi KNIME experts,
Hope you are doing well.
I need your help comparing certain words so that I can make progress in my workflow.
Name of Vendor:
Qwerty Private Limited
Qwerty Private Ltd.
Qwerty Pvt. Ltd.
All three names belong to a single vendor, but KNIME shows them as 3 unique names. What steps should I take, or what changes should I make, so that KNIME treats all 3 as the same?
Thanks in advance.
You can replace Pvt. and Ltd. (and any other similar abbreviations) with their complete forms or vice versa.
@armingrudd Thanks, but this is not only for this vendor; there are many vendors. Can you suggest a specific node that I can use?
Can you provide more examples?
- Limited vs Ltd
- Ltd vs Ltd.
- Ltd. vs Limited
- Pvt vs Pvt.
- Pvt vs private
- private vs Pvt.
- Same name with extra characters, e.g. ".", a space, ",", etc.
We need to define a condition so that KNIME treats these as one name.
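For illustration only, here is a minimal Python sketch of the rule-based cleaning that the list above describes: lowercase the name, drop punctuation, expand abbreviations, and collapse extra spaces. The `ABBREVIATIONS` map and the `normalize_vendor` function are hypothetical names made up for this example, not part of KNIME; the same rules could presumably be built inside KNIME with string manipulation nodes.

```python
import re

# Hypothetical abbreviation map (an assumption for this sketch);
# extend it with whatever variants appear in your own data.
ABBREVIATIONS = {
    "pvt": "private",
    "ltd": "limited",
}

def normalize_vendor(name: str) -> str:
    """Lowercase, strip punctuation, expand abbreviations, collapse spaces."""
    text = name.lower()
    # Replace anything that is not a word character or whitespace with a space,
    # so "Pvt." becomes "pvt " and "Pvt,Ltd" becomes "pvt ltd".
    text = re.sub(r"[^\w\s]", " ", text)
    # split() without arguments also collapses repeated spaces.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

names = ["Qwerty Private Limited", "Qwerty Private Ltd.", "Qwerty Pvt. Ltd."]
normalized = {normalize_vendor(n) for n in names}
print(normalized)  # {'qwerty private limited'}
```

After this step all three spellings from the first post collapse to one string, so an ordinary group-by or duplicate check sees a single vendor.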
Check this forum post about finding duplicate companies.
Just one question: how do I get the String Similarity node?
That node is not in my arsenal.
The String Similarity node comes with the Palladian nodes. Additional extensions are available by enabling the update site in KNIME via File -> Preferences -> Install/Update -> Available Update Sites, then Add: http://download.nodepit.com/palladian/4.0 and then it can be installed by selecting File -> Install KNIME Extensions (source: https://www.knime.com/community ).
Hi @HansS, but with String Similarity we only get a percentage of similarity. How can we use it further to treat the names as one?
That is a good question. I think you need a human in the loop. You have to find out how low a similarity percentage you can accept and still be sure (to some level) that it is the same company. Maybe it is a good step to first “clean” your data and enrich it with some business logic, translated to rules (as @armingrudd suggested).
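One way to picture the "human in the loop" idea is a two-threshold triage: pairs with a very high score are merged automatically, pairs with a very low score are kept apart, and only the uncertain middle band goes to a person. The sketch below is just an assumption about how that could look; the function name `triage`, the cut-off values, and the example scores are all hypothetical and would need tuning on real data.

```python
def triage(score: float, accept_at: float = 0.9, reject_below: float = 0.4) -> str:
    """Sort a similarity score (0..1) into one of three buckets.

    The two cut-offs are illustrative assumptions, not recommended values.
    """
    if score >= accept_at:
        return "merge"          # confident enough to treat as the same vendor
    if score < reject_below:
        return "keep separate"  # confident enough to treat as different vendors
    return "review"             # uncertain: let a human decide

# Hypothetical similarity scores for three candidate pairs.
for score in (0.95, 0.70, 0.20):
    print(score, triage(score))
```

The human then only looks at the "review" bucket, and their decisions can be turned into new cleaning rules over time.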
Still, how will involving a human help here?
All the names are for the same vendor, but one is 72.2% similar and another is only 41.2%. How do I solve this? Also, this is just an example for 1 vendor; there are many more. In that situation we would need a different range for each vendor, which would become tedious.
Please guide …
You may need to play with the n value for the n-gram and see which one works better.
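To get a feel for how n changes the score, here is a small Python sketch that compares two names using character n-grams. It uses the Dice coefficient over n-gram sets, which is an assumption chosen for illustration; the String Similarity node may compute its n-gram score differently, so the numbers will not necessarily match KNIME's.

```python
def char_ngrams(text: str, n: int) -> set:
    """Return the set of character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a: str, b: str, n: int) -> float:
    """Dice coefficient over character n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # at least one string is shorter than n
    overlap = len(grams_a & grams_b)
    return 2 * overlap / (len(grams_a) + len(grams_b))

pair = ("Qwerty Private Limited", "Qwerty Pvt. Ltd.")
for n in (2, 3, 4):
    print(n, round(ngram_similarity(pair[0], pair[1], n), 3))
```

Running the loop on a handful of known-duplicate and known-different pairs from your own list is a quick way to see which n separates them best.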
@armingrudd Can you please guide me? I am stuck here.
27_Data.xlsx (85.3 KB)
These are the names of the vendors.
As @HansS suggested, you can first clean the strings (a bit) and then find similar names to replace.
Here I have built a workflow. I hope it helps. And of course, any feedback would be appreciated (@mlauber71, @ipazin, @HansS).
23094-1-1.knwf (388.0 KB)
I have set the threshold to 0.8 but you can change it to your desired value.
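For readers who cannot open the workflow, the general "clean first, then group similar names above a threshold" idea can be sketched in Python as below. This is not a reproduction of the attached workflow: it uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, a simple greedy grouping, and made-up example names, all of which are assumptions for illustration. Only the 0.8 threshold is taken from the post.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # value mentioned in the post; change it to suit your data

def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 (stand-in for the KNIME node)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_names(names, threshold=THRESHOLD):
    """Greedily map each name to the first earlier name it resembles."""
    representatives = []
    mapping = {}
    for name in names:
        for rep in representatives:
            if similarity(name, rep) >= threshold:
                mapping[name] = rep  # treat as the same vendor as rep
                break
        else:
            # No existing representative is similar enough: start a new group.
            representatives.append(name)
            mapping[name] = name
    return mapping

# Hypothetical, already-cleaned names (one contains a typo).
cleaned = ["qwerty private limited", "qwerty private limted", "asdf industries"]
mapping = group_names(cleaned)
for original, canonical in mapping.items():
    print(original, "->", canonical)
```

A greedy pass like this depends on the order of the names, so it is worth spot-checking the resulting groups before replacing anything in the real data.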
Hi!! @armingrudd really kind of you.
Hi there @armingrudd,
I didn’t look in detail, but it seems that in such cases the Index Query node is a better option. I have used it a couple of times and was really satisfied with the outcome. I can try to create an example and compare the workflows and results.