I found an example workflow on the KNIME Hub that could potentially be very helpful for finding address duplicates in a given report. (access to the Example Workflow)
But… I have no idea how to configure the “String Distances” node, for example “Distance Selection” and “Gram Size”. Is it also necessary to use “GroupBy” in my case study? (I show you here my dataset for testing and also my workflow.) Regarding my use case: is it possible to customize the distance measure and then use the Similarity Search node to normalize account names and addresses according to my case study? How can I do this well?
Thank you very much in advance for your comments or corrections to my workflow.
Every node has a description where you can see the different Distance Selection options and some information about them. Since these are all different algorithms, you then need to look up the material on each algorithm to use it correctly. To do so, simply click on the orange text for each algorithm and you will be taken to the relevant Wikipedia page.
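If it helps to build some intuition before reading up, here is a rough plain-Python sketch of two of those measures, just to illustrate what “Distance Selection” and “Gram Size” control. This is an illustration only, not the node's actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def ngram_distance(a: str, b: str, n: int = 2) -> float:
    """Dice-style distance over character n-grams; n is the 'Gram Size'."""
    ga = {a[i:i + n] for i in range(len(a) - n + 1)}
    gb = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not ga or not gb:
        return 1.0  # a string shorter than n has no n-grams at all
    return 1 - 2 * len(ga & gb) / (len(ga) + len(gb))

print(levenshtein("Mueller GmbH", "Müller GmbH"))       # 2 edits
print(ngram_distance("bahnhofstrasse", "bahnhofstr."))  # ~0.22, quite similar
```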
What do you mean by “use the Similarity Search node in order to normalize account names and addresses”? The term normalization has a very specific meaning in NLP (natural language processing), so I just want to be sure we are talking about the same thing.
Something else to note is that the example you used had labeled data, which you do not seem to have (?). I don’t speak German, so I’m not sure. If you could provide English column names, that would be great.
So your approach will be slightly different. For instance, I did Excel → String Manipulation (strip(lowerCase($Firmenname$))) → GroupBy and took your original list down by 20 items.
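In case it helps to see the same idea outside of KNIME, a pandas equivalent would look roughly like this (the file name is made up; Firmenname is the column from the expression above):

```python
import pandas as pd

df = pd.read_excel("companies.xlsx")  # hypothetical file name

# Equivalent of strip(lowerCase($Firmenname$)) in String Manipulation:
df["name_clean"] = df["Firmenname"].str.lower().str.strip()

# Equivalent of the GroupBy: keep one row per distinct cleaned name.
deduped = df.drop_duplicates(subset="name_clean")
print(f"removed {len(df) - len(deduped)} rows")
```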
Thank you very much for your comments. I have already done some research on the Jaro-Winkler distance, N-gram Tversky distance, Levenshtein distance, etc. To be honest, it is difficult for me to get a general understanding of all of these definitions, because I do not have any background in this field…
But… I have still tried to figure out how I can get a decent result by trying different n-gram sizes, for example an 8-gram, 15-gram, or 28-gram size in the String Distances node for the street name, and a 1-gram or 5-gram size in another String Distances node for the house number, etc. But it doesn't work well. I think you have given me a good idea: I should first use String Manipulation and GroupBy to combine the street name and house number into one column. That makes the next step, “String Distances”, easier, doesn't it? I will try that first.
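One guess why the very large gram sizes do not work: a string only has len(s) − n + 1 character n-grams, so a 28-gram size on a short street name leaves nothing to compare at all. A quick check (the street value here is made up):

```python
def ngrams(s: str, n: int) -> set:
    """Character n-grams of s; empty when n > len(s)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

street = "hauptstrasse 12"  # 15 characters
for n in (2, 8, 28):
    print(f"{n}-gram size -> {len(ngrams(street, n))} grams")
# prints 14, 8 and 0 grams; with 28-grams there is nothing left to compare
```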
My rule is actually very simple: I just want to find the records that have not only the same/similar company name but also the same address (the same street name and house number, in the same city).
I have uploaded my test data in English. Thank you very much for your time and your comments in advance.
First you group by the street name, house number, postal code, and the city (this generates a list of unique rows with those 4 columns matching, reducing your list by 11 rows):
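For anyone following along outside of KNIME, that step is roughly this (the column names are my guesses from the translated data):

```python
import pandas as pd

df = pd.read_excel("addresses.xlsx")  # hypothetical file name
key = ["street_name", "house_number", "postal_code", "city"]  # assumed names

# One row per unique combination of the four address columns,
# i.e. the same effect as the GroupBy described above.
unique_rows = df.drop_duplicates(subset=key)
print(f"removed {len(df) - len(unique_rows)} exact duplicates")
```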
If this is still not sufficient, then please manually label all your rows as “original” or “duplicate”. That is, make a new column and label each data point so that we can try making a more complex workflow.
@DAmei I had another look and built a slightly over-engineered workflow to try to get what you want, combining some of the approaches already mentioned. In your case there might not be a ground truth against which to match, but rather a lot of possible combinations.
So some configuration will be necessary, and maybe exclude some obvious non-matches in the first place. What the workflow does:
the address is turned into a standardised string. The composition of the string will influence the matching. So, for example, if you want to stress the importance of the street you might add it a few times. Or you might just want to use the start of the company name, because this is mostly consistent
then you would generate a Counter as an artificial ID and start from the top
you match all lines that would match the first row according to a Range Filter (on the distance). If you think the matching is too wide you might want to restrict that setting further, down from 0.25
the workflow will store the already matched lines at each step, and if there is no match the line becomes a single entry
With that you should be able to build the groups of matching addresses. Each address can only be chosen once. You might have to tweak this approach; a rough sketch of the loop logic follows below.
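For illustration, the loop logic reads roughly like this in plain Python (difflib's ratio is only a stand-in for whichever String Distance you configure, and the column names are invented):

```python
import difflib

def address_key(row: dict) -> str:
    """The standardised string. The street is counted twice to stress its
    weight, and only the first 10 chars of the company name are used;
    both are choices you can tweak, as described above."""
    return " ".join([row["company"][:10].lower(),
                     row["street"].lower(), row["street"].lower(),
                     row["house_no"], row["city"].lower()])

def distance(a: str, b: str) -> float:
    # Stand-in for the configured distance: 0 = identical, 1 = nothing shared.
    return 1 - difflib.SequenceMatcher(None, a, b).ratio()

def group_duplicates(rows: list, threshold: float = 0.25) -> list:
    remaining = list(rows)   # plays the role of temp_store_remain
    groups = []              # plays the role of temp_store_processed
    while remaining:
        head, *rest = remaining          # the row with the lowest counter ID
        key = address_key(head)
        # The Range Filter: everything within `threshold` of the head matches.
        matched = [r for r in rest if distance(key, address_key(r)) <= threshold]
        groups.append([head] + matched)  # a head without matches stays a single entry
        remaining = [r for r in rest if r not in matched]  # each row used once
    return groups
```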
Thanks for your comments and the further question. Actually, there are no confirmed duplicates available to compare against. I just want to find out which records could be duplicates, or at least possible duplicates, if they have the same address and a similar company name, etc.
There are more than 150 thousand records in the report, and it is impossible for me to find them manually. Also, we try to avoid using Excel functions.
Thank you very much for your attention and your comments. Would it be possible for you to share the above-mentioned workflow in which the loop nodes are used? I would really like to get a better understanding of this workflow and then try to customize one for my question.
Thanks a lot in advance.
Thank you very much for your comments. I have inserted two GroupBys into my workflow. However, I cannot get a decent result at the end of the workflow. I think your idea is very meaningful, and the question now might be… how can I tweak the parameters in the String Distances nodes for the street name and house number? I have uploaded my latest workflow; maybe you have a better idea?
So cool. Yes, I can download this workflow now. Thank you very much for sharing your workflow with me. I will first try to understand the whole concept and the relationships among the nodes you have used. Thank you @mlauber71
Thank you very much for sharing your workflow with me. The first part of your workflow helps me select the absolute duplicates very well. I was so excited about the result. Thanks!
After that, I ran into a small problem in your workflow. I am wondering if you could tell/show me how you generated the Table Writer nodes “temp_store_remain.table” and “temp_store_processed.table”? I could not see what happens inside. Furthermore, the workflow from the Table Reader node “temp_store_remain.table” onwards cannot be executed any more, so I have no idea what happens at the end of the workflow.
Thanks for your comments in advance, and I wish you a nice day.
@DAmei glad you liked the initial approach. You should be able to reset the whole workflow and let it run again to see what is going on. The temp tables get created once before the start of the loop. They will have changed at the end, so in order to start the loop anew you would have to recreate them.
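In pseudo-Python the pattern is roughly this (parquet files stand in for the Table Writer/Reader .table files), which also shows why the stores must be recreated before a rerun:

```python
import pandas as pd

REMAIN, PROCESSED = "temp_store_remain.parquet", "temp_store_processed.parquet"

def reset_stores(df: pd.DataFrame) -> None:
    """Run once before the loop: everything remains, nothing is processed."""
    df.to_parquet(REMAIN)
    df.head(0).to_parquet(PROCESSED)   # empty table with the same schema

def loop_step(match_one_group) -> bool:
    """One iteration: move the matched rows from 'remain' to 'processed'.
    Because both files are overwritten, rerunning requires reset_stores()."""
    remain = pd.read_parquet(REMAIN)
    if remain.empty:
        return False
    matched = match_one_group(remain)  # e.g. the head row plus its matches
    pd.concat([pd.read_parquet(PROCESSED), matched]).to_parquet(PROCESSED)
    remain.drop(matched.index).to_parquet(REMAIN)
    return True
```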