How can I successfully identify and list duplicates in an address data set using the String Distances and Similarity Search nodes?

Hello Victor,

thank you very much for your comments. I have already done some research about the Jaro-Winkler distance, n-gram Tversky distance, Levenshtein distance, etc. To be honest, it is difficult for me to form a general understanding of all these definitions, because I do not have any background in this field… :sweat_smile:

But… I have still tried to figure out how to get a decent result by trying different n-gram sizes, for example 8-gram, 15-gram or 28-gram in the String Distances node for the street name, and 1-gram or 5-gram in another String Distances node for the house number, etc. But it does not work well. I think you have given me a good idea: I should first use String Manipulation and GroupBy to combine the street name and house number into one column. That makes the next “String Distance” step easier, doesn't it? I will try it first.
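As a side note for intuition: here is a small Python sketch (outside KNIME, with made-up street strings) of a Dice-style n-gram similarity, a special case of the Tversky measure. It is only a toy example, not what the String Distances node computes internally, but it shows why very large n-gram sizes score short address fields poorly:

```python
def ngrams(text: str, n: int) -> set:
    """Split a string into overlapping character n-grams."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dice_similarity(a: str, b: str, n: int) -> float:
    """Dice coefficient over character n-grams (Tversky with alpha = beta = 0.5)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# The same pair of street strings scored with different n-gram sizes:
for n in (1, 2, 3, 8):
    print(n, round(dice_similarity("Hauptstrasse 12", "Hauptstr. 12", n), 3))
```

The similarity drops sharply as n grows, because an 8-gram barely fits into a short field like a street name, so sizes like 15 or 28 leave almost nothing to compare.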

My rule is actually very simple: I just want to find the records that have not only the same or a similar company name, but also the same address (the same street name, house number, and city).
I have uploaded my test data in English. Thank you very much for your time and your comments in advance. :slight_smile:

Test_Datenquelle_Duplicate.xlsx (18.7 KB)

I have uploaded my dataset in English and German.
I will be very grateful if you can help me further :).

Is the address already what qualifies your duplicates (columns C-G)? If so, you can just use the Duplicate Row Filter node for that.
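For exact matches, a rough pandas equivalent of that node looks like this (a sketch only; the column names are assumptions, not the file's real headers):

```python
import pandas as pd

# A rough pandas equivalent of the Duplicate Row Filter for exact matches.
df = pd.read_excel("Test_Datenquelle_Duplicate.xlsx")
deduped = df.drop_duplicates(
    subset=["Street", "HouseNumber", "PostalCode", "City"],  # assumed names
    keep="first",  # keep the first occurrence of each address, drop the rest
)
print(len(df), "->", len(deduped))
```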
br

1 Like

I think two GroupBy nodes are sufficient for this task.

This takes your 62 rows down to 40 reasonably well.

First, group by the street name, house number, postal code, and city (this generates a list of unique rows where those four columns match, reducing your list by 11 rows):

You then want to use “First” in the “Manual Aggregation” tab to keep the relevant data:

In the second GroupBy, look for company names that match:
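In pandas terms the two steps look roughly like this (a sketch only; the column names are assumptions and the KNIME GroupBy nodes are the real implementation):

```python
import pandas as pd

df = pd.read_excel("Test_Datenquelle_Duplicate.xlsx")

# Step 1: one row per unique address; .first() plays the role of the
# "First" manual aggregation and keeps the remaining columns.
by_address = (
    df.groupby(["Street", "HouseNumber", "PostalCode", "City"], as_index=False)
      .first()
)

# Step 2: one reading of the second GroupBy, collapsing rows whose
# company names also match.
by_company = by_address.groupby("Company", as_index=False).first()

print(len(df), "->", len(by_address), "->", len(by_company))
```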


If this is still not sufficient, then please manually label all your rows as “original” or “duplicate”. That is, make a new column and label each data point so that we can try making a more complex workflow.

3 Likes

@DAmei I had another look and built a slightly over-engineered workflow to try and get what you want, combining some of the approaches already mentioned. In your case there might not be a ground truth against which to match, but rather a lot of possible combinations.

So some configuration will be necessary, and you may want to exclude the obvious cases first. What the workflow does:

  • the address is turned into a standardised string. The composition of the string will influence the matching. For example, if you want to stress the importance of the street you might add it a few times. Or you might just want to use the start of the company name, because this is mostly consistent
  • then you would generate a counter as an artificial ID and start from the top
  • you match all lines that are close to the first row according to a Range Filter (on the distance). If you think the matching is too wide you might want to restrict that setting further down from 25 (0.25)
  • the workflow stores the already matched lines at each step, and if there is no match the line remains a single entry

With that you should be able to build the groups of matching addresses. Each address can only be chosen once. You might have to tweak this approach.

The result would have a parent entry plus all of its ‘children’ (a sketch of the loop logic follows below).
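Not the workflow itself, but a minimal Python sketch of that loop logic. The threshold, the field names, and the use of difflib's SequenceMatcher as a stand-in for the distance node are all assumptions to experiment with:

```python
import difflib

def standardise(row: dict) -> str:
    """Build the artificial matching string; repeating a field gives it
    more weight, as suggested above."""
    return " ".join([
        row["company"][:20].lower(),   # only the start of the company name
        row["street"].lower(),
        row["street"].lower(),         # street repeated = extra weight
        row["house_no"],
        row["city"].lower(),
    ])

def distance(a: str, b: str) -> float:
    """A stand-in distance in [0, 1]; KNIME would use e.g. Levenshtein."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def dedupe(rows: list, threshold: float = 0.25) -> list:
    """Greedy grouping: take the next unmatched row as parent, attach all
    remaining rows within the distance threshold as children. Each row
    can be chosen only once."""
    remaining = [(i, standardise(r)) for i, r in enumerate(rows)]
    groups = []
    while remaining:
        (pid, pkey), rest = remaining[0], remaining[1:]
        children = [i for i, key in rest if distance(pkey, key) <= threshold]
        groups.append({"parent": pid, "children": children})
        remaining = [(i, k) for i, k in rest if i not in children]
    return groups

rows = [
    {"company": "ACME GmbH", "street": "Hauptstrasse", "house_no": "12", "city": "Berlin"},
    {"company": "ACME G.m.b.H.", "street": "Hauptstr.", "house_no": "12", "city": "Berlin"},
    {"company": "Beta AG", "street": "Ringweg", "house_no": "3", "city": "Bonn"},
]
print(dedupe(rows))  # the first two rows should end up in one group
```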

I might have another look at further techniques mentioned in the forum/hub and bring them into a similar approach.

Hello Daniel,

thanks for your comments and the follow-up question. Actually, there are no confirmed duplicates available to compare against. I just want to find out which records could be duplicates, or at least possible duplicates, if they have the same address and a similar company name, etc.

There are more than 150 thousand records in the report, and it is impossible to find them manually. Also, we are trying to avoid using Excel functions.

Hello mlauber71,

thank you very much for your attention and your comments :smiley:. Would it be possible for you to share the above-mentioned workflow in which the loop nodes are used? I would really like to gain a better understanding of it and then try to customise one for my question.
Thanks a lot in advance :blush:.

Hello Victor,

thank you very much for your comments. I have inserted two GroupBy nodes into my workflow. However, I cannot get a decent result at the end of the workflow. I think your idea is very meaningful, and the question now might be… how can I tweak the parameters in the String Distances nodes for the street name and house number? I have uploaded my latest workflow; maybe you have a better idea?

freshest Workflow_Deduplication_of_Address_Data.knwf (112.5 KB)

Please let me know your comments, ideas, or possible solutions regarding my workflow and my question. Thank you very much in advance!

@DAmei were you able to download this workflow from the hub and try it with your data?

2 Likes

So cool :smiley:. Yes, I can download the workflow now. Thank you very much for sharing it with me :blush:. I will first try to understand the whole concept and the relationships among the nodes you used. Thank you @mlauber71 :pray:

3 Likes

Hello @mlauber71,

Thank you very much for sharing your workflow with me. The first part of your workflow helps me select the absolute duplicates very well. I was so excited about the result. Thanks!

After that, I ran into a small problem in your workflow. I am wondering if you could tell or show me how you generated the Table Writer files “temp_store_remain.table” and “temp_store_processed.table”, because I could not see what happens inside. Furthermore, the workflow cannot be executed any more from the Table Reader node “temp_store_remain.table” onwards, so I have no idea what happens at the end of the workflow.

Thanks in advance for your comments, and I wish you a nice day.

@DAmei glad you liked the initial approach. You should be able to reset the whole workflow and let it run again to see what is going on. The temp tables are created once, before the start of the loop. They will have changed by the end, so in order to start the loop anew you would have to recreate them.

You can also run the loop step by step and see what happens in each iteration. The logic is: once an address has been chosen, it cannot be used for other matches.


All files should then be under the workflow in the /data/ folder.
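Not the actual KNIME nodes, but a minimal sketch of the temp-store pattern, with JSON files standing in for the two .table files, to show why a rerun needs the files to be recreated first:

```python
import json
from pathlib import Path

rows = [{"id": i} for i in range(5)]  # toy stand-in for the address table

# Written once, before the loop starts (mirrors the two Table Writer nodes):
Path("temp_store_remain.json").write_text(json.dumps(rows))
Path("temp_store_processed.json").write_text(json.dumps([]))

# Each iteration consumes from "remain" and appends to "processed", then
# persists both again. After the loop the files are "used up", which is
# why the workflow has to recreate them before it can run a second time.
while True:
    remain = json.loads(Path("temp_store_remain.json").read_text())
    if not remain:
        break
    processed = json.loads(Path("temp_store_processed.json").read_text())
    processed.append(remain.pop(0))  # a real step would move matched rows too
    Path("temp_store_remain.json").write_text(json.dumps(remain))
    Path("temp_store_processed.json").write_text(json.dumps(processed))
```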

Hi @mlauber71

thank you for your answer. I have tried to use “Step Loop Execution” to execute the rest of the workflow, but it does not work very well. Let me show you what happened in the following screenshot.

@DAmei I think you will have to reset the whole workflow and start again. The two temp files have to be fresh at the beginning of the loop. Some steps in the loop will not find a match, since the rows might be unique.

Thank you @mlauber71. Let me try it again :wink:

Hi @mlauber71,

after I reset the workflow, I got a result like this (see the screenshot). I am wondering if you could tell me how I can use the Distance values in the last column of the list to define the potential duplicates?

@DAmei you could use the slider to set a value between 0 and 100: the lower the value, the stricter the definition of a duplicate becomes. In your example data there is a seemingly identical value even though the name in the field is very long. You will have to experiment with the settings.

You might also think about limiting the number of characters you use to construct the company name. And if you want to stress the importance of a component, you could include it a few times to give it more weight.

These would be examples of shortening the names if you want them to count as duplicates.
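A toy illustration of both knobs (the company names and the 15-character cut-off are invented): the full names sit above a 0.25 distance threshold, while the truncated ones become identical:

```python
import difflib

def dist(a: str, b: str) -> float:
    """Stand-in for the distance node's normalised distance in [0, 1]."""
    return 1.0 - difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

a = "ACME Industrieanlagen Verwaltungs GmbH & Co. KG"
b = "ACME Industrieanlagen GmbH"

print(round(dist(a, b), 2))            # full names: fairly far apart
print(round(dist(a[:15], b[:15]), 2))  # truncated to 15 chars: identical
```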

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.