Remove duplicated words or terms for text in a cell

Hi,

I have a very dirty list of 8000 addresses that I need to clean up and geocode. The geocoding will be easy if I can clean up the addresses.

Is there a way for me to remove duplicate words/terms from a cell? For example :

VOORTREKKER STREET, PORTERVILLE, PORTERVILLE, CAPE TOWN
19 Auckland Street, 19 Auckland St Paarden Eiland, Cape Town
84 Harrington Street, 84 Harrington St Zonnebloem, Cape Town

Thanks,

tC/.

Hi @TigerCole

An option (but maybe a bit tricky) is to split the column with a Cell Splitter node (on whitespaces; remove input column) and after that aggregate all the columns together with the Column Aggregator node , option Unique Concatenate .

gr. Hans

Hello @TigerCole,

seems you are not dealing just with duplicates within one cell but also similar values like 19 Auckland Street vs 19 Auckland St so this does require a bit of work in my opinion. I would try splitting based on comma to get every value in separate column followed by similarity calculation between each value and logic to determine which value to keep (or drop). Once done combine values back and go on next row. This approach does require loop. For similarity calculation I suggest to search forum and Hub as there are many topics/workflows around it.

Br,
Ivan

2 Likes

19 Auckland Street vs 19 Auckland St or 19 Auckland Street vs 19 Auckland. The St could also stand for Saint where it could be Saint Paarden Eiland written as St Paarden Eiland, similarly Saint Zonnebloem as St Zonnebloem.

Basically, as @ipazin said, it’s not just dealing with duplicates. When it comes to address standardization, it’s a nightmare.

@TigerCole , a lot of geolocation tools take free text, as in you don’t have to specify which part is the street, which part is the city, etc… you would basically feed the lines that you have as is, and they do their magic in the backend to recognize the street, city, country, etc.

3 Likes

@TigerCole which geocoding service do you use (just curious). Also if you go with Bruno’s proposal feel free to share info about the result.
Thanks and best

Hi @Daniel_Weikert … I am still working on a solution, but each path has its compromises.

I am using the Google Geocoder Node at the moment just because it is simple and I can check results quickly. I also use the REST GET Request node to query the OSM API which is always free but is a little bit more work to get it working in KNIME and the results are not always as good as Google depending on where locations are. We save verified results to a local Gazetteer database.

When I find a workable solution, I will post it in the forum.

tC/.

1 Like

Hi @HansS … this sounds interesting. It is more or less along the lines of what I was looking to do but, as you said a bit tricky when the data contains things like “Street” and “St”. I am going to go down this rabbit hole for a while and see what happens.

1 Like

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.