address Data Standardization

Hello everyone,

I am looking for a method to standardize an address
So, I have a field with several information such the first name, last name, address, city, postal code…

The idea is to identify each part and isolate the information in a new field.

For example, I will have a column with the first name, one with the last name, a column with the street number, one with the name of the street, one with the city, one with the Zip code…

The problem is that thoose informations are in a different order. I was thinking about using regular expressions but it seems to be difficult to find a pattern wich work all the times.

I would like to be able to use machine learning techniques, for example by creating an algorithm that could identify each piece of information, based on data that has already been clean. Perhaps with a multitude of data, the algorithm will be able to identify the name, the city …

Unfortunately I don’t know how the machine learning algorithms work in this case, but it’s something I’d like to learn how to use.

So if you can help me move forward on this project, I would be very grateful to you.
Standardization_data.xlsx (12.2 KB)

I am attaching an example file to show you the expected result.

Thank you

Hi @Grayfox , if you could at least separate the Names from the Address, there are a few tools, especially geolocation applications, that can standardize given addresses.

Google has an API that can do that, but I think it’s a paid service to use their api. You can check online, or if Community members might know of free ones.

So basically, you would contact these apis via a web service node from Knime and send the address that you have, and the api should send you back the standardized version of it.

4 Likes

Hi @bruno29a

Thank you for your solution. I am going to use you’re method.

I think I was too ambitious in wanting to create my own algorithm to identify addresses.

1 Like

No problem @Grayfox . Yes, it seems like it’s a whole science behind doing this, so it’s not quite straight forward to implement, plus I believe that these tools may also have a database of addresses that they may match against. I doubt you have this kind of data :slight_smile:

I saw a few free apis, you just need to sign up to get an api key. They limit the number of lookups per day, and also per second.

You may want to send them in batches (chunks) and add a few seconds delay (wait) between each iteration.

You can check https://www.geoapify.com/geocoding-api

It seems to be free. You can pass free-form address as GET Requests, such as text=whateveraddress. If you have spaces, just do url encode of the address.
For example, if you can pass the address “whatever address canada”, it would look like this:

https://api.geoapify.com/v1/geocode/search?text=whatever%20address%20canada&apiKey=YOUR_API_KEY

More details on that page.

1 Like

Hi @Grayfox , I put something together for you for Geoapify. You will just need to get an apikey from them, which is free.

I used the information from your excel file. The workflow looks like this:
image

I basically used the first column (Source Field), but removed the first and last names from the column, and created a URL for the geoapify’s api, for each rows:

The API KEY is set as “Enter API KEY here” in the workflow as you will need to enter your own API KEY.

After that, it will run in chunks of 5, and wait for 5 seconds between each batch/iteration:
image

Here’s the workflow:
Geoapify address standardization.knwf (28.2 KB)

You’ll just need to add your API KEY.

7 Likes

Thank you @bruno29a

I am very impressed with the work you have done.

Your workflow will help me a lot.

Again, thank you very much.

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.