Index and query addresses

Demonstrates the usage of the Indexing & Searching plugin for address data base cleansing. The workflow uses fuzzy queries to identify typos in a given address data base. First an index is created on the data table. Once the index has been created it can be queried using a powerful query language. The query syntax bases on the Lucene query syntax and supports among others phrase queries, wildcard queries, fuzzy queries, proximity queries, range queries, term boosting, grouping and boolean operators.


This is a companion discussion topic for the original entry at https://kni.me/w/f9zfN5zhDNPHEP54

Hi, I have checked many post and this workflow seems to be the best for fuzzy address matching I have found so far! The issue I am facing is that the dataset used in the example assumes a “clean and complete” dataset (i.e. cannot cope with nulls in different columns) here is an example of the original dataset I am trying to fuzzy match:

image

As you can see from the image the issue is that data entry hasn’t been great and I end up with records in just one column. I thought the loop query could deal with the null values as it seems to put all records into a single string query (unless I am misinterpreting the workflow?).
The problem I have now is that the workflow does not allow me to progress with data structured in this way:

Would anyone have an advice on how to overcome this issue and proceed with the fuzzy matching? thanks in advance

Hi @B074534 , I remember about 5 years ago being presented with a data migration to a new system that required addresses to be practically perfect even though the old system had similar data entry issues that you are facing. The end result was me sitting with a spreadsheet, plus google, painstakingly trying to manually fix (with a little automation) a few thousand records. Luckily it wasn’t more.

The trouble is, that addresses can be typed in so many ways that it is almost impossible to fix perfectly so I am sure you realise the best you can achieve is an improvement in what you already had.

I have been thinking over this for the past day and had thought up one plan involving downloading global town/partial postcode information that I found at geonames.org, with the idea being that if you could determine the country that the address belonged to, you could better gain an understanding of the standard address format for that country and then maybe pull other information out. I actually put together the start of that flow, and it successfully identified the countries for each of your sample addresses, but it was probably going to take a long time to run over a large data set.

So I set about thinking of something more simple that would probably be about the same effectiveness in a greatly reduced timespan.

This cannot be perfect, but for a good many cases, looking at a standard street address in whichever format, the first numeric sequence encountered is the building number, and the next numeric encountered is the postcode.

I also assumed that the address will always be divided by commas, but I know that in reality this isn’t always true. I don’t know a good answer to that part.

Anyway so I worked on the basis of these certain assumptions and attached is the workflow I produced. Maybe it can give some ideas on how to improve, or maybe you can think of some other refinements for specific cases that you can identify.

I made a spreadsheet based on your sample data

And it takes this and produces new columns predicting the address components:

KNIME_infer_addresses.knwf (78.6 KB)

If the UK postcodes had been embedded only within the Street address, rather than already isolated, then these would have to have been detected separately (probably some first stage regex looking for UK postcode patterns and pulling those out first.

Anyway, I hope it helps, or at least gives some ideas.

3 Likes

Hi @takbb . Google also has an api that can “standardize” addresses, but I think it’s not free.

2 Likes

Hi @bruno29a , yes I did see that. One day it would be interesting to give it a try. I guess with a big database of place names, some AI, and lots of processing power they would hopefully make a reasonable job of it! :wink:

Address data has to be one of the worst for poor data-entry, along with it being a data set that has so much variation… it’s so much fun! :slight_smile:

Hi, @takbb Thanks for the solution provided, unfortunately I haven’t been able to test it yet because couple of nodes couldn’t be loaded:

I tried installing NodePit to try and get Palladian nodes but NodePit plugin is not playing ball, does anyone have any suggestions how to get around this?

hi @B074534 ,

Re nodepit, if you haven’t used that before, have you created the nodepit account, which I think (iirc) needs to be done separately? I suspect you have, so apologies if that sounds like I’m being patronising - just covering possibilities. :wink:

Also have you tried installing Palladian nodes directly from File - Install KNIME extensions…
image

1 Like

Please see the thread below on how to access the Nodepit update site and get access to those nodes:

3 Likes

Hi @takbb, thanks for your prompt reply. Yes I have created the NodePit account and followed the steps to add my login details, thanks!

Regarding the second step, I haven’t tried installing the Palladian nodes directly from file. I will give it a go tomorrow as I am logging off for the day…

Are we not supposed to use this? Palladian for KNIME [Feature] — NodePit
What’s the difference with the post you referred to?

I can see that the URL’s are different:

I assumed the one with the latest date was the one to use?

Yes, use the updated URL. I mainly just wanted to point you to @qqilihq 's explanation, since his team develops those nodes :slight_smile:

2 Likes

I have got this error now:

Any clues how to overcome this?

For some reason I don’t have palladian as an option to install:

Hi @B074534 , I’m not sure why either of those things are occurring, but the “Set Values to Missing” node is actually just a ciomponent, so in theory you could lose the “unable to update” issue by right clicking on it and choosing “disconnect link” from the component menu.

It won’t get any updates, but that didn’t ought to be a problem at least for this flow.

If you wish to try to get it connected propertly, you could replace the one on the workflow with the one on the hub.
The component is available from here:

It would be a drop-in replacement and just needs the “Missing Value Placeholder for Strings” to be made changed from “UNKNOWN” to a zero length string (i.e. delete the word UNKNOWN)
image
→ image

Palladian Nodes:

This is my location entry on “Available Software Sites”, which can be managed via the Manage button on the “Install” dialog, or via File - Preferences menu

http://download.nodepit.com/palladian/4.3

1 Like

@takbb thanks for your reply.

I don’t know why I cannot see this buttons when I open the Install window:


This is what I can see:

@B074534
Sorry… my bad… kind of :wink:

There are two Install menus! The one off the File Menu (Install KNIME Extensions) is I think the one you are seeing without those buttons.
Try the one off the Help menu (Install New Software), which in my case does have those buttons. I’m not sure if one is a subset of the other, or totally different…so I hope I’m not giving out wrong information here…

1 Like

Thank you again, this was really the solution! Look at the nodes now:

3 Likes