Address split: help for a rookie

Hello All,

 

I have been using KNIME for about a week now, so I am still looking at the very basics and trying to get my head around how it works. So far I am quite pleased...

I have a task to do and am wondering whether KNIME is the right tool for it.

I have a list of address data (ca. 5000 lines) which I need to split.

The address returned from the source file is a single (SQL?!) string which is very inconsistent, as it comes from various countries, so there is no chance of splitting by position or by blank/delimiter.

I need to split it always to

Street name

House number

Post code

Town

Country.

Unfortunately, the positions and lengths of the elements always differ, as do the formats: in one case "15 test-road" (as in France/UK), in another "testroad 15" (as in Germany), not to mention the different post code lengths and formats. Also, an element can consist of a different number of words, e.g. "Teststraat" vs. "Rue de Test".

I have found something about the steps for text mining, but much of it is still Greek to me, and I struggle to work out which values to set in the nodes, e.g. for the tagging.

SAMPLE DATA:

10 FLEET PLACE LONDON EC4M 7RB United Kingdom

KAPELANIELAAN 8 9140 TEMSE Belgium

VIA SILVIO PELLICO N 6/8 20089 ROZZANO MI Italy

VIA BRIGATA REGGIO 27 42124 REGGIO NELL'EMILIA RE Italy

PORTE DES BÂTISSEURS(EST) 20 7730 ESTAIMPUIS Belgium

BOOMSESTEENWEG 957 2610 ANTWERPEN Belgium

Prumyslová 1428/10 PRAHA 15 - HOSTIVAR 102 00 PRAHA 102 Czech Republic

P O BOX 79 SWALLOWDALE LANE HEMEL HEMPSTEAD HERTS HP2 7HA United Kingdom

MERWEDEWEG 00007 3336LG ZWIJNDRECHT The Netherlands

Elektrárenská 4 83104 Bratislava 3 - Nové Mesto Slovakia

VIA GALILEO GALILEI 40 20092 CINISELLO BALSAMO MI Italy

2 MIDLAND WAY BARLBOROUGH LINKS DERBYSHIRE S43 4XA United KIngdom

VIA ALBERTO BERGAMINI 50 00159 ROMA RM United Kingdom

VIA CAMILLO OLIVETTI 2 20864 AGRATE BRIANZA MB Italy

VIA CARPINETANA NORD SNC 00034 COLLEFERRO RM Italy

VIA GIUSEPPE VERDI 0004 20090 ASSAGO MI Italy

VIA DESENZANO 7 20146 MILANO MI Italy

101 RUE BONGARDE 92230 GENNEVILLIERS France

PARC D'AFFAIRES ROOSEVELT 9 RUE DE LA PERLERIE 69120 VAULX EN VELIN Italy

Areal EDWARDS 101 RUE BONGARDE 92230 GENNEVILLIERS France

Any help or tips are welcome :-)

Jürgen

 

 

Hi Jürgen,

There are a few ways to solve your problem, but the quickest I can think of is to "outsource" it to one of the many free geocoding services from companies that have invested a lot of time and money in perfecting their algorithms.

For example: https://developers.google.com/maps/documentation/geocoding/start

Once you call the API, the result is returned as a JSON dataset which you can manipulate/convert in KNIME.
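
To give an idea of what handling that JSON involves, here is a minimal Python sketch (something you could adapt in a Python scripting node or run outside KNIME). It only builds the request URL and picks the wanted fields out of a response; it makes no network call. The `sample_response` is hand-written in the shape the Geocoding API documents (`results`, then `address_components` with `long_name` and `types`) and is not real API output. The helper names `build_request_url` and `extract_fields` and the `WANTED` mapping are my own assumptions, and you would need your own API key.

```python
import json
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

# Component types in the response, mapped to the columns wanted in the question.
WANTED = {
    "route": "street",
    "street_number": "house_number",
    "postal_code": "post_code",
    "locality": "town",
    "postal_town": "town",  # can appear instead of locality, e.g. for UK addresses
    "country": "country",
}

def build_request_url(address, api_key):
    """Return the GET URL for one address (nothing is sent here)."""
    return GEOCODE_URL + "?" + urlencode({"address": address, "key": api_key})

def extract_fields(response_text):
    """Pick the wanted address parts out of the first result, if there is one."""
    data = json.loads(response_text)
    fields = {}
    if data.get("status") != "OK" or not data.get("results"):
        return fields  # empty dict: send this row to manual review
    for component in data["results"][0]["address_components"]:
        for component_type in component["types"]:
            column = WANTED.get(component_type)
            if column and column not in fields:
                fields[column] = component["long_name"]
    return fields

# Hand-written sample in the documented response shape, NOT real API output.
sample_response = json.dumps({
    "status": "OK",
    "results": [{
        "address_components": [
            {"long_name": "10", "short_name": "10", "types": ["street_number"]},
            {"long_name": "Fleet Place", "short_name": "Fleet Pl", "types": ["route"]},
            {"long_name": "London", "short_name": "London", "types": ["postal_town"]},
            {"long_name": "United Kingdom", "short_name": "GB", "types": ["country", "political"]},
            {"long_name": "EC4M 7RB", "short_name": "EC4M 7RB", "types": ["postal_code"]},
        ]
    }],
})

print(build_request_url("10 FLEET PLACE LONDON EC4M 7RB United Kingdom", "YOUR_API_KEY"))
print(extract_fields(sample_response))
```

Returning an empty dictionary for anything other than an `OK` status keeps the unresolved rows easy to filter out and check by hand.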

If you want to solve the task entirely in KNIME, which I am sure is also possible, I would suggest you start from the end of the string by isolating the country, then work your way back. Each country has different rules for its addresses, so once you know the country you can switch between different sets of rules/cases.

Regular expressions are often employed for such tasks. See for example the Regex Split node.
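
As a rough sketch of what such per-country rules might look like, here are three patterns written in Python and tried against lines from your sample (with the country already removed). The patterns are my own simplified assumptions about Belgian, Dutch and French layouts, not complete rules, and `split_address` is a hypothetical helper. Python's `(?P<name>...)` named groups are used for readability; Java-style regex writes named groups as `(?<name>...)`, so the patterns need adapting before use elsewhere.

```python
import re

# One simplified pattern per country; each names the parts it captures.
PATTERNS = {
    # street, house number, 4-digit post code, town
    "Belgium": re.compile(
        r"^(?P<street>.+?)\s+(?P<house_number>\d+\S*)\s+"
        r"(?P<post_code>\d{4})\s+(?P<town>.+)$"
    ),
    # street, house number, post code of 4 digits plus 2 letters, town
    "The Netherlands": re.compile(
        r"^(?P<street>.+?)\s+(?P<house_number>\d+\S*)\s+"
        r"(?P<post_code>\d{4}\s?[A-Z]{2})\s+(?P<town>.+)$"
    ),
    # house number first, then street, 5-digit post code, town
    "France": re.compile(
        r"^(?P<house_number>\d+\S*)\s+(?P<street>.+?)\s+"
        r"(?P<post_code>\d{5})\s+(?P<town>.+)$"
    ),
}

def split_address(rest, country):
    """Return a dict of address parts, or None if no rule exists or matches."""
    pattern = PATTERNS.get(country)
    if pattern is None:
        return None
    match = pattern.match(rest)
    return match.groupdict() if match else None

print(split_address("KAPELANIELAAN 8 9140 TEMSE", "Belgium"))
print(split_address("MERWEDEWEG 00007 3336LG ZWIJNDRECHT", "The Netherlands"))
print(split_address("101 RUE BONGARDE 92230 GENNEVILLIERS", "France"))
```

Lines that do not fit the expected layout, such as "Areal EDWARDS 101 RUE BONGARDE 92230 GENNEVILLIERS" with its extra prefix, return `None` here, which is a convenient way to collect the exceptions for a second rule or a manual check.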

I hope this points you in the right direction. Keep posting here if you need more help on specific steps.

Cheers,
Marco.