I have a problem I need to solve in processing addresses.
I have a list of ca. 50 counties in the UK and a table of 2500 UK adresses.
The counties are contained in each of the 2500 text strings.
I need a kind of (fuzzy/not exact match) v-lookup function, which reads the list of counties and checks each of the rows for each of the counties in my list.
If a county name is found, it is returned in a new column.
So list of values to be checked in table/counties is:
Adresses in table are (all just one string, no commas/delimiters between):
Street 1 county 3 12345 town
Street 1 county 1 56789 town
Street 1 county 4 12121 town
Street 3 county 2 88888 town
Results would be in new column
blank (as county 4 is not on the list of values to check)
how can I achieve that. checked string manipulation and some of the dict. splits, but the latter just filter. I want to extract the list values from the strings, so no luck till date.
Thanks a lot,
see the attched workflow which solves your use case using KNIME's Text Processing capabilities. I have commented it as much as possible. Feel free to reply back here if something is unclear.
many thanks, that worked quite nicely :-)
just another quetsion on this, please..
When I tried to extend this a bit onto the street names I had quite a low "hit" rate.
After which logic is this node searching...?
Example is the below street names contained in a string (exactly this spelling, position of blanks etc.).
Just the 10th one was found as "Heuston South quarter".
The settings were "Exact match" and "Set named entities unmodifieable."
Unticking "exact match" is increasing the hit rate, yet includes parts not belonging to the dict. list in this street name I put in the dict. list. So not an option.
List of adresses to be searched:
|PEMBROKE HOUSE ,UPPER PEMBROKE STREET 28-32 ,DUBLIN 2
|4 - 6 RIVER WALK ,CITYWEST BUSINESS CAMPUS ,SAGGART ,DUBLIN 24
|1ST FLOOR ,118 LOWER BAGGOT STREET ,DUBLIN 2
|METROPOLITAN BUILDING ,JAMES JOYCE STREET ,DUBLIN 1
|B11 ,BALLYMOUNT CORPORATE PARK ,BALLYMOUNT ,DUBLIN 12
|175 IVY EXCHANGE ,PARNELL SQUARE WEST ,DUBLIN 1
|PO BOX 201 ,BLUEBELL ,DUBLIN 12
|6 FAUGHART TCE ,ST MARYS RD ,DUNDALK ,CO LOUTH
|3RD FLOOR KILMORE HOUSE ,PARK LANE ,SPENCER DOCK ,DUBLIN 1
|MASTER DATA CENTRE 3C31 ,1 HEUSTON SOUTH QUARTER ,ST JOHNS ROAD ,DUBLIN 8
|21 BECKETT WAY ,PARKWEST ,DUBLIN 12
|UPPER PEMBROKE STREET
|CITYWEST BUSINESS CAMPUS
|LOWER BAGGOT STREET
|JAMES JOYCE STREET
|BALLYMOUNT CORPORATE PARK
|PARNELL SQUARE WEST
|ST MARYS RD
|HEUSTON SOUTH QUARTER
Found streets in output table:
Just Heuston South quarter
What went wrong here, any idea? I would like to get exactly just what is in the dict list, so include a few thousand street names from alist I got to the dict. and find just exact these part if they are contained.
Thanks a lot,
I have looked into your problem. It appears that the lack of matches happens in cases where the string to match starts with a comma. To solve this, I have expanded Marco's example workflow to include a String Manipulation node to create a new column in which commas are removed. By doing so, I was able to match all the street names.
Hope that helps! :-)