Extract terms in a list from text string

Hello All,

I have a problem I need to solve in processing addresses.

I have a list of ca. 50 counties in the UK and a table of 2500 UK adresses.

The counties are contained in each of the 2500 text strings.

I need a kind of (fuzzy/not exact match) v-lookup function, which reads the list of counties and checks each of the rows for each of the counties in my list.

If a county name is found, it is returned in a new column.

Example:

So list of values to be checked in table/counties is:

county 1

county 2

county 3

Adresses in table are (all just one string, no commas/delimiters between):

Street 1 county 3 12345 town

Street 1 county 1 56789 town

Street 1 county 4 12121 town

Street 3 county 2 88888 town

Results would be in new column

county 3

county 1

blank (as county 4 is not on the list of values to check)

county 2

how can I achieve that. checked string manipulation and some of the dict. splits, but the latter just filter. I want to extract the list values from the strings, so no luck till date.

Thanks a lot,

Jürgen

Hi Jürgen,

see the attched workflow which solves your use case using KNIME's Text Processing capabilities. I have commented it as much as possible. Feel free to reply back here if something is unclear.

Cheers,
Marco.

Hi Marco,

many thanks, that worked quite nicely :-)

just another quetsion on this, please..

When I tried to extend this a bit onto the street names I had quite a low "hit" rate.

After which logic is this node searching...?

Example is the below street names contained in a string (exactly this spelling, position of blanks etc.).

Just the 10th one was found as "Heuston South quarter".

The settings were "Exact match" and "Set named entities unmodifieable."

Unticking "exact match" is increasing the hit rate, yet includes parts not belonging to the dict. list in this street name I put in the dict. list. So not an option.

List of adresses to be searched:

PEMBROKE HOUSE ,UPPER PEMBROKE STREET 28-32 ,DUBLIN 2
4 - 6 RIVER WALK ,CITYWEST BUSINESS CAMPUS ,SAGGART ,DUBLIN 24
1ST FLOOR ,118 LOWER BAGGOT STREET ,DUBLIN 2
METROPOLITAN BUILDING ,JAMES JOYCE STREET ,DUBLIN 1
B11 ,BALLYMOUNT CORPORATE PARK ,BALLYMOUNT ,DUBLIN 12
175 IVY EXCHANGE ,PARNELL SQUARE WEST ,DUBLIN 1
PO BOX 201 ,BLUEBELL ,DUBLIN 12
6 FAUGHART TCE ,ST MARYS RD ,DUNDALK ,CO LOUTH
3RD FLOOR KILMORE HOUSE ,PARK LANE ,SPENCER DOCK ,DUBLIN 1
MASTER DATA CENTRE 3C31 ,1 HEUSTON SOUTH QUARTER ,ST JOHNS ROAD ,DUBLIN 8
21 BECKETT WAY ,PARKWEST ,DUBLIN 12

Dictionairy entries

UPPER PEMBROKE STREET
CITYWEST BUSINESS CAMPUS
LOWER BAGGOT STREET
JAMES JOYCE STREET
BALLYMOUNT CORPORATE PARK
PARNELL SQUARE WEST
BLUEBELL
ST MARYS RD
PARK LANE
HEUSTON SOUTH QUARTER
PARKWEST

Found streets in output table:

Just Heuston South quarter

What went wrong here, any idea? I would like to get exactly just what is in the dict list, so include a few thousand street names from alist I got to the dict. and find just exact these part if they are contained.

Thanks a lot,

Jürgen

 

 

Hi Jürgen,

I have looked into your problem. It appears that the lack of matches happens in cases where the string to match starts with a comma. To solve this, I have expanded Marco's example workflow to include a String Manipulation node to create a new column in which commas are removed. By doing so, I was able to match all the street names.

Hope that helps! :-)

Best,

Roland