String matching in one column

Hi!

I have a question regarding the matching of strings in one column.

There are three adress columns, which are partially filled but also contain empty entries. Initially, I replaced all empty entries with _ and combined the three address columns into a single one. To simplify, I only kept this adress column and the corresponding name column in the table. Then I removed all _.

My problem is that I don’t have two columns but one to compare. When testing with string similarity node every entry in the similarity column has the value 1 because I could only compare one address with itself.

How can I check if these addresses match or show similarities?

Thanks in advance!

Hey there,

without seeing your data it is a bit difficult to help with concrete advise.

I happen to recall that there was a Just Knime It Challenge on a very similar topic - here is the link to the “official” solution:

You can find the original challenge here (scroll down to Challenge 10):

More community solutions are linked in this thread:

https://forum.knime.com/t/solutions-to-just-knime-it-challenge-10-season-3/81162/26

And my own, very minimal, solution is here:

I think if you check out some of these solutions you should get a good feeling on how String Similarity can be used to address your problem.

1 Like

Hi Martin, thanks for your quick response!
In the workflows you sent, there is the column “Typo”, which I don’t have in my case.

I’ll make a quick example of what the table could contain:

Before merging the three address columns:

After merging:
image

Now I need to check how similar entries of this column are. The main cause is to get rid of duplicates, so duplicates musst be identified, even if they differ in order or if there are parts missing. This should work for bigger tables with about 100k rows and multiple addresses.

Thanks for the additional information.

I always recommend to provide an example WF that holds the example data - then people like me can quickly try out some things and get to a potential solution…

In the absence of that: Now that you have your list, one option is to first group by your new column and get the count of those entries, that appear multiple times.

Where the count is >1 you have your first duplicate flag.

Then you can filter the group by table for anything bigger than 1 and remove those entries from your main table via reference row filter.

The remaining ones you can then send into a chunk loop of size 1 and inside this loop you run each row and the remaining main table into string similarity - you’ll also need to build additional logic to flag your duplicates somehow - but that’s impossible to describe w/o data.

Hope this can help you to progress.

1 Like

Hi,
I would try following. Remove all non-alphanumeric signs in the strings and then sort each string.
After that you can apply the “remove duplicates” node.

So I’m not aware if sorting of strings can be done with a KNIME node but it is possible in python:

import knime.scripting.io as knio
import pandas as pd

# This example script simply outputs the node's input table.
df = knio.input_tables[0].to_pandas()

# Function to sort letters in a string
def sort_letters(s):
    return ''.join(sorted(s))

# Apply the function to the column
df['sorted_strings'] = df['adress'].apply(lambda x: sort_letters(x))

# Output
knio.output_tables[0] = knio.Table.from_pandas(df)

For example:

      strings sorted_strings
0       hello        ehllo
1       world        dlorw
2      pandas        aadnps
3  dataframe   aadeefmrt
1 Like

Hi Andy,

thanks very much for your suggestion. The algorithm would work if the entries are correctly filled. The problem I have relates to inconsistently entered entries. I made up an example-table for better understanding.
stringmatchingtest.knwf (70.7 KB)
Maybe this helps to clarify or to test.

Thanks in advance.

Hi Martin!

I tried implementing a workflow like you suggested but after joining the table (which I thought was necessary to compare every row with each other) and checking with string similarity node in a chunk loop, the rows where squared and due to the large number of rows it would not terminate. Probably I chose the wrong path of solving the problem so I made up an example-workflow for you to test.
stringmatchingtest.knwf (70.7 KB)

Keep in mind that data sets I need to work with are bigger in size, so the solution should work for those also.

Thank you very much!

Hmm,
in your first you referenced to one or more “adress” columns. I cannot see anything like this in your “stringmatchingtest.knwf”

I tried to explain how the address column was formed.

https://forum.knime.com/t/string-matching-in-one-column/86033/3?u=roman_huebinger

This may help you. Initially there were three address columns (e.g. for street, postal code, city). But due to inconsistent entries there are missing values, wrong orders or wrong spellings. Thats why I merged those columns into one for a better comparison.