Address deduplication - how to set up "class column"

jaakko1 · November 14, 2014, 5:18pm

Hello all,

I recently came across this blog post from KNIME about the new data deduplication possibilities:

http://www.knime.org/blog/address-deduplication

The example dataset already includes a "class column", which defines which records are duplicates of each other. When I try to dedupe my own data, though, my biggest problem is building this column: the data has no matching fields that I could group on (which is exactly why it needs deduping in the first place).

Let's say, for example, my data has only one string column with the names of companies:

Company

McDonalds

Macdonalds

Mc Donalds

How can I create a class column for this?

Thanks in advance for any help!

-J

aborg · November 14, 2014, 5:47pm

I think the Levenshtein distances between different spellings of the same entity will be below a certain threshold, so you could group the records whose distance falls under it. You will still have to filter out the false positives manually, though.
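To make the idea concrete, here is a rough sketch in plain Python (not a KNIME workflow, and not the method from the blog post). It computes the Levenshtein distance between every pair of names and puts names that are within a threshold of each other into the same class. The function names `levenshtein` and `assign_classes`, the `max_distance=2` threshold, the lowercase/strip-spaces normalization, and the extra "Burger King" row are all assumptions made for illustration; the threshold in particular would need tuning on real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def assign_classes(names, max_distance=2):
    """Return one class id per name; names linked by a small distance share an id."""
    parent = list(range(len(names)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumed normalization: ignore case and spaces before comparing.
    normalized = [n.lower().replace(" ", "") for n in names]

    # Compare every pair (quadratic, so only suitable for small tables).
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if levenshtein(normalized[i], normalized[j]) <= max_distance:
                parent[find(j)] = find(i)

    # Turn the group roots into consecutive class ids 0, 1, 2, ...
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(names))]


companies = ["McDonalds", "Macdonalds", "Mc Donalds", "Burger King"]
classes = assign_classes(companies)
for name, cls in zip(companies, classes):
    print(cls, name)
```

With these assumptions the three McDonalds spellings end up in one class and the made-up "Burger King" row in another. Because groups are merged transitively, a chain of near matches can pull unrelated names together, which is where the manual check for false positives comes in.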

Cheers, gabor