right now I'm trying to find duplicates in a large table (> 5 million rows and 200 columns). But I have one problem, and to illustrate it I created this test table:
| number | typ1      | typ2     | typ3   |
|--------|-----------|----------|--------|
| 1      | typ123    | Haasd    | Dsfd   |
| 2      | typ123    |          |        |
| 3      | Dsfg      |          |        |
| 4      | Rttkjfrgd | Asdfasdf |        |
| 5      |           |          |        |
| 6      | Klpj      | Ssgf     | ty34   |
| 7      | Badfd     |          |        |
| 8      | Ladf      |          |        |
| 9      | Lqsda     |          |        |
| 10     | Safdsa    | Lada     | typ123 |
My problem is that I don't know how to compare one entry (e.g. number 1 / typ1 = typ123) with all the other entries in column "typ1", but also in "typ2" and "typ3", at the same time. And the same comparison should be done for every entry in each of the three "typ" columns. As a result I would like to get the distance (e.g. Levenshtein) for every comparison. I would be very happy if someone has a hint on how to handle this problem.
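To make the intended comparison concrete, here is a minimal Python sketch (outside KNIME): one entry is compared against every value in all three "typ" columns, using a plain dynamic-programming Levenshtein distance. The sample values are taken from the test table above; the dictionary layout is my own assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# A few values per column, taken from the test table above.
columns = {
    "typ1": ["typ123", "typ123", "Dsfg"],
    "typ2": ["Haasd", "Asdfasdf"],
    "typ3": ["Dsfd", "ty34", "typ123"],
}

query = "typ123"  # entry number 1 / typ1
for col, values in columns.items():
    for v in values:
        print(col, v, levenshtein(query, v))
```

The same double loop, repeated for every entry as the query, is exactly what blows up quadratically on the full table, as the reply below points out.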
Sorry for being negative, but if you want to calculate all pairwise distances over 5 million rows and 200 columns, that multiplies out to roughly 5 * 10^17 distances (10^9 cells, each compared with every other)... This will take ages, independent of the method you use.
However if you want to do this:
Use the Unpivoting node to rotate the multiple columns into one (the new table will have 200 * 5 million rows)
Then you can calculate the distance matrix and get what you wanted.
If you tell me what your goal is, I can try to find you a better solution. I highly doubt that calculating this distance matrix will be useful.
thank you very much for your help. And you're right, I cannot calculate the distance between all the entries. The problem is that there is no real pattern in the data, so pre-clustering is quite difficult, and algorithms like sorted-neighborhood are hardly applicable because I cannot define a key. Right now I'm trying to cluster the entries into smaller groups so that I can calculate the distances afterwards on fewer data sets. Another difficulty is that every column contains strings mixing characters and numbers, which makes this much harder than working with plain numbers (double or integer).
is there any good node for pre-clustering string data before using the Levenshtein method to get the distances?
Here's a quick idea that came to mind (without having put too much thought into it, and without knowing what your actual data really looks like):
1. Index the entire data into a Lucene index once: simply concatenate your columns, separated by spaces, into one long string, and use character n-grams as preprocessing (experiment with different n-gram lengths)
2. Iterate over your table and build a query in the same manner as (1) to find roughly similar rows
3. Apply whatever similarity measure you like to the most similar rows retrieved in (2)
There are Lucene nodes for KNIME, though admittedly I've never used them so far.
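Without Lucene at hand, the blocking idea in steps (1)-(2) can be sketched in plain Python with a simple inverted index over character n-grams; the row texts below are concatenated columns made up from the test table:

```python
from collections import defaultdict

def ngrams(text: str, n: int = 3) -> set:
    """Lower-cased character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Each row: its columns concatenated with a space separator (step 1).
rows = {
    1: "typ123 Haasd Dsfd",
    2: "typ123",
    10: "Safdsa Lada typ123",
}

# Index every row id under each of its n-grams.
index = defaultdict(set)
for row_id, text in rows.items():
    for g in ngrams(text):
        index[g].add(row_id)

# Step 2: for one row, collect candidate rows sharing at least one n-gram.
def candidates(row_id: int) -> set:
    found = set()
    for g in ngrams(rows[row_id]):
        found |= index[g]
    found.discard(row_id)
    return found

print(candidates(2))
```

Only the candidate pairs found this way would then go into the expensive distance calculation (step 3), which is what makes the quadratic blow-up avoidable.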
just another thought..
use the Column Combiner to combine all essential columns and run a GroupBy on the combined column with a count..
if count > 1 you have a duplicate..
left join the GroupBy table (one row per combined key) with the original table and you have a table without duplicates..
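The combine-and-count idea above can be sketched in plain Python (in KNIME this would be the Column Combiner and GroupBy nodes; the "|" separator is my own assumption):

```python
from collections import Counter

# Rows reduced to their essential columns; row 2 repeats row 1 on purpose.
rows = [
    ("typ123", "Haasd", "Dsfd"),
    ("typ123", "Haasd", "Dsfd"),
    ("Dsfg", "", ""),
]

# Combine the essential columns into one key per row (Column Combiner).
keys = ["|".join(r) for r in rows]

# Group by the key and count (GroupBy): count > 1 means a duplicate.
counts = Counter(keys)
duplicates = [k for k, c in counts.items() if c > 1]
print(duplicates)
```

Note this only catches exact duplicates; near-duplicates with typos would still need a distance-based approach like the n-gram blocking above.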