Compare one entry with all rows of multiple columns

Hi,

right now I'm trying to find duplicates in a large table (> 5 million rows and 200 columns). I ran into one problem, so I created this test table:

number  typ1       typ2      typ3
1       typ123     Haasd     Dsfd
2                  typ123
3       Dsfg
4       Rttkjfrgd  Asdfasdf
5
6       Klpj       Ssgf      ty34
7       Badfd
8       Ladf
9       Lqsda
10      Safdsa     Lada      typ123

My problem is that I don't know how to compare one entry (e.g. number 1 / typ1 = typ123) with all the other entries in column "typ1", but also in "typ2" and "typ3", at the same time. The same comparison should be done for every entry in each of the three "typ" columns. As a result I would like to get a distance (e.g. Levenshtein) for every comparison. I would be very happy if someone has a hint on how to handle this problem.
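
To make it a bit more concrete, the brute-force version of what I have in mind would look roughly like this in Python (just a sketch with a hand-written Levenshtein function, not something I expect to run on the full table):

```python
# Naive sketch: compare every non-empty entry of the "typ" columns with
# every other entry and print the Levenshtein distance for each pair.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

rows = [
    (1, "typ123", "Haasd", "Dsfd"),
    (2, "", "typ123", ""),
    (10, "Safdsa", "Lada", "typ123"),
]

# Flatten the three "typ" columns, skipping empty cells.
entries = [(num, col, val)
           for num, *vals in rows
           for col, val in zip(("typ1", "typ2", "typ3"), vals)
           if val]

# Compare every entry with every other entry, regardless of the column.
for i, (n1, c1, v1) in enumerate(entries):
    for n2, c2, v2 in entries[i + 1:]:
        print(f"{n1}/{c1}={v1}  <->  {n2}/{c2}={v2}: {levenshtein(v1, v2)}")
```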

Thanks in advance

Marius

Hi Marius,

Sorry for being negative, but if you want to calculate all pairwise distances for 5 million rows and 200 columns, that is 10^9 strings, and comparing every string with every other one multiplies out to roughly 5 * 10^17 distances... This will take ages, independently of the method you use to do it.

However, if you really want to do this:

Use the Unpivoting node to rotate the multiple columns into one (the new table will have 200 * 5 million rows).

Then you can calculate the distance matrix on that single column and get what you wanted.
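
Outside of KNIME, the same two steps look roughly like this with pandas (only a sketch on the tiny test table from above, and I'm assuming the third-party python-Levenshtein package for the distance):

```python
import pandas as pd
import Levenshtein  # third-party python-Levenshtein package (assumption)

# Tiny stand-in for the test table from the first post.
df = pd.DataFrame({
    "number": [1, 2, 3],
    "typ1":   ["typ123", None, "Dsfg"],
    "typ2":   ["Haasd", "typ123", None],
    "typ3":   ["Dsfd", None, None],
})

# Step 1: unpivot the "typ" columns into a single column
# (this is what the Unpivoting node does in KNIME).
long = (df.melt(id_vars="number", value_vars=["typ1", "typ2", "typ3"],
                var_name="column", value_name="value")
          .dropna(subset=["value"]))

# Step 2: compute the full distance matrix on that one column.
# With 200 * 5 million rows this matrix is hopelessly large, which is
# exactly why I doubt it will be useful.
values = long["value"].tolist()
matrix = [[Levenshtein.distance(a, b) for b in values] for a in values]
print(values)
print(matrix)
```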

If you tell me what your goal is, I can try to find a better solution for you. I highly doubt that calculating this distance matrix will be useful.

Best, Iris 

Hi Iris,

thank you very much for your help. You're right, I cannot calculate the distances between all the entries. The problem is that there is no real pattern in the data, so preclustering is quite difficult, and algorithms like sorted-neighborhood are hardly applicable because I cannot define a key. Right now I'm trying to somehow cluster the entries into smaller groups so that I can calculate the distances afterwards on far fewer records. Another difficulty is that every column contains strings made up of characters and numbers, which makes this much harder than working with plain numbers (double or integer).
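
To show what I mean by preclustering, something like the following blocking sketch is what I have in mind (the blocking key is completely made up and would have to be adapted to the real data):

```python
from collections import defaultdict

values = ["typ123", "Haasd", "Dsfd", "Dsfg", "Rttkjfrgd", "Asdfasdf",
          "Klpj", "Ssgf", "ty34", "Badfd", "Ladf", "Lqsda",
          "Safdsa", "Lada", "typ123"]

def blocking_key(s):
    # Toy key: lower-cased first two characters plus a rough length bucket.
    # Only strings sharing a key would be compared pairwise later, which
    # cuts the number of Levenshtein calls drastically (but may miss matches).
    return (s[:2].lower(), len(s) // 3)

blocks = defaultdict(list)
for v in values:
    blocks[blocking_key(v)].append(v)

for key, members in blocks.items():
    if len(members) > 1:
        print(key, "->", members)  # only these small groups get compared
```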

Is there any good node for preclustering string data before using the Levenshtein method to get the distances?

Thank you

Marius

Here's a quick idea that came to my mind (without having put too much thought into it, and without knowing what your actual data really looks like):

  1. Index the entire data into a Lucene index once; to do this, simply concatenate your columns with a space separator into one long string and use character n-grams as preprocessing (experiment with different n-gram lengths)
  2. Iterate over your table and build a query for each row in the same manner as (1) to find roughly similar rows
  3. Apply whatever similarity measure you like to the most similar rows retrieved in (2) (a rough sketch follows below)

There are Lucene nodes for KNIME, though admittedly I've never used them so far.
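
If you want to prototype the idea outside of Lucene/KNIME first, roughly the same character-n-gram trick can be sketched with scikit-learn (my own substitute for steps 1 and 2, not the Lucene nodes themselves):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Step 1: one concatenated string per row (typ1 + " " + typ2 + " " + typ3),
# vectorized as character n-grams instead of words.
rows = ["typ123 Haasd Dsfd", "typ123", "Dsfg", "Rttkjfrgd Asdfasdf",
        "Klpj Ssgf ty34", "Safdsa Lada typ123"]
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vectorizer.fit_transform(rows)

# Step 2: for each row, retrieve the most similar rows as rough candidates.
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X)
distances, indices = nn.kneighbors(X)

# Step 3: apply an exact measure (e.g. Levenshtein) only to these candidate
# pairs instead of to all pairs of rows.
for i, idx_row in enumerate(indices):
    candidates = [rows[j] for j in idx_row if j != i]
    print(rows[i], "->", candidates)
```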

-- Philipp

Just another thought:
Use the Column Combiner to combine all essential columns and run a GroupBy on the combined column with a count.
If the count is > 1, you have a duplicate.
Left join the GroupBy table with the original table and you end up with a table without the duplicates.
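
In pandas terms (only to illustrate the same logic, not KNIME itself, with column names taken from the test table above), that would be roughly:

```python
import pandas as pd

df = pd.DataFrame({
    "number": [1, 2, 3, 4],
    "typ1":   ["typ123", "Abc", "typ123", "Abc"],
    "typ2":   ["Haasd", "X", "Haasd", "X"],
})

# "Column Combiner": concatenate the essential columns into one key.
df["combined"] = df["typ1"].fillna("") + "|" + df["typ2"].fillna("")

# "GroupBy" with a count: keys that occur more than once are exact duplicates.
counts = df.groupby("combined").size().rename("count")

# Join the counts back to the original table, then either inspect the
# duplicates (count > 1) or keep only one row per key.
df = df.join(counts, on="combined")
print(df[df["count"] > 1])             # the duplicate rows
print(df.drop_duplicates("combined"))  # the table without duplicates
```

Note that this only catches exact duplicates; for the fuzzy ones you still need a distance measure as discussed above.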

Thank you very much for all your answers!