right now I'm trying to find duplicates in a large table (> 5 million rows and 200 columns). But I have one problem, and to illustrate it I created this test table:
| number | typ1      | typ2     | typ3   |
|--------|-----------|----------|--------|
| 1      | typ123    | Haasd    | Dsfd   |
| 2      | typ123    |          |        |
| 3      | Dsfg      |          |        |
| 4      | Rttkjfrgd | Asdfasdf |        |
| 5      |           |          |        |
| 6      | Klpj      | Ssgf     | ty34   |
| 7      | Badfd     |          |        |
| 8      | Ladf      |          |        |
| 9      | Lqsda     |          |        |
| 10     | Safdsa    | Lada     | typ123 |
My problem is that I don't know how to compare one entry (e.g. number 1 / typ1 = typ123) with all the other entries in column "typ1", but also in "typ2" and "typ3", at the same time. And the same comparison should be done for every entry in each of the three "typ" columns. As a result I would like to get the distance (e.g. Levenshtein) for every comparison. I would be very happy if someone has a hint on how to handle this problem.
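To make the intended comparison concrete, here is a minimal Python sketch (outside KNIME): one entry is compared against every value in all three "typ" columns, using a plain dynamic-programming Levenshtein distance. The sample values are taken from the test table above; the dictionary layout is my own assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# A few values per column, taken from the test table above.
columns = {
    "typ1": ["typ123", "typ123", "Dsfg"],
    "typ2": ["Haasd", "Asdfasdf"],
    "typ3": ["Dsfd", "ty34", "typ123"],
}

query = "typ123"  # entry number 1 / typ1
for col, values in columns.items():
    for v in values:
        print(col, v, levenshtein(query, v))
```

The same double loop, repeated for every entry as the query, is exactly what blows up quadratically on the full table, as the reply below points out.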
Sorry for being negative, but if you want to calculate all pairwise distances over 5 million rows and 200 columns, that multiplies out to roughly 5 * 10^17 distances (10^9 cells, each compared with every other)... This will take ages, independent of the method you use.
However if you want to do this:
Use the Unpivoting node to rotate the multiple columns into one (the new table will have 200 * 5 million rows)
Then you can calculate the distance matrix and get what you wanted.
If you tell me what your goal is, I can try to find you a better solution. I highly doubt that calculating this distance matrix will be useful.
thank you very much for your help. And you're right, I cannot calculate the distance between all the entries. The problem is that there is no real pattern in the data, so pre-clustering is quite difficult, and algorithms like sorted-neighborhood are hardly applicable because I cannot define a key. Right now I'm trying to cluster the entries into smaller groups so that I can calculate the distances afterwards on fewer data sets. Another difficulty is that every column contains strings mixing characters and numbers, which makes this much harder than working with plain numbers (double or integer).
is there any good node for pre-clustering string data before using the Levenshtein method to get the distances?
Here's a quick idea that came to mind (without having put too much thought into it, and without knowing what your actual data really looks like):
1. Index the entire data into a Lucene index once: simply concatenate your columns, separated by spaces, into one long string, and use character n-grams as preprocessing (experiment with different n-gram lengths)
2. Iterate over your table and build a query in the same manner as (1) to find roughly similar rows
3. Apply whatever similarity measure you like to the most similar rows retrieved in (2)
There are Lucene nodes for KNIME, though admittedly I've never used them so far.
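Without Lucene at hand, the blocking idea in steps (1)-(2) can be sketched in plain Python with a simple inverted index over character n-grams; the row texts below are concatenated columns made up from the test table:

```python
from collections import defaultdict

def ngrams(text: str, n: int = 3) -> set:
    """Lower-cased character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Each row: its columns concatenated with a space separator (step 1).
rows = {
    1: "typ123 Haasd Dsfd",
    2: "typ123",
    10: "Safdsa Lada typ123",
}

# Index every row id under each of its n-grams.
index = defaultdict(set)
for row_id, text in rows.items():
    for g in ngrams(text):
        index[g].add(row_id)

# Step 2: for one row, collect candidate rows sharing at least one n-gram.
def candidates(row_id: int) -> set:
    found = set()
    for g in ngrams(rows[row_id]):
        found |= index[g]
    found.discard(row_id)
    return found

print(candidates(2))
```

Only the candidate pairs found this way would then go into the expensive distance calculation (step 3), which is what makes the quadratic blow-up avoidable.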
just another thought..
use the Column Combiner to combine all essential columns and run a GroupBy on the combined column with a count..
if count > 1 you have a duplicate..
left join the GroupBy table (one row per combined key) with the original table and you have a table without duplicates..
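The combine-and-count idea above can be sketched in plain Python (in KNIME this would be the Column Combiner and GroupBy nodes; the "|" separator is my own assumption):

```python
from collections import Counter

# Rows reduced to their essential columns; row 2 repeats row 1 on purpose.
rows = [
    ("typ123", "Haasd", "Dsfd"),
    ("typ123", "Haasd", "Dsfd"),
    ("Dsfg", "", ""),
]

# Combine the essential columns into one key per row (Column Combiner).
keys = ["|".join(r) for r in rows]

# Group by the key and count (GroupBy): count > 1 means a duplicate.
counts = Counter(keys)
duplicates = [k for k, c in counts.items() if c > 1]
print(duplicates)
```

Note this only catches exact duplicates; near-duplicates with typos would still need a distance-based approach like the n-gram blocking above.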