Finding similarities in a huge table

knimerin · June 4, 2019, 8:47am

Hi. Here is the situation: we have repair centers for electronic products that record the defect for each device. Each device has a unique serial number. The repair centers often find more than one defect per device, and each defect has its own entry in the database.
Given for example this table:
Serial Number    Defect Part
1                A
1                B
1                C
2                A
3                A
3                D
4                B
4                C

I now want to find similarities between the serial numbers. In the table above, one similarity is between serial number 1, which has three defects (A, B and C), and serial number 4 (B and C). I would need to mark those serial numbers with a unique similarity name. Is this something that can be done in KNIME?

ana_ved · June 6, 2019, 9:04am

Hi knimerin!

Could you tell me more about how you define similarities? If each individual defect counts as one similarity between serial numbers, then you could use the GroupBy node to group by defect and aggregate the serial numbers as a list. You can then use a Rule Engine node to assign the similarity names to the rows.
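For anyone who wants to see the grouping idea outside KNIME, here is a rough sketch in plain Python. It is not the GroupBy or Rule Engine node, only the same logic; the function name and the `shares_<defect>` naming scheme are made up for illustration.

```python
from collections import defaultdict

# Example rows from the question: (serial number, defect part)
rows = [(1, "A"), (1, "B"), (1, "C"), (2, "A"),
        (3, "A"), (3, "D"), (4, "B"), (4, "C")]

def serials_by_defect(rows):
    """Group by defect and collect the serial numbers as a list."""
    groups = defaultdict(list)
    for serial, defect in rows:
        groups[defect].append(serial)
    return dict(groups)

groups = serials_by_defect(rows)

# Hypothetical naming rule: one similarity name per defect
similarity_names = {defect: f"shares_{defect}" for defect in groups}

for defect, serials in groups.items():
    print(similarity_names[defect], serials)
```

With this definition, a serial number with several defects ends up in several groups, so it only fits if one shared defect is enough to count as a similarity.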

Let me know if this helps.

Cheers,
Ana

giovannicianchetta · June 6, 2019, 11:35am

What I would do is transform your sequential table into a fingerprint or bit array. You can either pivot your data, using the serial number as the group column, the defect part as the pivot, and a constant value of one as the aggregation, or use the One to Many node followed by GroupBy with the serial number as the group.
The output of both methods can be used to create fingerprints that capture which defects each serial number had. You can then analyze similarity with the Similarity Search node, using Tanimoto as the method. On a small table you could also use the distance matrix, but it doesn’t scale well for large (or huge) tables.
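To make the fingerprint idea concrete, here is a small plain-Python sketch of the same steps: one bit per defect part, then the Tanimoto coefficient (shared bits divided by bits set in either fingerprint). The function names are my own and it does not reproduce the KNIME nodes themselves.

```python
# Example rows from the question: (serial number, defect part)
rows = [(1, "A"), (1, "B"), (1, "C"), (2, "A"),
        (3, "A"), (3, "D"), (4, "B"), (4, "C")]

def build_fingerprints(rows):
    """Pivot the long table into one 0/1 list per serial number."""
    defects = sorted({defect for _, defect in rows})
    position = {defect: i for i, defect in enumerate(defects)}
    fingerprints = {}
    for serial, defect in rows:
        bits = fingerprints.setdefault(serial, [0] * len(defects))
        bits[position[defect]] = 1
    return defects, fingerprints

def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length 0/1 lists."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0

def rank_by_similarity(query, fingerprints):
    """Other serial numbers, most similar to `query` first."""
    scores = [(other, tanimoto(fingerprints[query], bits))
              for other, bits in fingerprints.items() if other != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

defects, fingerprints = build_fingerprints(rows)
ranked = rank_by_similarity(1, fingerprints)
print(defects, fingerprints[1], ranked)
```

For serial number 1 the closest match is serial number 4, because they share B and C out of the three parts A, B and C that appear in either one.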
I hope that this helps

knimerin · June 6, 2019, 2:28pm

Thanks for your suggestions. Indeed, one of our data scientists advised me to transform it to a bit array and use the bit array distance node followed by a DBSCAN node. That works pretty well.
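In case it helps someone reading later, the idea can be sketched roughly like this in plain Python. This is not the KNIME workflow: it is a minimal DBSCAN written by hand, the distance is assumed to be 1 minus Tanimoto on the defect sets, and the `eps` and `min_pts` values are arbitrary choices for the toy table.

```python
from collections import defaultdict, deque

# Example rows from the question: (serial number, defect part)
rows = [(1, "A"), (1, "B"), (1, "C"), (2, "A"),
        (3, "A"), (3, "D"), (4, "B"), (4, "C")]

def defect_sets(rows):
    """Collect the set of defect parts for each serial number."""
    sets = defaultdict(set)
    for serial, defect in rows:
        sets[serial].add(defect)
    return dict(sets)

def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b|; same as Tanimoto distance on bit arrays."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def dbscan(items, eps, min_pts):
    """Minimal DBSCAN. Returns {key: cluster id}, with -1 meaning noise."""
    keys = list(items)

    def neighbours(key):
        return [other for other in keys
                if tanimoto_distance(items[key], items[other]) <= eps]

    labels = {}
    cluster = 0
    for key in keys:
        if key in labels:
            continue
        seeds = neighbours(key)
        if len(seeds) < min_pts:
            labels[key] = -1  # noise for now, may become a border point
            continue
        labels[key] = cluster
        queue = deque(seeds)
        while queue:
            other = queue.popleft()
            if labels.get(other) == -1:
                labels[other] = cluster  # noise becomes a border point
            if other in labels:
                continue
            labels[other] = cluster
            reachable = neighbours(other)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels

# eps and min_pts are assumptions picked for this tiny example
labels = dbscan(defect_sets(rows), eps=0.5, min_pts=2)
# Hypothetical similarity names derived from the cluster ids
names = {serial: f"group_{label}" for serial, label in labels.items()}
print(names)
```

Note that the neighbour search here compares every pair, so it has the same scaling problem as a full distance matrix on a really huge table.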

Thanks for your help.
