Rule Based Filter Question

The idea of fingerprinting in data is that if you do not have a unique ID you create your own from various features of your data.

A classic example: you have customers but no customer ID. You would take

  • name, surname
  • address
  • area code
  • phone number

and combine them. You could try several variations, such as using only the first 5 characters of a name and ‘cleaning’ it by removing special characters, or even using a phonetic extraction to counter different spellings. From an address you could extract the numbers and drop special extensions. For example, you have (a German address)

Hauptsraße 11a
Hauptstr. 11A
Haupt Strasse 11
Hauptstrasse 11/AB
Haupt-strasse 11 - á

=> you could shorten that to:
haupt 11

since this would be the ‘major’ component of the entry, the part that distinguishes it from everything else. The next component is the area code


=> just use 503, since you would expect most people to get the first part right. The same goes for dates of birth

Nov 1990 7th

=> maybe just use 1990-11 without the day, or just the month if that brings enough distinction

haupt 11-503-1990-11

might be pretty unique (depending on your data) and you have a good chance of capturing typos and various spellings. If your data is relatively clean you could just throw several items together like in the example and use them. You have to do a few tests and decide which approach you like.
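If you want to try the idea outside KNIME first, here is a rough Python sketch of the steps above. It is only an illustration, not a KNIME node or my workflow: all function names are made up, the 5-letter cut for the street is my assumption (the text only suggests it for names), the area code input "503-21" is a hypothetical value, and the date helper only understands English month names.

```python
import re
import unicodedata

# English month abbreviations -> month number (assumption: English input only)
MONTHS = {abbr: num for num, abbr in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split(), start=1)}


def to_ascii(text: str) -> str:
    """Lowercase, expand ß to ss, and strip accents (á -> a)."""
    text = text.lower().replace("ß", "ss")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_key(name: str, length: int = 5) -> str:
    """First `length` letters of a name, special characters removed."""
    return re.sub(r"[^a-z]", "", to_ascii(name))[:length]


def address_key(address: str, length: int = 5) -> str:
    """First `length` letters of the street plus the first number found."""
    cleaned = to_ascii(address)
    letters = re.sub(r"[^a-z]", "", cleaned)[:length]
    number = re.search(r"\d+", cleaned)
    return f"{letters} {number.group() if number else ''}".strip()


def area_key(area_code: str, length: int = 3) -> str:
    """Keep only the first `length` digits of the area code."""
    return re.sub(r"\D", "", area_code)[:length]


def birth_key(text: str) -> str:
    """Reduce a free-text date of birth to YYYY-MM (empty if not found)."""
    year = re.search(r"\b\d{4}\b", text)
    words = re.findall(r"[a-z]+", text.lower())
    month = next((MONTHS[w[:3]] for w in words if w[:3] in MONTHS), None)
    if year is None or month is None:
        return ""
    return f"{year.group()}-{month:02d}"


def fingerprint(address: str, area_code: str, birth: str) -> str:
    """Join the reduced components into one key."""
    return "-".join([address_key(address), area_key(area_code), birth_key(birth)])


variants = [
    "Hauptsraße 11a",
    "Hauptstr. 11A",
    "Haupt Strasse 11",
    "Hauptstrasse 11/AB",
    "Haupt-strasse 11 - á",
]
print({address_key(v) for v in variants})   # {'haupt 11'}
print(fingerprint("Hauptstr. 11A", "503-21", "Nov 1990 7th"))  # haupt 11-503-1990-11
```

All five spellings collapse to the same key here. How aggressive the cuts should be (5 letters, 3 digits, year and month) is exactly the part you would have to test against your own data.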

There are actually several KNIME blog entries about this, and I also created a small workflow to compare strings. KNIME provides you with several nodes to help you with this task.

(the title is misleading since the question about addresses was added to another topic but the links there should work)
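On comparing strings: to get a feeling for what such a comparison does, you can try it in a few lines of Python. This is just a sketch using difflib from the standard library; it does not reproduce the workflow or any KNIME node, and the `similarity` helper and the sample street names are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Ratio between 0.0 (nothing in common) and 1.0 (identical), ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(similarity("Hauptstrasse", "Hauptstr"))    # 0.8
print(similarity("Hauptstrasse", "Bahnhofweg"))  # much lower
```

A score like this could be used next to a fingerprint: the fingerprint groups the candidates, and the similarity helps decide whether two entries in a group are really the same.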