Hi there. I'm a newbie to KNIME and am loving the product so far!
We have a table which contains information about individuals (name, address, DOB) and specific unique identifiers (driver's license number, birth certificate, etc.). We commonly get duplicates where two identities are very similar, e.g. the last name might have a typo, the driver's license number changes (when it gets renewed), the DOB changes, etc., and we have to merge these.
I need some help with the best method and nodes to use to compare rows and identify potential matches to duplicate identities across columns which vary slightly. For example, the last name might differ by a letter but the DOB will be the same. Conversely, the DOB might change but the driver's license number will be the same. Worth noting that there is rarely a constant to match against, as most rows vary slightly. I've tried combinations of GroupBy, filters, etc.
Any help would be much appreciated.
Many possibilities. Choice probably depends on efficiency and the application at hand.
One possibility is to combine every column into a single string column and use a string distance function to measure distance. With Similarity Matcher you can even determine the nearest neighbor(s) of every observation.
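Outside of KNIME, the idea behind this combined-string approach can be sketched in plain Python. The record fields and the `levenshtein` helper below are illustrative, not KNIME API:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Two records whose columns are concatenated into one comparison string.
rec1 = " ".join(["Smith", "1980-01-01", "D1234567"])
rec2 = " ".join(["Smyth", "1980-01-01", "D1234567"])  # typo in last name

print(levenshtein(rec1, rec2))  # prints 1 -- a small distance, likely the same person
```

A small distance between the concatenated strings flags the pair as a candidate match, which is essentially what Similarity Matcher does for you across the whole table.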
You could also use the RowID node for this purpose. It identifies exact matches on single columns: make a column the row ID with the uniqueness option enabled, so duplicate values get a suffix appended. Then use RowID again to extract the row ID back into a column, and use String Manipulation's regex functions to flag the suffixed (i.e. duplicate) rows. Then proceed with the next column, and so on.
Building good old lookup tables is probably THE way to identify duplicate observations. Any candidate approach should first be measured against this baseline. Why use expensive clustering or supervised learning if a table lookup already does the job...
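As a toy illustration of the lookup-table idea (pure Python; the row IDs and field values are invented for the example): index the table on a column that should be unique, and any key that collects more than one row is a duplicate candidate.

```python
from collections import defaultdict

rows = [
    ("r1", "Smith", "1980-01-01", "D1234567"),
    ("r2", "Smyth", "1980-01-01", "D1234567"),  # same license, name typo
    ("r3", "Jones", "1975-06-30", "D7654321"),
]

# Lookup table keyed on the driver's-license column.
by_license = defaultdict(list)
for row_id, name, dob, license_no in rows:
    by_license[license_no].append(row_id)

# Keys that collected more than one row are duplicate candidates.
duplicates = {k: v for k, v in by_license.items() if len(v) > 1}
print(duplicates)  # {'D1234567': ['r1', 'r2']}
```

This is cheap and exact; its limitation is precisely the poster's problem, namely that the "unique" column itself may change between rows, which is where the distance-based approaches come in.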
Another possibility is to use separate distance functions (depending on data type) and to combine them into a single aggregated distance measure thereafter. Then again use Similarity Matcher ...
If you don't know how to weight the individual distance functions, you could (with a colleague's help) build a training data set where you label duplicates (yes / no) by hand. Done for several hundred or thousand pairs, this training set can be used to build a supervised learning model that identifies duplicates. Use the distance functions' output, or boolean features (calculated via lookup tables), as input for the model.
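To make the supervised idea concrete, here is a minimal pure-Python sketch: a tiny logistic-regression classifier trained on hand-labeled pairs, with per-column distances as features. The feature values and labels below are invented for illustration; in practice you would use KNIME's learner nodes (or scikit-learn) on your own labeled data:

```python
import math

# Hand-labeled pairs: (name_dist, dob_dist, id_dist) -> is duplicate? (1 / 0)
data = [
    ((0.1, 0.0, 0.0), 1), ((0.2, 0.0, 0.0), 1), ((0.0, 1.0, 0.0), 1),
    ((0.9, 1.0, 1.0), 0), ((0.8, 0.0, 1.0), 0), ((0.7, 1.0, 1.0), 0),
]

def predict(w, b, x):
    """Logistic model: probability that the pair is a duplicate."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Tiny stochastic gradient descent; enough for a toy data set.
w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(2000):
    for x, y in data:
        err = predict(w, b, x) - y
        b -= 0.5 * err
        w = [wi - 0.5 * err * xi for wi, xi in zip(w, x)]

# Low distances should score as likely duplicates, high distances as distinct.
print(predict(w, b, (0.1, 0.0, 0.0)) > 0.5)  # True
print(predict(w, b, (0.9, 1.0, 1.0)) > 0.5)  # False
```

The model effectively learns the weights you would otherwise have to guess when aggregating the distances by hand.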