Removing Entries based on Sets

Hard to explain this and can not easily share any data due to personal content but will do my best here.

I have two tables of data - one is a list of duplicate records identified by a number. These are small groups of records that have been identified by users.

DuplicateGroup    ObjectReference
Group1                   34
Group1                   44
Group1                   65
Group2                  100
Group2                  433
Group3                  34 
Group3                  433 

I have a second table that contains groups of two records that have been checked and identified as not being the same. Object References can appear in here multiple times

KnownDifRef            ObjectReference
KTBD1                      34
KTBD1                      65
KTBD2                      34
KTBD2                      433

What I need to do is remove from the first list all the entries where the Object Reference has been identified as being different from the whole group. For example Group3 should be marked up for exclusion as both entries are in KTBD2. Group one should still show as while 34 is different to 65 they both need to be compared to 44. I can deal with the Group3, when there are just two values, but what I am struggling with is if the following are added to the second table and there are multiple checks:

KnownDifRef            ObjectReference
KTBD3                      34
KTBD3                      44
KTBD4                      65
KTBD4                      44

Now the added complication - my data has already reached 1.4 million potential duplicates and 60,000 Know Difference pairs. I know KNIME can handle this sort of question but I am just now sure how, and more importantly how to do if efficiently since the numbers are only going to go up, at least for a while.

I know that is probably not very clear but anything anyone can throw up would be really helpful - at the moment I am saying that there are more duplicates than there are which makes people discount any reporting, rather than accepting that they have the issue with the 1.4 million records!

Hi @StevenFrancis -

What is the expected result when you use the multiple checks? Is it that you should have Groups 1 and 3 marked for exclusion, while Group 2 is kept?

If so, then you can use the Joiner node with Left Outer Join mode, where the left table contains your groups and the right table your KTBDs. The resulting matches will be what you exclude, while the group associated with missing values you keep.

…but I may not be understanding your problem correctly.

Another thing to try might be to convert your ObjectReference values to sets using the GroupBy node (with Set aggregation), and then using the Subset Matcher node to do a comparison.

I had started down that path actually but got confused - my title kind of hinted at that. My thinking was that I wanted to find all Duplicate Groups where the set of the Object References was covered by sets of Known Difference groups. However it is all the sub groups/couplets that need to match, and I then need to mark the group up with the last date as we have users using this to just clear out lists rather than managing the records well (in my opinion)

Just for your first post - yes the result should be that Groups 1 and 3 would be marked up as excluded. I can then count these along side the total duplicate groups, the number of possible duplicate records and also the number of unique duplicate records since they can appear in multiple match groups.

OK, it sounds like the Joiner node should work then. The next question, I suppose, is how well it will scale with millions of records.

Is your data stored in a database, or are you reading flat file extracts into KNIME? Depending on how the data is stored, you might be able to use the DB Joiner node to perform the same function, but push the processing onto your database server, instead of bogging down your local machine.

The data is originally sourced from a database but this is a midway step along the process to create stats and files so unfortunate I am doing this on my laptop. So far it is managing and time is to run is not to much of an issue as one of the queries pulls back data from a very slow table so I need it to just run.

I have managed to solve this with an approach that works and lets me mark up not just the groups but actually the individual records that are in a group and known to be different to all others, even if those others have yet to be reviewed yet. This means I can count these alongside to indicate how many groups may need to be revaluated based on coming up as potential duplicates again.

My approach was to turn both lists into sets of two column “couplets” and joining on both columns - if the count of matches on my match groups is equal to the number of possible duplicates -1 (not counting itself) then that record has been checked against all other records and can be excluded from the general stats that most people are bothered about. If this number is lower then there are others in its duplicate group that it needs to be cheeked against.

I need to do a full rerun of the data and compare against the live system tomorrow but just looking at a few examples from my data seems to support that this works. Only slight issue is that it creates a working data set of over 14 million lines at one stage but it runs and no memory issues so for now I can run this and hopefully get people to focus on working on removing the duplicates rather than arguing over the numbers being 3 higher then they think it should be…

Thanks for the help - it has defiantly been useful and got me thinking a little bit more about sets for future use.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.