URGENT: Duplicate rows after full outer join

rutgerverhaar · June 23, 2021, 1:33pm

Hi Everyone,

I have a workflow in which I compare two tables.

I transformed the tables in such a way that the column names, data types etc match between the two and that I can compare the tables row-by-row using the joiner and setting the joiner to full outer join.

I join the tables on Unique ID and Column Name and set the column selection to include all columns in both tables except for rowIDs and do not remove any of the joining columns.

The issue I’m facing is that after the join, a number of rows have been duplicated. I have attached a screenshot of the output.

Does anyone know what is causing this issue and how I can solve it? I’ve been stuck for a while and really need some help…

Thanks in advance,

Kind regards,

Rutger

bruno29a · June 23, 2021, 1:59pm

Hi @rutgerverhaar , that’s what full outer join is likely to give you.

I am not sure what you are trying to compare, but if you need to stay with full outer join, you can de-duplicate the results using the Duplicate Row Filter – KNIME Hub node

rutgerverhaar · June 23, 2021, 2:11pm

Hi @bruno29a,

Initially I read in an excel file that looks like this for both sources

.

Then I use the unpivoting node to transpose the table that enables row-by-row comparison of all values in the initial table. The output then looks like this for both sources:

.

Then I join using full outer and compare both tranposed tables to figure out where rows are exact matches and where there are differences between the tables.

Would you know of another way on comparing two tables that would not cause duplication of rows? Do you know why I’m getting the duplicate rows? What is the logic behind the duplication?

Kind regards,

Rutger

bruno29a · June 23, 2021, 2:24pm

Hi @rutgerverhaar , here’s a brief explanation of joins I put together:
Inner join will give you the records that have the same values in the joined columns between the 2 tables.
Left join will give you all the records from your left table + what Inner join will give you
Right join will give you all the records from your right table + what Inner join will give you
Full outer join will give you all records from both tables

From what you are saying, the 2 tables you are comparing are more or less the same, and full outer join giving you records from both tables, chances are you are going to get a lot of duplicates.

So, that’s the logic behind it.

gonhaddock · June 23, 2021, 3:01pm

hello @rutgerverhaar
@bruno29a explained it right. I don’t see the point of unpivot for comparing purpose, you can just work with the UIDs

This workflow illustrates how to compare what changed between two datasets. I created it for another post, but now it is updated with the ‘Full Outer’ Join and ‘Inner’ Join

KNIME_table_compare.knwf (26.2 KB)

Regards,

rutgerverhaar · June 23, 2021, 4:23pm

Hi @gonhaddock,

The reason for doing the pivot is that it would allow me to quickly spot the differences between the columns. If I would do the comparison without the unpivoting node I would end-up with a huge table which would only show me that there is a match or mismatch.

If there would be a mismatch, I would have to dive into the output of the joiner to manually figure out where the difference is (I could potentially create a rule engine that would do that for me, but then I’d have to write a logic statement that would compare the column (each file could have up to 50 columns)) in each of the sources.

Kind regards,

Rutger

takbb · June 24, 2021, 8:27am

Building on what has already been said about joins, and hopefully clarifying… you should only see duplicates in your data if the join is not based on totally unique keys in both of the tables being joined. That should hold irrespective of the type of join used.

If my (left) table contains

and my (right) table contains

With an outer join, the rows returned are:

This output comprises all that are ONLY in left, plus all that are ONLY in right, plus all joined rows that are in BOTH)

It follows therefore that provided the join finds UNIQUE rows in both tables, then the total number of rows returned can never be more than the sum of the total rows in both tables (which is the limit for when no rows match). The lower bound for the returned rows is if the set of all the keys from one table is a subset of the keys from the other table, When this occurs, the the total rows returned will never be more than the size of the larger table.

So for an outer join, the range of counts of possible rows returned, assuming unique keys in both tables is

count of rows in larger table <= rows returned <= sum of count of rows from both tables

The only time you should get “duplicates”, and therefore exceed the sum of rows from both tables is if you are using non-unique keys from one or other of the tables in which case, you will get “cross joining” of those specific data items

If my left table were

and my right table were:

then the full outer join will now see two rows containing B (the cross join for B being 2 rows x 1 row = 2 rows) and two containing A

If B appeared twice in both tables, I would expect to see 4 rows containing B (the cross join for “B” being 2 rows x 2 rows = 4 rows)

So if in the above example my right table were:

This would now result in the following:

So, in summary, if you are getting duplication out of a full outer join, this must be because you are using join criteria that is non-unique on either one or both of your input tables.

system · July 1, 2021, 8:27am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.