Table Difference Finder gives huge and rather useless differences

Hi

Maybe I am at the wrong place and maybe it is a none-issue because there is this option I am going to propose. Please bear with me.

I recently used the Table Difference Finder. While I feel the spec works fine, I terribly miss the option to give the value comparison the possibility (mandatory?) to take into account key fields preferably by variable.

For instance, if I compare the content of following two tables (csv representations), KDiff3 gives

and the Table Difference Finder value output (partial) is

This huge and to my eyes useless difference representation is because the difference finding algorithm does not take into account any key columns such that after the first additinal/missing line one gets but differences everywhere.

Did I miss something?

Kind regards

Thiemo

Hi Thiemo,

You can use the table difference node for it.

image

You set what you can wanna see for it.

You insert 2 tables and bring 2 outputs.

Maybe itā€™s your caseā€¦ hope it helped you.

Seeya,

Denis

Hi @Thiemo.Kellner
If the hurdle of using Table Difference Finder, it is the unpaired rows of your tables. You can rearrange the right table (your IDs are matching) , with a left outer join of the ID column, just before sending the table to the node.

Afterwards you can blank easily the differences from missing rows with rule engine.

I hope this make sense.
BR

Hi Denis

Thanks for the hint. It was exactly what I was doing.

Kind regards

Thiemo

Hi @gonhaddock

Thanks for the hint. I shall try. I feel though, the node should provide this PK thing out of the box.

Kind regards

Thiemo

Hey @gonhaddock , your suggestion will not exactly do what the difference finder or any diff applications (such as KDiff3 that @Thiemo.Kellner used) do.

Using left (or right) join will only compare the matching keys (in your case you are mentioning IDs), and you can of course identify which records with these IDs are missing from the left table (or the right table if you do a right join).

A diff will actually compare the whole record.

For example, if we have this dataset:
Table A:

+----+------+
| ID | name |
+----+------+
| 1  | a    |
| 2  | b    |
+----+------+

Table B:

+----+------+
| ID | name |
+----+------+
| 1  | a    |
| 2  | c    |
+----+------+

With your approach, you will not get any ā€œmissing rowsā€ as both IDs 1 and 2 exist in both tables. But a diff will tell you that the row with ID 2 is different in both tables.

I know some other users might reply to this saying ā€œwe can also compare the name fieldā€, well that is find if thereā€™s only 1 additional field other than the ID field. But if you have 5, 6, 10, 20 fields?

In the end, this could still work for @Thiemo.Kellner if heā€™s looking to identify which records are missing from the tables (he might have been using the KDiff3 for that purpose as the sample he showed were cases of missing records, rather than records being different)

2 Likes

Hi @bruno29a

As far as I have understood @gonhaddock correctly, the joining is only to put ghost/dummy records of the missing keys into the original data set such that the diff node afterwards can compare the data sets without the shift mess.

It is quite an effort to make. One has to mark those ghost records to distiguish them from records genuinly empty but the key. Maybe not a case often found, but one one rather should not ignore.

2 Likes

Hi @Thiemo.Kellner , indeed, I thought about that right after I shut down my computer on that night, but I figured Iā€™ll let @gonhaddock reply and confirm :slight_smile:

But you are absolutely correct that this would be a workaround to ā€œfixā€ the issue of the missing records when doing the difference finder that you raised.

Ideally, it should be fixed in the node itself. +1 for me for this feature request.

2 Likes

Hello @bruno29a and thanks for bringing me back to the topic.
Maybe I didnā€™t realized on #feature-reques tag at that time, so I just double fall into the fact that some previous work ā€˜can be doneā€™ (i said), ā€˜is requiredā€™ @Thiemo.Kellner saidā€¦

Just prepared an example for this post showing that: without some data preparation work, the result is just hard to use. In the example, the compare table has just two differences: 1 content + 1 missed row.


20230317_table_difference_data_preparation_v0.knwf (95.3 KB)

Without ID arrangement:

  • upper ā€˜Table Difference Finderā€™, without fill in gaps and pair IDs returns an useless output, messing the IDs.
  • lower ā€˜Table Difference Finderā€™ in the other hand is returning a row for each difference, useful output.

Then, decidedly and as @bruno29a said and about select a field/column to track differences in this nodeā€¦

BR

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.