Joiner not working

Hi,

I am new to KNIME and I am using it for the first time. I used KNIME for comparing two data sheets of excel sheets. One sheet was a dump from the old software database and the other is the dump from the new software database. So simple is we taken dump from the old software in excel. Done manual mapping and then uploaded this database in new software. Now again took one dump from new software in excel. So now we are coming to this two excel for data accuracy in data migration.

So here I used excel reader to upload both excel sheets in KNIME and created two nodes for this, then used joiner to compare data for 6 to 7 columns so it can give data matching and any error it will not compare so can find our data migration accuracy between two data.

Here I found some issue is when joiner reaches at 99% it got freezes for a long time. Data size in each excel sheet would be 6 to 6.5 lacs (650000) rows. I have waited for around 2 hours but the joiner is not going further from 99%. Here I would like to add that previously i had used the same workflow for two excel sheets which was having some 10000-row data and that worked effectively. But this time joiner does not move further from 99%.

Please help me to resolve the issue. Also please suggest any better way of doing data comparison for data accuracy and quality of migration.

You can try

2 Likes

Hi @patelbhavin , I am not sure why the joiner is slow as I can’t see what you are doing. Also, joiner would not necessarily be efficient, since for checking for data accuracy, you would need to join ALL columns.

The Table Difference Finder node proposed by @izaychik63 is definitely the way to go here. Just make sure that both tables are sorted the same way and that the columns also in the same order.

For example, consider these 2 tables:
Original Dump:
image

New Dump:
image

The data from the New Dump is the same as from the Original Dump, except that the data is not sorted the same way.

The Table Difference Finder will return this as a difference:
image

So, always make sure that the data is sorted the same way on both tables before you use the Table Difference Finder:
image

And no difference detected:
image

If you sort by unique keys, it’s enough.

Also, as you can see, the results for the Table Differences can be quite big. I got 8 rows just to show the difference between 2 rows - It’s basically 2 x the number of columns, so if you have 10 columns, that’s 20 rows, etc…

If you don’t care to see the differences, and only want to know if the tables are different, you an use the Table Difference Checker instead:

Here’s a workflow that’s using both:
Comparing 2 tables.knwf (12.9 KB)

You can play with the values from the New Dump table to see how the Table Difference nodes react.

1 Like

Thanks for your reply

@bruno29a @izaychik63 Thanks for your reply.

As per our requirement, we have to put logic like this

we have two excel sheet dumps from old data and new data…

Both sheets don’t have an equal number of rows in the table sheet. one excel file sheet has 5 to 6 lac data and the other has some 100000 more…

So what we thought of logic is to that if say five columns which has same data that match, like col. A, B, C, D, E of file A will be matched with col. A, B, C, D, E of file B, now if the data of this column got match then data is match and data is correct. and ones which not match will check manually what problem and clear the data.

Here we don’t have data rows same, data in one file is bigger than others so it has happened that sorting data may not give the same data in both files, also please help me to understand better so use the above logic in the table difference checker…

Hello @patelbhavin,

What KNIME version are you using? How many memory do you have assigned to KNIME? You can check here where to find it and how to change (increase) it for better performance.

Regarding your task Table Difference Checker can’t be used in your case but it’s hard to suggest proper approach as requirement seems not entirely clear. Note that although this might be tricky in your case but from experience ideas/suggestions/solutions come faster and are more complete with input data (dummy works just fine) and desired output provided including logic behind it if not obvious. Additionally comparing tables is pretty frequent task and there have been many topics around it so I suggest to give Forum search a try as well. Maybe you’ll find answer that way :wink:

And last but not least welcome to KNIME Community!

Br,
Ivan

Dear @ipazin, Thanks for your reply, I am using KNIME 4.4.0, Also checked with updates it is updated till today. The memory assigned is -Xmx8067m around 8 GB.

So I have small dummy data which is the same as the huge data we have. export_file_1_knime and export_file_2_knime two excel files. What I am trying is simple logic to compare two columns as per below.

Col. A (export_file_1_knime) to Col. A (export_file_2_knime)
Col. B (export_file_1_knime) to Col. C (export_file_2_knime)
Col. D (export_file_1_knime) to Col. I (export_file_2_knime)
Col. C (export_file_1_knime) to Col. K (export_file_2_knime)
Col. H (export_file_1_knime) to Col. J (export_file_2_knime)

So if these columns get matched then the data of that particular row is correct.
export_file_1_knime.xlsx (94.1 KB)
export_file_2_knime.xlsx (121.3 KB)

Hi @patelbhavin,

my solution based on your example i would create a final table with the following output.

Example2

The workflow looks like

In addition to your defined keys i added a further key: Col. E (File1) to Col. B (File2).
And based on your system settings it could be that you have to modify some nodes (e.g. Date transformation).

This is probably not the best solution, but should do what you expect. For very large datasets i recommend the use of a loop (e.g. a collection of Test IDs) to increase the performance.

Example.knwf (484.4 KB)

Hopefully this example helps you or at least give you an idea how to solve it.

BR

1 Like

Hi @morpheus hanks for your support… Sorry for late reply but was away form work for some days… i would surly love to explore your solution and would update you for same.

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.