How to compare 2 csv files

TotalDataLoss · April 21, 2021, 2:34pm

Hi there,

i am trying to compare 2 csvs if they are identical till a certain point of time. The file looks like:

14.09.2011 07:00:00,5.58,5.58,5.58,5.58,12549
14.09.2011 07:34:43,5.54,5.54,5.54,5.54,3418
14.09.2011 07:55:46,5.62,5.62,5.62,5.62,459

the first column is the timestamp of the data. The timestamp is unique in the table.and it is sorted in an ascending order.

The file is created by a knime workflow, that runs periodically, and creates a new file. The new file contains all the data that the previous (aka old) file contains, and of course, some new entries with newer timestamps (always in ascending order).

What i want to do now is to compare the new file with the old file to check for inconsistencies, whether the old data has changed. The last row of the old file mibht be different from the new file (since its actually an aggregation of a time span that might not have the full interval at the time the file was created) from the new file without beeing an error.

So what I want a process that compares if the new file contains the same data till Rownumber-1 of the old file, ignores the rest of the rows and reliably gives out an OK if that is the case. (Giving the OK could be an Popup window, or writing a file with the name OK etc, writing an empty table…) If differences occus it should write those differences in a csv file.

What wold be a good way of doing this? I am thinking now for a while about it but i don’t really get anywhere with this. I guess my main problem is in case of a difference to write both lines in a csv file…

I read in the forum about the “joiner node” but since i already know that from a certain row on there will be only differences it seems a very wasteful way of doing this (files are pretty large).

Great forum and i hope someone has an idea…
Greetings!

PS the attached file is some testdata just to illustrate.

Upload.txt (1.2 KB)

iperez · April 21, 2021, 8:16pm

Hi @TotalDataLoss what I’d do is to filter both files excluding the undesired rows, then I would concatenate both and use the Duplicate Row Filter with the option “keep dulpicate rows” checked in the Advanced tab. Hope it helps

Mark_Ortmann · April 22, 2021, 9:00am

If the tables have the same order and length this node might do the trick Table Difference Finder – KNIME Hub

TotalDataLoss · April 23, 2021, 11:16am

Hi iperez, Thanks for your answer and sorry for not coming back to you yesterday, i was experimenting a lot with your solution. It’s an interesting one I built a working solution yesterday, so thanks to you. But i ran a bit into trouble with it (but more the general question how to store a “data driven calculation” in a “flat” flow variable in Knime.
I made a separate Post about it since it’s a broader more general question in case you are interested you can find it here:

The solution i built yesterday thanks to you looks like this:

The problem occured on the yellow marking. But interestingly the it did work in the end anyhows
So the approach (as you can probably already see) was to take the rowcount of the old csv, and by cutting out the newly added rows (since its ordered by time of entry) in the new file. So basically reproducing the old file out of the new file then concatenate… and so on.

The endresult was right but has some drawbacks:

One Disadvantage of concatenate is logically that the lines are not directly underneath each other (or side by side), i mean logically first comes all the line from file1 than all the lines from file2.

But it was very handy to learn how you used the “Duplicate Row filter” mark “unique” rows. Something i would have not thought about. This was very slick to know for future tasks.

Thanks and have a good one!

TotalDataLoss · April 24, 2021, 12:48pm

Hi Mark,
Thanks for the tipp. I found the time to test “Table Difference Finder”. I tested this quite a bit but ran into several problems maybe you/someone could look at it.

First i created 2 identical test files (copies, identical hashes) then changed 3 values in the files.

The workflow looks like this:

The first problem was that the Table Difference Finder Node gave the warning:

But the tables are form structure identical i only changed 4 data values inside the table.

To be able to continue i changed the Fail option to never - now the node ran through without problems.
and produces a correct ouput tabe aka finds the 3 changed values:

But then the csv writer failed

My plan at that point was that i wanted to have a file with the differences since i had to switch off the warning for different values. My prefered solution would have been that the Table Difference Finder would warn me in case of a difference in values.

Is there a way that i can get a correct warning or popup that if values or table are different? Or that i can somehow write the output table which spottet the differences correctly to table?

Problem seems to be that they have the type “DataCell”:

The “domain differences” show in the written csv some differences that seem odd too:

My Goal would be that i get a warning if a) values change or b) anything in the table structure is changed - since the files should be identical.
Without the errors the Table Difference Node would be exactly what i am looking for since it gives a beautiful summary where those differences occur.

Is there some hackaround different way to get to this results?

Greetings and have a good weekend!

laaaarsi · April 26, 2021, 1:33pm

Hi TotalDataLoss,

first I would like to thank you for providing a workflow and your detailed description!

Since you changed some values like the volume from 25000 to 25001 this changes the upper bound for the Volume column domain (compare the “Spec” tab in both CSV readers) and therefore the node will fail because on differing specs, as the domain is part of the column spec. This is also the cause for the “domain differences” output. The domain contains information about the possible values of a column as well as upper and lower bounds of a column.

I modified your example workflow a little bit so that you are able to write out the value differences (its basically just a conversion to a String column) and altered the output of the “Domain Differences”. I removed the “Domain equals” column and added a “isSameName” column and did a check over all columns to check if something has changed / is different with respect to the column type, the position and the name.

What exactly do you mean by getting a warning?

I hope this will help you!

Best regards Lars

09_Compare csv_Table Difference Finder_Upload.knwf (30.5 KB)

TotalDataLoss · April 26, 2021, 4:39pm

Hi @laaaarsi, Thanks for your answer and solution.
Unfortunately it’s gotten very hectic the last 2 day and data checking/verificationi is a bit on hold.
If i have some air i will come back to this.

Thanks a lot…

system · June 16, 2021, 1:57pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.