Table difference Finder - incorrect differences in column name identified

taylorpeter55 · January 24, 2024, 1:58pm

Hi all,

Hoping for some help please.

I’ve got two files that i’ve read in to KNIME - one is an XLSX file and the other is an XML file - I want to check the differences between the two files.

I am using the Table Difference Finder node and in the output, KNIME is telling me that select column names in my two data sets do not match - please can someone check what I have done below and advise if there is something i have missed?

Below you can see the output for my two datasets.

image1617×180 78.8 KB

I’ve formatted the XML data into columns to enable the use of Table Difference Finder.
To me, the columns headers look the same so I am not sure why its finding differences.
Interestingly, all columns that didn’t have " marks in them to begin with are matched correctly in the comparison.

As you can see below, i’ve tried removing the " marks from the Column Name but a difference is still found.

When I say a difference, I mean that the columns from the separate sources are not being matched correctly for comparison.

3)I’ve tried removing white spaces from both column headers (none found) and i’ve also checked the length of both column headers (see below screenshot) - the length in both cases is 34 characters.

I think this screenshot shows best that the two column names appear to be 100% the same.

Please can anyone advise what I am missing here? Why is KNIME recognising these columns as having separate names?

Thanks in advance.
tp

takbb · January 24, 2024, 3:04pm

Hi @taylorpeter55 , it’s not easy to guess from the screenshots, would it be possible for you to upload a small workflow with just a sample of the failing rows?

The difference finder appears to saying that the value 566.4 appears in one table but is missing in the other. Does it definitely appear in that column in both tables. I don’t think you’ve shown us a picture of the two tables (only the result of the difference finder) so we can’t tell what it is actually comparing.

morpheus · January 24, 2024, 3:40pm

Hi @taylorpeter55,
i’ve never used Table Difference Finder.

You’ve created a table based on the xml-File and compared it with a table of an imported .xlsx file. One reason for the difference could be difference in column data types e.g. String vs. Number (double).

Did you checked it before?

BR

taylorpeter55 · January 24, 2024, 4:51pm

Hi @morpheus Thanks for your reply. I did already check this and all columns in both the XLSX tab and XML table are classed as String.

Thanks.
tp

taylorpeter55 · January 24, 2024, 5:17pm

Hi @takbb,

Thanks for your reply - hoping you can spot something here, it’s driving me crazy haha

Please see attached an example workflow.

All column names definitely appear in both the XLSX file and the XML and both have the same value of 566.40.

Thanks.
tp

XML comparison example workflow.knwf (302.2 KB)

takbb · January 24, 2024, 7:17pm

OK, @taylorpeter55 , you have a Unicode non-breaking space character \u00A0 in one of the outputs at the point where you have a regular space (Unicode \u0020) in the other. I suspect this is there courtesy of Excel.

This is between the words “Amount” and “Currency”. You can read the text downwards in the first column in each table.

From the workflow, it looks like you are still on KNIME 4.7. There is a String Cleaner node in KNIME 5.2 but I don’t think that would help here, because you actually want to convert these into regular spaces rather than strip them out completely otherwise the column names still wouldn’t match.

You can resolve this using String Replacer with the following config with the following regular expression to find all “space” characters

\p{Zs}

(put a single space in “Replacement Text”)

and I would suggest doing this both for your XML column and the empty_a column, just to be on the safe side.

That’s all a bit of a faff, so I have wrapped the String Replacer into a component called Unicode Space Replace so you can make use of that instead if you wish:

Place this (or String Replacer) in the workflow as shown here:

If you are interested in how I found the set of character codes for each item, this was done with a another couple of components:

Filter out a SINGLE row of data to be inspected (it gets horrible if you try to apply this to more than a single row!) and string the two components as follows:

Configure the Cell Character Splitter to split the column of interest into individual characters (one column per character), then transpose gives one character per row. Column Auto Type Cast makes it a String in case it has become “undetermined type”, and then the Character ASCII and Unicode component is configured to inspect the required column.

This will give you the ASCII and UNICODE for each character in the String.

taylorpeter55 · January 25, 2024, 9:35am

Hi @takbb,

This is amazing- thanks very much! You’ve honestly saved me so much time and thankfully i’ve also still got some hair left haha

I tried to download the Unicode Space Replace and Cell Character Splitter but unfortunately, I get the following error shown below. Do you have any idea why this might be?

However, and most importantly, i did manage to solve my issue using your tip about using String Replacer with the Regular Expression.

Thanks again.
tp

takbb · January 25, 2024, 12:07pm

Hi @taylorpeter55 , glad to hear you got your data sorted. Thanks for marking the solution.

That error message is an odd one. Was that actually during the download from the hub that it showed that message?

It looks very “internal”, and there’s nothing particularly special about the components as they are just using standard nodes. Which KNIME version are you using, and which browser are you using during the download?

taylorpeter55 · January 25, 2024, 12:31pm

@takbb the issue happened when I tried to drag and drop the nodes into the workflow directly.

I’ve got version 4.7.5. as it’s the latest version approved by my organisation.

I’ll bookmark this conversation so i can re-visit it once we have a newer KNIME version available.

Thanks.

system · February 1, 2024, 12:31pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.