Just got into KNIME and have started my first project. Taking two large Sheets and Comparing row by row, cell by cell. I have been able to get the Row by Row created and working properly but now am wondering what nodes to use to go Cell by Cell, looking for matches for each column of each row, if no match then I want to return the header of the column so I can understand what has changed between the two data sets.
Just took some time to take a look at this node. Seems like it wont work for what I am trying to do. For example, I have two tables with 14 columns. Each sheet has 30k+ rows. I need to find rows that match (Already able to do this), then for the rows that dont match, find the closest match from the other sheet and return the specific column that was responsible for the mismatch.
Can you share your current workflow? Make sure to include the data. Also can you explain this in more detai?
“find the closest match from the other sheet and return the specific column that was responsible for the mismatch.”
Can you explain what are these values that are in the columns? Would one be able to bring them together into one string and compare them? Maybe like this:
Once you have matched the closest Row you could iterate over them and see which ones do match exactly and which are not. I think this should be doable.
Maybe you can provide us with sample data that would closely match your task without spelling any secrets.
So essentially what the issue is → Once I complete my initial comparison from Data set 2 (NEW DATA) to Data set 1 (OLD DATA) → Cell values in set 2 have potential of being updated since set 1 was released.
For Example: Rows are now arranged differently in cell 2. Also instructions for John Doe now reads “Leave at door” instead of “Leave @ door”.
I am trying to sort out the rows from data set 2 that have matching rows in data set 1 (including duplicate entries). Next for the mismatches, re-compare and start by looking for matching First and Last name - if no match for either or both then its a “new entry”. If there is a match for first and last name, compare the remaining columns and find which columns changed between the data sets → append column containing the mismatched columns headers.
if you dont have a proper order no, customer key or similar identifier, you will not get a 100% solution programmatically without hardcoding “decisions”.
I would go back to the data owner and ask for proper datasets.
if you task is to work around this limitation, you will need to make up your mind about the matching first: people dont always provide their real phone number or it will change. last names change too (marriages) etc.
if all of this aint no issue, you can simply unpivot around your identifiers, concat those tables by prio (highest first) and then pivot: all columns once for FIRST to get the latest value, and once with UNIQUE COUNT. everything with UNIQUE COUNT > 1 has had changes
The goal of this comparison is to identify changes between data sets. Like a new data sheet for customers come in, then review of what each customer has had changed(Address, Phone #, Zip). In the case of new First/Last name, this would be considered a new customer. So if first/last name match then the comparison of the remaining rows can be done, if no matching first/last name → new customer