Handling of Duplicate Datasets in Tables

Hi all
Here is a question about cleaning a dataset.
I have two tables: Data and Dictionary.
My target is a result table with a 1:1 match of data and dictionary values.
The total number of entries in both tables is the same.
I use the Value Lookup node - so far, so easy.

The problem is that the lookup column in each table occasionally contains duplicate values. This produces multiple (n:1) matches against the dictionary table, and consequently some entries in the dictionary table are never matched to the data table at all.

I have a way to filter the orphans - by chaining Value Lookup nodes one after another.
I also have a “clean” result table without the duplicates - using the Duplicate Row Filter node.

Now comes the tricky part:
How do I get the orphan rows back into the results, matched against the not-yet-matched dictionary entries?

Everything I try just gives me the duplicate results again, rather than matches to the not-yet-matched data.

I feel like I am in a catch-22. Grateful for any ideas…
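To make it concrete, here is the kind of one-to-one matching I am after, sketched in pandas with made-up sample data and column names (what I am looking for is the KNIME equivalent of this):

```python
import pandas as pd

# Hypothetical tables: the lookup value (amount) occurs more than once on both sides.
data = pd.DataFrame({"amount": [24.0, 24.0, 30.0, 12.5],
                     "bank":   ["Chase", "Citi", "Citi", "Chase"]})
dictionary = pd.DataFrame({"amount": [24.0, 30.0, 30.0, 99.0],
                           "note":   ["groceries", "gas", "gift", "rent"]})

# Number the occurrences of each amount within each table, so the first 24.0 in the
# data only matches the first 24.0 in the dictionary, the second the second, and so on.
data["occurrence"] = data.groupby("amount").cumcount()
dictionary["occurrence"] = dictionary.groupby("amount").cumcount()

# Full outer join on (amount, occurrence): 1:1 matches plus the orphans of both tables.
result = data.merge(dictionary, on=["amount", "occurrence"], how="outer", indicator=True)
print(result)
```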

Hi @Cal_Steffen,

I think this would be easier to solve if you shared a workflow (containing demo data).

To me, it sounds as if a Reference Row Splitter could be helpful.

Also, the Joiner allows you to separate the orphans from each table into separate outputs.

When I use a dictionary table, I try to keep the entries in this table unique. Can you tell me why yours has duplicates?

Thanks for the answer.
I am trying to do an account reconciliation.
I have different bank accounts that accrue charges → My data tables.
On the other hand I track my spending in a tracking app → My dictionary.

The way I check whether my credit card statements are correct is that I match the amounts and see if each charge has a corresponding entry.
I do make unique entries, but the value I match on is the amount spent, and that will have duplicates.
I want to find a way to match each value only once and then see if there are any mismatches - e.g. missing entries in the cash tracking app, wrong charges, or mismatches due to accidental wrong entries in the cash tracking app.

The data is all my credit card stuff, so I need to find a way to provide a clean sample dataset.
Is there a way to share the workflow without the data?

I think the GroupBy node will help you. Before joining on the amount you should de-duplicate the data.

I have uploaded a workflow to KNIME Hub which shows the concept I have in mind.

I have two different data sources: data from two banks and one table with tracking app data. Before joining, identical amounts are grouped, and the number of occurrences as well as the sum for each amount is calculated. The bank(s) which charged the amount are also kept using the List aggregation method. This means that no information is lost.

After joining, I replace missing numbers with 0 for integers and 0.0 for doubles so that I can use the Math Formula node to calculate the difference. After that, a Rule Engine node describes each case.

The resulting table allows you to check the rows with differences. Duplicates won’t occur.
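If it helps to see the idea outside of KNIME, here is a rough pandas sketch of the same concept; the tables, column names and rules are invented for the example:

```python
import pandas as pd

# Hypothetical bank charges and tracking-app entries, matched only on the amount.
bank = pd.DataFrame({"amount": [24.0, 24.0, 30.0, 12.5],
                     "bank":   ["Chase", "Citi", "Citi", "Chase"]})
app = pd.DataFrame({"amount": [24.0, 30.0, 30.0]})

# Group identical amounts: count occurrences and keep the list of banks,
# so no information is lost; the sum per amount is count * amount.
bank_grouped = (bank.groupby("amount")
                    .agg(bank_count=("bank", "size"), banks=("bank", list))
                    .reset_index())
bank_grouped["bank_sum"] = bank_grouped["amount"] * bank_grouped["bank_count"]

app_grouped = app.groupby("amount").size().reset_index(name="app_count")
app_grouped["app_sum"] = app_grouped["amount"] * app_grouped["app_count"]

# Full outer join on the amount, then replace missing counts/sums with 0
# so the difference can be calculated.
merged = bank_grouped.merge(app_grouped, on="amount", how="outer")
merged[["bank_count", "app_count"]] = merged[["bank_count", "app_count"]].fillna(0).astype(int)
merged[["bank_sum", "app_sum"]] = merged[["bank_sum", "app_sum"]].fillna(0.0)
merged["difference"] = merged["bank_sum"] - merged["app_sum"]

# A simple rule set describing each case, similar to the Rule Engine.
def describe(row):
    if row["difference"] == 0:
        return "OK"
    if row["app_count"] == 0:
        return "missing in tracking app"
    if row["bank_count"] == 0:
        return "missing in bank data"
    return "count/amount mismatch"

merged["case"] = merged.apply(describe, axis=1)
print(merged)
```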

Of course, this is simplified. In real life you will have more information which you could use to improve the checking (date of payment, payment reference, etc.). There might also be edge cases which are not covered in my example workflow.

I hope this helps. You may still share your workflow. In that case, I suggest that you replace your real data with fake data. Via Google you can find many possibilities if you don’t want to type everything by hand.


Alright.
I got a chance to look at your suggestion, but I don’t think it will solve my issue. The repeated charges are not necessarily from the same merchant; across the statement there are simply charges with the same amount, and I would like to be able to separate them out.

I do like what I learned from you about the Rule Engine and will definitely adopt it to make my result easier to look at - maybe in one result table instead of the four different ones I have at the moment.

So, to show what I did, I made some sample data and uploaded my workflow to make further conversation easier. I think I still need to name the sample data better, but I hope it will do for a start.

You can see my problem by looking at the charges of $24 and $30, which get matched to the same credit card charges for different cash app entries. At the same time I have lost an entry in the credit card mismatches (in the case of the $30 charge, the entry from Citibank disappeared - it is not a mismatch, but it is not in the matches either…)

Feel free to have a look at this convoluted mess:
[KNIME_Data_Merge_All_Cards_v2 – KNIME Community Hub](https://Link to Workflow)
Bank of America Credit.csv (418 Bytes)
Bank of America Debit.csv (593 Bytes)
Cash-App-Data.csv (4.5 KB)
Chase Credit.CSV (932 Bytes)
Citi Bank Credit.CSV (1017 Bytes)

Apple Credit.csv (1.0 KB)

I realized there was a mistake in the Citi Bank CSV.
Here is the new version:
Citi Bank Credit.CSV (1020 Bytes)

In the meantime I have figured out why I lost the entry Citibank Charge 8. I guess it comes with three entries, and my workflow is only able to look at the first and the last entry in the case of duplicates (which is exactly the problem I am trying to solve…)

With all this, bear in mind that this is my first workflow :slight_smile:
Happy about any improvements and suggestions.

Wow, that is a large first workflow. :slight_smile:

It did not run through out of the box; I had to change a few things. I got rid of the upload widgets and focused on your main component. Also, I placed the data inside the workflow’s /data/ folder. This is good for sharing, because you don’t have to upload many files, and private information like your username and folder structure won’t be revealed.

I checked the CSV readers. Some columns were missing. I inspected the Transformation tab and saw that some were filtered out. In my experience it is a better strategy to read all columns and explicitly use a Column Filter or other transformation steps later.

Next thing I saw were the metanodes and joiners. Inside the metanodes you use many Constant Value Column nodes. These take up a lot of space. Check out the Column Expressions node: with one node you can create many columns (one per expression). This works for constant values as well as complex formulas, dates, etc.

I then checked how you connect your data sources. You defined very complex join criteria and did full antijoins. Result: all rows of both tables land in the same table structure. It seems to me that Concatenate is the node you want to use.

The result would be the same: 75 columns of transaction data.
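In Python terms, the difference is roughly this (the table contents are made up): Concatenate simply stacks the rows, no join criteria needed.

```python
import pandas as pd

# Two hypothetical bank exports with the same column layout.
chase = pd.DataFrame({"date": ["1/5/24"], "amount": [24.0], "bank": ["Chase"]})
citi  = pd.DataFrame({"date": ["1/7/24"], "amount": [30.0], "bank": ["Citi"]})

# Concatenate = stack the rows of both tables into one table.
all_cards = pd.concat([chase, citi], ignore_index=True)
print(all_cards)
```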

I was not able to execute the workflow further, because the Year to Date file was missing. Also here I think Concatenate would be a better node choice.

Then comes the Value Lookup part. I am not experienced with this node. I still mostly run version 4.7 of KNIME, and the mix between old and new UI dialogs is strange to me. I would rather use the Joiner here: do a full outer join and you see where it matches and where it doesn’t. Regarding the handling of the duplicates: you need a common identifier. The value/amount alone is not sufficient. It could be a date; this is why I used String to Date&Time (knowing that the day you pay for something with your card is not necessarily the transaction date).

I would use the Note column in your Cash App data and, for example, the Description or Details columns in your other datasets. Go over your lists and accounts, find corresponding transactions, and put the same word/number/etc. in the columns of both tables. The entry must be unique and must match exactly, including case. No typos allowed.
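Sketched in pandas, the idea of a composite identifier looks roughly like this (dates and columns are invented); the indicator column plays the role of the Joiner’s three output ports:

```python
import pandas as pd

# Hypothetical tables: matching on the amount alone is ambiguous,
# so a second field (here the date) is added to the join key.
bank = pd.DataFrame({"date": ["1/5/24", "1/7/24"], "amount": [30.0, 30.0],
                     "bank": ["Citi", "Chase"]})
app = pd.DataFrame({"date": ["1/5/24", "1/9/24"], "amount": [30.0, 30.0],
                    "note": ["gas", "gift"]})

# Full outer join on the composite key (amount, date).
joined = bank.merge(app, on=["amount", "date"], how="outer", indicator=True)

matches      = joined[joined["_merge"] == "both"]        # matched on both sides
bank_orphans = joined[joined["_merge"] == "left_only"]   # only in the bank data
app_orphans  = joined[joined["_merge"] == "right_only"]  # only in the tracking app
print(matches, bank_orphans, app_orphans, sep="\n\n")
```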


Hope this helps. Here is the link to the edited workflow.


Hi JLD

Thanks for putting so much thought into this!
Sorry about the missing file - in the sample dataset it is literally an empty sheet with just the column names, but I did forget to include it. I have attached it here now.

I did replace the Constant Value Column nodes with the Column Expressions node.
However, I could not get that node to
a) rename columns - is that even possible?
b) change a string to a date without an error.

On the topic of changing string to date: it seems my bank thinks it very funny to change the date format from mm/dd/yyyy to m/d/yy or mm/d/yy or whatever the monkey with the typewriter thinks is a good idea this month. Is there any way to make the String to Date&Time node smart enough to just figure it out based on a pattern like m?/d?/yy?? with “?” also allowing for no character to be present?
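In Python terms, what I am after is something like this little sketch: try a list of known patterns until one fits (the format list is just a guess and would need to grow with whatever the bank comes up with):

```python
from datetime import datetime

# Date formats seen so far; strptime also accepts un-padded values like 1/7/24.
FORMATS = ["%m/%d/%Y", "%m/%d/%y"]

def parse_bank_date(text: str) -> datetime:
    """Try each known format in turn and return the first that fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {text!r}")

print(parse_bank_date("01/07/2024"))  # -> 2024-01-07 00:00:00
print(parse_bank_date("1/7/24"))      # -> 2024-01-07 00:00:00
```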

Your suggestion about including all columns didn’t make sense to me. Why read data that I won’t use? Won’t that just slow down my workflow?
Also, this way I was able to simply copy the chain of constant value columns and other nodes up to the joiner for all cards without having to individualize them too much. What would be my advantage the other way?

I did replace the Joiner node with the Concatenate node. I learned about it after I had already built my workflow and just didn’t want to redo it - but since the workflow has grown larger, quicker execution now comes in handy.

Regarding Value Lookup - well, it is the equivalent of VLOOKUP in Excel. You specify a value from a data table to be looked up in a dictionary table and return the rest of that row.
I will try out the Joiner node as an alternative - this might actually save me some complications later in the workflow, as it will give me the mismatches and orphans directly. I haven’t had the time to play with that tonight, but will pick it up some day.

Regarding changing the notes or finding a different identifier: I simply can’t change them. On the one side they come from my bank - and the bank will write in whatever it wants. On the Cash App side they come from my beloved family members, whom I can’t even get to put the toilet paper on the holder the right way around or load the dishwasher like a reasonably sane human being, so I certainly won’t get them to comply with a predefined note pattern :grin:

As a solution I’d even be happy with a prompt to manually assign the matches - is that a possibility?

Year to date 2024-01-29.xlsx (115.8 KB)
