I would like to create a workflow with fuzzy logic for date and time. I have an Excel file with invoice data from vendors. Invoices should be considered duplicates if the date difference is within ±7 days; the dates are then considered similar, and the invoices should be classified as 1.
At the moment, I don't know how to check the whole file for such "similarities": the check has to be performed from invoice to invoice. I hope someone can help!
If you had 4 invoices from vendor A, dated:
(1) 1 Oct 23
(2) 5 Oct 23
(3) 10 Oct 23
(4) 14 Oct 23
would they all be considered the same invoice, since according to the ±7-day rule:
(1) is the same as (2),
(2) is the same as (1) and (3),
(3) is the same as (2) and (4)?
Here, I also have the Invoice Amount, Invoice Number and Vendor ID. The idea is to consider transactions as duplicates if:
the Invoice Amount, Invoice Number and Vendor ID match exactly, and
the invoice date is similar (considered similar if the difference between the dates is within ±7 days).
If those criteria are met, the transaction should be considered a "duplicate".
I would like to check each transaction in my list for such duplicates. This means that I don't use any other list; the similarity search should be done across all transactions/vendors within this single list.
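As a minimal sketch of the pairwise rule described above (plain Python rather than a KNIME node; the dictionary field names are illustrative, not from the original workbook):

```python
from datetime import date

def is_duplicate(inv_a, inv_b, window_days=7):
    """Two invoices are duplicates if vendor, invoice number and amount
    match exactly AND their dates lie within +/- window_days of each other."""
    same_keys = (inv_a["vendor"] == inv_b["vendor"]
                 and inv_a["number"] == inv_b["number"]
                 and inv_a["amount"] == inv_b["amount"])
    close_dates = abs((inv_a["date"] - inv_b["date"]).days) <= window_days
    return same_keys and close_dates

a = {"vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 1)}
b = {"vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 5)}
c = {"vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 14)}
```

Here `is_duplicate(a, b)` is true (4 days apart), while `is_duplicate(a, c)` is false (13 days apart), which leads straight into the chaining question below.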
Ok @DA2023, so we'd take Vendor ID, Invoice Number and Amount into account when matching, but I'm still not sure what happens if you had the following situation:
| LineNo | Vendor ID | Invoice Number | Amount | Date |
|---|---|---|---|---|
| 1 | A | 0123 | 30.00 | 01-Oct-23 |
| 2 | A | 0123 | 30.00 | 05-Oct-23 |
| 3 | A | 0123 | 30.00 | 10-Oct-23 |
| 4 | A | 0123 | 30.00 | 14-Oct-23 |
| 5 | B | 0987 | 30.00 | 20-Oct-23 |
| 6 | B | 0987 | 30.00 | 22-Oct-23 |
What would be the expected outcome? For example, it could be this:
| LineNo | Vendor ID | Invoice Number | Amount | Date |
|---|---|---|---|---|
| 1 | A | 0123 | 30.00 | 01-Oct-23 |
| 5 | B | 0987 | 30.00 | 20-Oct-23 |
But if not, then what would be the rule for determining which of Vendor A’s rows to keep, since each appears to be a duplicate of the one before it?
EDIT: if I am able to assume that the above example is the correct output, then I think this workflow achieves it:
This calculates a max and min date that is considered to be a “duplicate” for each row. It then joins every row to every other row that has the same vendor, invoice number and amount, so that the rows can be compared.
The Rule-based Row Splitter then decides whether the row and its joined companion are considered "duplicates" based on the ±7-day rule, in which case they go to the lower port. All others go to the upper port.
A joiner node is then used to match the “duplicate” rows with the rows on the upper port, keeping only those rows that DON’T match using a “Left Unmatched” join condition, so effectively the joiner becomes a filter.
After this, it removes duplication created through the original join, and then tidies up.
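The same end-to-end logic can be sketched in plain Python, assuming (as in the example output above) that chained duplicates collapse into a single cluster and the earliest-dated row is kept. Field names are illustrative:

```python
from datetime import date

def dedupe(invoices, window_days=7):
    """Chain-based duplicate removal: within each exact-key group
    (vendor, invoice number, amount), sorted by date, a row is a
    duplicate if it falls within window_days of the PREVIOUS row in
    the group -- so a chain of near dates collapses into one cluster
    even when it spans more than 7 days overall."""
    chosen, duplicates = [], []
    prev_date = {}  # (vendor, number, amount) -> date of previous row seen
    for inv in sorted(invoices,
                      key=lambda r: (r["vendor"], r["number"], r["amount"], r["date"])):
        key = (inv["vendor"], inv["number"], inv["amount"])
        p = prev_date.get(key)
        if p is not None and (inv["date"] - p).days <= window_days:
            duplicates.append(inv)
        else:
            chosen.append(inv)
        prev_date[key] = inv["date"]
    return chosen, duplicates

rows = [
    {"line": 1, "vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 1)},
    {"line": 2, "vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 5)},
    {"line": 3, "vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 10)},
    {"line": 4, "vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 14)},
    {"line": 5, "vendor": "B", "number": "0987", "amount": 30.00, "date": date(2023, 10, 20)},
    {"line": 6, "vendor": "B", "number": "0987", "amount": 30.00, "date": date(2023, 10, 22)},
]
chosen, dups = dedupe(rows)
```

On the sample table this keeps lines 1 and 5, matching the example output. Note the chaining choice is exactly the ambiguity raised earlier: a pairwise ±7-day rule without chaining would also keep line 3 (9 days after line 1).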
It would be great if the output could be similar to the Duplicate Row Filter, with a duplicate classifier (chosen / duplicate) and a duplicate identifier (e.g. Row 12), so that the duplicates can be extracted and displayed separately (e.g. for internal audit purposes).
Hi @DA2023, that can be achieved by a small modification to the joins. Instead of keeping or discarding rows, the joiners simply direct each row to a branch that adds the required classifiers, and then concatenates them back into the output.
As you can see, this simply reports the RowID of the row chosen in favour of a duplicate. It could be, however, that ultimately another row was chosen in favour of that one too (as per my earlier question about the rules when there are multiple "duplicates" across a date range of more than 7 days, where Rows 1, 2 and 3 are duplicates of Rows 0, 1 and 2 respectively).
After that, you can deal with filtering using the classifiers.
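To sketch what that classifier output could look like (again plain Python under the same chaining assumption; the column names `classification` and `duplicate_of` are hypothetical), each row is annotated rather than filtered:

```python
from datetime import date

def classify(invoices, window_days=7):
    """Annotate every row instead of filtering: each row gets a
    'classification' ('chosen' or 'duplicate') and, for duplicates,
    a 'duplicate_of' pointer to the chosen row's line number."""
    out = []
    prev_date, chosen_line = {}, {}
    for inv in sorted(invoices,
                      key=lambda r: (r["vendor"], r["number"], r["amount"], r["date"])):
        key = (inv["vendor"], inv["number"], inv["amount"])
        p = prev_date.get(key)
        row = dict(inv)
        if p is not None and (inv["date"] - p).days <= window_days:
            row["classification"] = "duplicate"
            row["duplicate_of"] = chosen_line[key]
        else:
            row["classification"] = "chosen"
            row["duplicate_of"] = None
            chosen_line[key] = inv["line"]
        prev_date[key] = inv["date"]
        out.append(row)
    return out

sample = [
    {"line": 1, "vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 1)},
    {"line": 2, "vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 5)},
    {"line": 3, "vendor": "A", "number": "0123", "amount": 30.00, "date": date(2023, 10, 10)},
]
result = classify(sample)
```

A downstream filter on `classification` then reproduces either the deduplicated output or the audit extract of duplicates.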
If you want it to select a different "chosen" row in the event of duplication, you'll need to adjust the rules in the Rule-based Row Splitter.