Matching rows based on multiple criteria and from different categories.

Hello,

I am new to the forum and a beginner with KNIME. The forum has already been tremendous help in getting started, so thank you to all active members; your help is appreciated! Hopefully the upload below will provide more context.

I have two sources, PARF A and PARF B, that have data I need to match. Both sources were combined in one table but can be separated if needed. Each match must have matching Invoice IDs, matching Acquisition Names, and matching Amounts. However, the transaction codes must be different. The PARF A source has either “ACQU_DR” or “ACQU_CR” transaction codes and the PARF B source has either “AP_DR” or “AP_CR” transaction codes. When matching, I need to ensure that the match is from “ACQU_DR” to “AP_CR” or “ACQU_CR” to “AP_DR”.

An example of a match:

Acquisition Name - Invoice ID - Amount - Tran Code - Source
Cardinal 100001 -10.50 ACQU_CR PARF A
Cardinal 100001 -10.50 AP_DR PARF B

Hopefully this is enough context and information. Thank you in advance for your help!

Hi @ricardomorenojr , welcome to the forum!

Without seeing your workflow and having a sample dataset, it’s difficult to help. Please upload both of these items.

Secondly, can you explain on the other columns you haven’t addressed, such as Data, Age, Payer Name. From which table (Part A vs Parf B) would you like their values to inherit from?

Thank you.

Hi @badger101 thank you for your response. Please find the workflow (not letting me export so can only add a screenshot for now; troubleshooting in the meantime) and sample dataset attached.


Forum Data.xlsx (31.1 KB)

The WF takes the two sources mentioned above and also incorporates two sources/populations which are used to add acquisition names and invoice IDs to the tables used. The bulk of this workflow is standardizing both PARF A and PARF B reports so that a comparison can be made. The two tables are joined towards the end. It boiles down to the last node from which the above excel came from. Please note that it’s possible that the current data set might not have any matches as this is a monthly procedure we perform and we’ve matched what we could for this round.

The other columns would need to be kept. Unlike a join where the matches are combined, I’m looking for a solution that will simply list out the matches based on the criteria above and the remaining columns will follow, similar to a filter. If a join must be used, the remaining columns could be set as “Age (Left)” for PARF A and “Age (Right)” for PARF B or something similar.

Please let me know if you need further clarification. Your help is greatly appreciated!

Thank you @ricardomorenojr ,

I’m trying to visualize what you mean by this:

Unlike a join where the matches are combined, I’m looking for a solution that will simply list out the matches based on the criteria above and the remaining columns will follow, similar to a filter.

For this matter, can you design 3 dummy tables using Excel? Where 1 is for sample dataset from Parf A, 1 is from Parf B, and the final one being an example of the outcome you want?

This way, I can see what goes in and what goes out. If that makes sense.

1 Like

It should definitely be solvable :slight_smile:

Quick draft with a dozen nodes based on what I understood from your initial post.

But agreeing with @badger101, after your enrichment we can come to a final conclusion and validate what we had in mind.

2 Likes

Hey @ArjenEX - Thank you for taking a stab at it; this honestly looks pretty much like what I need.

Attached is the supplementary information asked for by @badger101

REFUNDS FORUM EXAMPLES.xlsx (74.6 KB)

Thank y’all again for the help!

Hi @ricardomorenojr , it looks like the solution attempted by @ArjenEX is very close to the final table you uploaded. Just need some tweaking I think. I’ll pass the baton to @ArjenEX to help you close this case. You’re in good hands! :ok_hand:

1 Like

Thank you for the help! @badger101

1 Like

No worries! :grin: Best of luck to you.

I was playing around your file, to test something out. I’m not sure if this can be an optional solution, but I’ll throw it out here anyway cause I find it interesting thinking if there’s a chance this magic node is all that you’ll ever need:

Here’s what you’ll get:

The table structure doesn’t have the same row arrangement to your outcome table, but it gives you the following:

  1. It rearranges and matches the twin rows on top of one another.
  2. It shows an optional column to point out that the bottom row for each matched pair is the duplicate row.

If this is usable and you’d like to test it out, here’s the configuration for the node:

  1. For the Options tab:

  1. For the Advanced tab:

Let us know how it turns up when you apply it to the real dataset. @ricardomorenojr

2 Likes

Hi @badger101 - the only issue I see with your solution is that it is pairing items that are not compatible. ie. Row 2 and Row 2_dup have ACQU_CR and AP_CR, respectively, and that cannot be a match. I am looking to get only ACQU_CR to AP_DR or ACQU_DR to AP_CR.

Ok thanks for the feedback @ricardomorenojr . I think @ArjenEX will know what to do :ok_hand: Let’s wait for a better solution.

1 Like

Hi Guys… just a point of view… the joiner… If you use it, you can select what it can do too… joins the left, the right, both informations… so, if you just compare and made a match of some information, it will setup as you say for it… see the print below.

  • First, you will set the pairs to match.
  • After that, what will you use? the left, the right, both information? as you can see here, it’ll bring the information to complete your data table.
  • Third, you can output it on a unic output or a separated one… as top to match, only the left table, only the right table and compare if you have doubts.
  • If you have the same columns, it cam be merged!!! so you needn’t to make others transformations.
  • For the end, you can set new rows ID, add a separator with both ids or use the sabe keys…

Don’t forget to set you inputs!!!

At the “Column Selection” tab, you’ll set what you need from the table A, and from table B to complete your information…

Well, I’m just tring to help… I hope that tip can be usefull for you!

Tks,

Denis

1 Like

Hi @ricardomorenojr

Sorry for the delay, was occupied with stuff.

Disclaimer: You can go various ways with this. This is a way to do it. It’s always a tradeoff between like workflow performance, data set size, being comfortable working with certain nodes, etc.

I’ll illustrate two; one trough the Collection route and one through the Appender route. The final result is the same.

General:
I start with renaming the columns of both sources and also adding it as seperate column. This makes tracebility in your joiner and rule engine downstream a lot easier and gives more insights into the source that you are working with. The Column Rename Regex is usefull here whereby you can add a suffix to each column name, in this case I opt for PARF x

The subsequent joiner is pretty straightforward with a left outer join on the fields mentioned in your opening post as being the comparators. Also bringing the entire dataset from PARF B along.


By the looks of it you can find your way through a joiner, otherwise refer to the additional pointers of @denisfi

I see in your workflow screenshot a common practise of Rule Engine + Row Filter. You can combine this when using a Rule Based Row Filter. Applying your logic in accordance with the transaction codes:

($Transaction Code_PARF A$ = "ACQU_DR" AND $Transaction Code_PARF B$ = "AP_CR") OR ($Transaction Code_PARF A$ = "ACQU_CR" AND $Transaction Code_PARF B$ = "AP_DR") => TRUE

Again, better understandable with the renamed columns and less prone to mistakes.

Here, you are at a junction. Since you want to have the entire rows of both PARF A and PARF B underneath eachother, I opt to handle them as entire group and create this structure.

Option 1:
Create a collection of data with all columns related to PARF A. Again, the renamed columns help here because I can apply a wildcard to capture all columns related to *PARF A, KNIME automatically includes all eligable columns. It’s more conveniant to remove the aggregated columns from the input to avoid cluttering. Also designate that these are the aggregated values of A as per the table output.

  • Repeat the process for B.
  • Repeat the process to merge the A collection and B collection.

This will allow you to Ungroup them through the associated node. This places the A and B group rows underneath eachother.

Split the collection back into the original columns with a Split Collection Column node and a Column Rename to rename them according to your table.
Note: this assumes a fixed input format in terms of column order. If this changes, more work is needed to capture that.

This generates the final output that you are looking for in accordance with your provided Excel sheet.

Option 2:
Exactly the same principle, just different nodes. Main advantage: it “remembers” the spec of the columns which the Collection route is not very good with.

image

I’d say it depends on your use case if this is a gamechanger or not.

Option1 execution time: 44 ms
Option2 execution time: 43 ms

You can draw your own conclusions from that :wink:

Note: can agree with your observation on the duplicate row filter suggestion. It needs the code comparison logic to accompony it but in larger dataset it can be used as a preselector to determine candidates subject to comparison.

See WF: Matching rows based on multiple criteria and from different categories.knwf (122.8 KB)

Hope this provides some inspiration!

2 Likes

Eternally grateful, @ArjenEX !! Thank you for the detailed explanation, you have been a great source of help and, of course, inspiration.

Shout out to @denisfi and @badger101 for their respective contributions as well.

Thank you all!

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.