join tables based on validity timestamps

schneich · September 1, 2023, 9:54am

Dear community,

Question: How do you join a table, if you have to check something like “take the timestamp from table A und see, if you find a timestamp on table B, which is the same or older. If older, only the newest one is valid.”

Background: we are migrating our employee data from an old application into a new one. The old application has a main table with the employee data and 3 referenced tables, each of which provides historical information on some aspects of the employee. The key for the referencing the tables is the employee ID.

The new application has a main table with the employee data and 1 referenced table, on which we have to combine the information of the 3 tables from the old application.

Unfortunately, on each of the old tables, there is only a “valid from” timestamp. Each of the 3 tables has entries independent form the other tables. When joining the data, we need to find the rows belonging together, based on the timestamp.

Example file with the 3 tables: Test.xlsx (14.7 KB)

Any input is very much appreciated.

takbb · September 1, 2023, 11:19am

Hi @schneich , and welcome to the KNIME community!

KNIME doesn’t have a standard node for performing non-equal joins and the joiner node handles two tables at a time, so for such joins we have to sometimes get a little creative.

I gather that MA_Nr is “employee number” in the three tables.

Using Employee with id 1 as an example from your data, can you give an idea of the steps you would take if performing this manually and the output you are expecting for that employee, as I wasn’t sure which tables took priority in the date ordering, and what the end result would look like.

Here are screenshots from your file for ease of reference:

Tab1

Tab2

Tab3

(the ####### in Tab3 is the value -1 , which I presume means missing/null )

I also wasn’t sure, if we look at rows “Row0” and “Row2” how you would choose between them if they happened to be the “match” as they both have the same date.

btw thank you for uploading a sample data file, which is always helpful. So many times we just get a screenshot and we are expected to “imagine” how it might work, or have to type data in ourselves before we can even get started!

schneich · September 1, 2023, 12:31pm

Hi @takbb,

thank you very much for your quick response. The target-table looks like this:

Well, by looking at it, I realise that there are some duplicate time stamps in there and it is not clear, which one would be the correct row. My colleague will check with HR on monday. For now lets take the assumption that we can make the timestamp unique per source table.
Yes, Ma_Nr is the employee-ID.
If doing this manually, I would…
** …create a new master table, with all the collums of the 3 source tables.
** Then, I would copy all the data into their own rows. This would result in many empty collums.
** Then, I would sort after employee ID and the timestamp.
** Then, where there are empty fields, I would fill in the blanks with data from rows before.

new test file: Test.xlsx (18.0 KB)

Does this make sense to you?

Chris

takbb · September 1, 2023, 2:40pm

Hi @schneich , Thanks for the additional target output.

I’m still not quite “getting” the rules for comparison, and as a result I don’t understand how you ended up with 6 rows in the target output for Emp 1.

As the sample data contains similar dates across all the tables, it’s difficult to work out which table each of the “Validity” dates originated from for each row.

When you are comparing dates, you mentioned the principle: “take the timestamp from table A und see, if you find a timestamp on table B, which is the same or older. If older, only the newest one is valid”, but what I don’t think you’ve said anywhere is that when comparing tables 1, 2 and 3, which do you consider table A and which is table B in each comparison? i.e. are you looking for table 1 dates to be newer or older than table 2, and again for table 3… etc.

schneich · September 1, 2023, 2:59pm

Hi @takbb,

the table contain history data to an employee.

Tab1 contains salary related infos. Such as working 100% or 80%.
Tab2 contains address related infos. Employees move from time to time.
Tab3 contains tax related infos.

My wording was probably leading you to wrong conclusions: I should have said, the date needs to be combined/merged.

If a timestamp is unique for an employee, the data is written into a new row of the new table.
If a timestamp (-> date) exists in 2 or 3 of the tables for the same employee, information should be merged into one row of the new table.

Doing this, there will be many empty fields in the table. These have to be filled with the last know state of that attribute.

Example:
In Tab3, there is a timestamp for employee 1 → 2018-10-01 00:00:00
When copying this data into the target table, it would look like this:

The red fields need to be filled as well, with the last known state from the Tab1 and Tab2, e.g.:

morpheus · September 1, 2023, 3:33pm

Hi @schneich,

I am currently not close to a pc. Therfore i try to describe my thoughts in words.

Use group loop for looping over your mastertable with employee and date as group condition.
Apply a rule based rowfilter on each subtable where employee = variable(employee) and date <= variable(date)
Use groupby node on each filter node with employee as group condition and maximum(date) as aggregation.
Join groupby nodes back to your filter nodes employee = employee and date = maximium(date). Now you should have only the latest record of your subtables,
Join your subtables to your group loop start node
Modify your masterdata if needed with e.g. a rule or java script node
remove the subnode columns
Connect to loop end node
This output table now represents your new table.

BR

takbb · September 1, 2023, 5:16pm

Hi @schneich … ok, yes I was fixating on the original wording and having re-read what you’ve said, I think it became clearer, and I now better understand the process.

The “merging” of the three tables can be performed by two joiners, each joining both matching and unmatched rows into a single output table. The first joins t1 to t2 based on MA Nr and date from each table, the second joins the result of that to t3 based again on MA Nr and date, so we then have the three tables merged. It was important on the joiner to “Merge join columns” so that the second joiner had a single pair of columns to join on.

After that, the output can be sorted by ascending MA Nr and Date Order, and missing values filled down in a group loop, (to enable us to fill down one employee’s records at a time).

Hopefully that should be a step in the right direction

The workflow can be found here:

schneich · September 6, 2023, 8:51am

Hi @takbb,

my colleague was able to test your workflow. It works like a charm!

Thank you very much for your help,
Chris

takbb · September 6, 2023, 8:52am

You’re welcome @schneich . Glad it worked for you and thank you for marking my answer as the solution.

system · September 13, 2023, 8:53am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.