Joining tables on dates

Hi all,

I have two Excel tables which have to be joined specially. The first table contains events with a certain date, the second contains measures with another certain date. For example:
Table 1:
Event 1, 18.4.2016, Identifier 1
Event 2, 20.8.2016, Identifier 1
Event 3, 22.10.2016, Identifier 2

The Table 2 contains measures to each Identifier:
Measure 1, 23.2.2016, Identifier 1
Measure 2, 21.7.2016, Identifier 1

The task is to joint Table 2 into Table 1 to get following table:
Event 1, 18.4.2016, Identifier 1, Measure 1, 23.2.2016
Event 2, 20.8.2016, Identifier 1, Measure 2, 21.7.2016
Event 3, 22.10.2016, Identifier 2

Note that it may happen that an Event has no Measure or that an Event has more than one measure (even more than two measures).

What do you think, is that possible in Knime?

Hi,

You just need to use a “Joiner” node and set the “identifier” column as the joining column and “Left Outer Join” as the join mode (if the first table is connected to the top port).

Best,
Armin

Hi,

Does the joiner take into consideration the date of the event which has to been mapped to the Measure with the date before? Please look to my example: Event 1 is mapped to Measure 1 and Event 2 is mapped to Measure 2 although both events have the same identifier. Therefore I need probably a more intelligent way to join the tables. Isn’t it?

In your example the date for event 1 is after the measure 1 (identifier 1) but the date for the second event is vice versa. So I didn’t know you want to apply the rule as well.
Please explain your desired output perfectly and I will try to help you again.

Armin

Hi Armin,

thanks for looking into it. You were right, I messed the dates up. I corrected them in my original post.

So first convert the date values to “Date&Time” format by using a “String to Date&Time” node and then the “Joiner” node as I mentioned:

And finally use a “Rule-based Row Filter” node with the expression below to include rows in which the first date (table 1) is greater than the second date (table 2):
$date1$ > $date2$ => TRUE

Best,
Armin

1 Like

Hm, this sounds easy and I have done this already in a first try. But if I take my example, I would get after the join either
Event 1, 18.4.2016, Identifier 1, Measure 1, 23.2.2016
Event 2, 20.8.2016, Identifier 1, Measure 1, 23.2.2016
Event 3, 22.10.2016, Identifier 2

or

Event 1, 18.4.2016, Identifier 1, Measure 2, 21.7.2016
Event 2, 20.8.2016, Identifier 1, Measure 2, 21.7.2016
Event 3, 22.10.2016, Identifier 2

A rule based row filter would be useless after the join. Am I right?

Is there any criterion to differentiate between events and measures? I mean for now we have those counters, but are there the same counters in your real data (or something similar)?
If yes, the second row in the final output can be excluded as well. And if you want to keep the last row in the joiner output, we can add an additional rule to the Rule-based Row Filter node.

Armin

1 Like

Hi Armin,

first to the easy things: yes, I need the last row of the joiner output.

next: the task is to map those measures to the events which are done next before the events. Therefore the final output needs to be sorted out further. The second line has to been removed. It’s even more complex since there could be for an identifier even more measures than two.

What do you think?

BTW: I didn’t expect that the joiner works this way. Thanks for sharing.

Well, I guess I have found the solution:

The first steps are the same as what I have suggested earlier except I have added a new condition in “Rule-based Row Filter” node to include missing measures.

The rest of the workflow:
I have generated a new ID column. Then I used a “GroupBy” node (grouping on measures and picking the “First” ID). Finally I included rows based on the selected IDs by using a “Reference Row Filter”. The Column Filter node removes additional columns.

Please check the workflow and let me know if this is what you were looking for:
event-measure.knwf (41.9 KB)

Best,
Armin

1 Like

We are coming close to the solution. I need for each event just one line in the output, i. e. the final output has to have exactly as much lines as the table 0:1 (upper left corner in your picture). Row0_Row0 and Row_Row1 has to be removed as well as Rwo1_Row3 since it contains not the nearest measure timewise.

But I think we can handle this with an additional group by events and using maximum aggregation of date (#1). Then we can join both tables again on event and keep only those entries which are identically in date (#1) and maximum(date (#1)).

1 Like

Hi,

I didn’t know that.

Yes. That’s right.

I’m glad I could help you with your issue.

Best,
Armin