Find matching or nearest datetime stamps in 2 tables

Hi,

I have two txt files which I read in, format and also convert string to datetime for one column each table. Everything works so far.

In the end the tables look like this:

table1
row no column11 column12 column13 datetime1
1 1 Test 7 2016-07-13 12:48:20.417
2 5 TestValue 8 2016-07-13 19:07:37.133
3 2 rudi 1 2016-07-14 03:29:14.450
4 3 No Text 8 2016-07-14 05:57:04.371
table2
row no column21 column22 column23 datetime2
1 Text1 -1.5 2.5 2016-07-12 23:40:44.124
2 sample text 1.6 2.6 2016-07-13 05:00:00.158
3 Yes 0.4 3.2 2016-07-13 12:48:20.417
4 No -1.8 2.2 2016-07-13 17:57:34.536
5 Other -1.2 2.7 2016-07-13 19:07:38.356
6 mine -0.7 2.8 2016-07-13 21:29:07.290
7 day 0.4 3.3 2016-07-14 00:50:56.570
8 text3 0.9 2.3 2016-07-14 03:50:58.737
9 hope -1.0 2.9 2016-07-14 04:26:18.601
10 no -0.2 2.1 2016-07-14 04:58:59.615
11 Yes 0.6 3.4 2016-07-13 07:44:58.933
12 False 1.0 2.4 2016-07-13 09:52:07.715

 

Now I want to use a filter function for the rows of table2 in a way that only rows pass to the output that have either exactly the same datetime stamp as the row in table1 or (if no exact match exists) that is the nearest datetime stamp in positive direction (next). In other words: I want to cycle through all rows of table1, get the datetime1 stamp, look for the datetime2 stamps of table2 to find the equal or nearest next one and pass the result row to the output. So the output should contain a filter set with n rows of table2 (where n is the row count of table1).

So the output should look like this:

output
row no column31 column32 column33 datetime3
1 Yes 0.4 3.2 2016-07-13 12:48:20.417
2 Other -1.2 2.7 2016-07-13 19:07:38.356
3 text3 0.9 2.3 2016-07-14 03:50:58.737
4 Yes 0.6 3.4 2016-07-13 07:44:58.933

Thanks for any help!

Hi Juppes,

might I ask you one question. From where is this question triggered? I got so many very similiar questions during the last days and I am only curious if this is a big coincident?

Cheers, Iris

Hi Iris,

I am sorry I am a complete newbie to Knime and to this forum as well and so I obviously made some mistake.

I tried to post this question the first time just after I registered here. Then I saw that my post stayed in the "my contents" box and was not visible in the general forum as I defined it to be. I looked for it a few times again and the post did not appear in the forum.

So I decided to have made something wrong and posted the question again. But also the new post stayed in the my contents box and was not published in the next hours.

Today when I looked into this forum again I saw that there were both posts shortly one after another.

I am not familiar with this forum technique and made other experience in other forums so I must excuse for that.

If you (or anyone else) is able to delete one of the two posts it would be nice. I also wanted to delete one but unfortunately did not find the way to do that.

However it would be nice if you or anyone else has an idea for an answer to my question;-)) I tried to find a solution prior to my posts but did not find any hint to my problem.

Regards, Juppes

--------

Just a short update:

Now that the posts are published in this forum I was able to delete the first (unanswered) one myself:-)

Juppes,

Iris's confusion / déjà-vu is due to the following, also accidentally duplicated post (appears to happen a lot lately, too):

https://tech.knime.org/forum/knime-general/comparing-date-ranges-from-two-tables

I'll post its main ideas again (didn't get tried/confirmed so far), with some simplification to a step-by-step proposal to match your needs:

  1. Create unique RowIDs on both tables
  2. Add the new RowIds as a column on either table (RowId1 and RowId2, achieved with the RowId node)
  3. Filter down the columns to only keep these "RowId" and "datetime" columns.
  4. Cross-join both tables
  5. Compute the time differences between datetime1 and datetime2 (let's call this the "selection table")
  6. GroupBy node: aggregate "Timediff" column to "Minimum"
  7. Reference row filter node: upper port gets "selection table", lower port gets previous step's "GroupBy" result table. "Include" columns on "time diff" v "time diff minimum"
  8. Use the result of the previous step to "Reference row filter" RowIds from either original table.

HTH.

-E

Hi Ergonomist,

thanks for the quick answer, it looks like it should be doing exactly all that I want ;-))

But I need a little help on the GroupBy node, I am not so familiar with that one (like most of the others). I assume that I have to configure here (4 tabs):

- Group Column: "Timediff"

- Manual Aggregation: ?? (RowIDs of both tables?)

- Pattern Based Aggregation: ?? (I assume nothing)

- Type Based Aggregation: Data Type: Number (long) (because I calculated the Timediff in ms); Aggregation: Minimum; (any further setting?)

- Any further setting in the "Advanced settings" at the bottom?

Thanks again for your help.

Hi Juppes,

Sorry, through your question I realised that I missed out on one: you need to set the "Group" tab to the "RowId1" column. "Timediff" gets manually aggregated to "Minumum". Make sure to leave the other aggregation option unfilled though, as they will otherwise generate aggregations of their own.

So you are basically looking for one "Minimum timediff" per "RowId1" group field - that's how the GroupBy logic works. It's a database construct which takes some getting used to coming from Excel - been there, done that. :)

HTH,
E

Hi Economist,

thanks a lot for your valueable hints! There was one point left out (glad I found it myself ;-) ): The 'time diff' values had to be absoluted by the Math Formula node additionally after calculation because otherwise the Group By node always takes the minumum (far away minus) value.

Thanks again, you made it work!

Juppes

Hi Juppes,

Absolutes absolutely make sense! ;-) Glad to have gotten you started. :)

Cheers
E