Efficient way of doing a JOIN ... WHERE ...

I have two tables table_a and table_b the first table has a time span data in stored as two columns start_date and end_date the second table has time point data. I want to join those table in such a way that only those rows match where the time point lies within the time span.

My current way if a two step. (1) do a cross join, than (2) apply a row filter.

This is super inefficient. I have something like 300’000 rows in one table and 4*365 (4 years each having 365 days) in the other one. The resulting table should be something like 500’000 rows but with this approach the table first becomes 438 million rows long before getting filtered down again.

I tried looping over the time pionts but this still takes considerable amount of time. Using a database - which would implement this quite efficiently - unfortunatly is not an option.

Any suggestions, I suppose this is a very common pattern and most of the time will result in the same problem?

2 Likes

Hello @petermeissner,

Can Binner (Dictionary) be used in your case?

indeed this is a common request and there is ticket for this specific and similar joins. (Internal reference: AP-10692). Have added +1 to it.

Br,
Ivan

3 Likes

Well, not a particularly sophisticated idea but you could try and use a local H2 database to join two datasets by time ranges. Depending on the power of your machine and other factors that might work. I tried it with files of different sizes and at least it gives you a result.

Until there is a solution in KNIME you could explore further the possibilities of using SQL to make the join more efficient. I took from this older entry:

There might be some special limitations when working with a local H2 database. Depending on the power of your machine you could try and load the H2 DB into memory which might speedup the process.

1 Like

Hi @petermeissner

is it possible to get some sample data of your two tables. I have some thoughts but enter data manually is to time expensive. Hopefully its possible, thanks.

BR

Thanks. External tools - like a database is not an option. I cannot just add tools.

@petermeissner well that is the beauty of it, it is a database but works just as a standalone file, even in memory. You have your own database integrated into KNIME without the need for further installations or tools. I would recommend checking it out.

1 Like

Thanks!

This equivalent to an SQL-Join…Where…Between.

This will not fit all use cases but it really fits well the time point within time span use case I am trying to solve. Works great for me and is super fast.

test_join_where_between.knwf (28.6 KB)

2 Likes

Hello @petermeissner,

glad to hear that. Additionally, if applicable, I suggest to use new Joiner (Labs) node to gain more performance.

Br,
Ivan

Thanks. Supercool. Its fast and might be the solution to a vast amount of data management problems that were so far out of reach or simply very cumbersome.

test_join_where_between.knwf (44.0 KB)

2 Likes

That is one of the big benefits of KNIME also being a platform. It happily incorporates (and connects to) other tools and functions. So there is no need to choose - you can have it all :slight_smile:

Wait until you discover it’s great scalability …

1 Like

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.