Date Scaffolding - Join Dates with Less Than / Greater Than Function

takbb · June 10, 2023, 9:35am

Unfortunately, as you have found, the KNIME joiner nodes don’t currently support non-equi joins. It would be good if one day they do, and I would encourage anybody reading this to go vote on @iCFO 's post in “feedback and ideas”

In the meantime, this “missing feature” has generally meant a number of alterative solutions being proposed. I’ve listed some options below. There may be other alternatives I’ve not considered.

The cross-join and filter solution
One solution to this relatively common problem involves performing a cross join between the tables, followed by a rule engine or rule based row filter to then perform the row matching. This is the approach taken here:

The Binner (Dictionary) Solution
This is not an area I have particularly good knowledge on, but it would be remiss of me not to mention it, as I know it often comes up as a proposed solution to this type of problem.

An example of its use can be found here,

but I’ll leave it to others (@Daniel_Weikert ? ) to demonstrate that one if you want to know more, lol!

The Python pandasql solution
Another possible solution, if you have python installed is to write the data tables into pandas dataframes, and join using pandasql.

This was the basis for an early component I wrote to provide a possible solution, and is demonstrated in the following workflow on the hub:

The looped row-filtering solution
This was a more challenging approach involving looping through the rows on a smaller “lookup” data set and then performing a cross join and row filter on the primary data set with each row in turn. This has the benefits of not requiring the “exploding memory” of a full on cross join where larger datasets are involved and doesn’t suffer “datatype” compatibility issues of the H2 solution. However it is much more complex to write and in most circumstances probably slower. However, theoretically, there may be a point at which it out-performs other solutions if available machine resources are limited. It was the approach used in one of my joiner components on the hub the “Regex Lookup Filter Joiner”.

The local (H2) database solution
My final proposed solution, which is similar to the pandasql idea, but not requiring python is to involves wiring the data to a local sql database such as the H2 database that is available out of the box with KNIME and then performing the join with SQL, and this is my preferred general solution, but there are some limitations because to do this, the tables you are joining need to contain only datatypes that can be inserted into H2 tables. The workaround if they don’t is that use column filters to return a pair of tables containing just the key fields that you want to non-equi-join on, use the component to join those and then join the resultant keys back to your original tables to pick up the remaining columns. With large numbers of columns, it may be better to join together tables containing just the key columns anyway, for performance reasons.

This was also a solution discussed in @mlauber71’s Hub article:

Rather than having to build the additional logic every time, the above have H2 database solutions been turned into components available on the hub and which are described in the following post, which I hope you will find useful: