Add a "Advanced Joiner" base node

iCFO · March 3, 2023, 4:13pm

I recently found the “Join Custom Condition” component by @takbb and it has opened up a world of simplicity and possibilities to complex projects. That being said, it still lacks the options and left / right outputs of the Joiner node which requires a series of additional nodes for condition testing, reference filtering, column combining, etc.

I think that the addition of a new “Advanced Joiner” node could superimpose these “conditional” capabilities over the standard “Joiner” node, and it would open up a new realm of possibilities to all KNIME users by maintaining a more user friendly and flexible UI. This node could replace dozens and dramatically simplify highly complex joins.

Basically you could have the interface look the same for your basic Input1/Column1 = Imput2/Column1 join condition logic. Then add additional “criteria tests” (which would effectively function as the “where” clause in a database join). The advanced criteria tests could be managed via a user interface similar to entry of filter criteria in “Row Splitter (Labs)” to provide user friendly and / or and grouping controls. It would need the ability to incorporate “like” / “matches” as well. Both the basic and advanced Join Criteria could be controlled via this Row Splitter (Labs) interface approach as well, if that turns out to be the easier user interface approach.

ScottF · March 7, 2023, 8:20pm

Hi @iCFO -

Good feedback here, thank you. Just so you know, the dev team has put quite a bit of thought into future ideas for Joiner improvement, represented by several different internal tickets (AP-10692, AP-5849, AP-8926, AP-18775) to name a few.

FYI @nan

rrembert3 · March 9, 2023, 8:10pm

Hello -

How do we view the internal tickets to see what has already been requested?
Will advanced joiner node support complex joins? It would be helpful to be able to migrate advanced SQL queries to KNIME. I have tried running multi-join SQL queries using Python but the performance has been quite slow.

Just to be clear, I am aware that I could simply put together multiple join nodes in to accomplish a join between many data-sources, but this is associated with a slower run-time than my previously existing solution in MS Access.

ScottF · March 9, 2023, 8:12pm

Hi @rrembert3 -

Our ticketing system isn’t publicly accessible, sorry for that.
That’s the hope, although the design is still in flux from what I understand.

Edit to add: if this feature is one that’s of particular interest to you, please use the voting system!

rrembert3 · March 9, 2023, 8:20pm

I see, thank you. How do we use voting system? Additionally, would be nice if a direct SQL statement could be used in a join involving at least one non-database source. i.e. if source #1 is an Excel file. This would enable easy migration of MS Access solutions to KNIME, IMO.

iCFO · March 9, 2023, 8:27pm

Hey @rrembert3

Try this in the mean time. I have been using it for highly complex joins between excel files. You just need to make sure your data types are H2 compliant on the way in.

rrembert3 · March 9, 2023, 8:40pm

@iCFO thanks. Can that handle > 2 datasources in a single join? I need both pieces for this to enable my solution.

ScottF · March 9, 2023, 8:40pm

Hi @rrembert3 -

More about voting in this topic:

Basically, just click the big yellow button at the top.

iCFO · March 9, 2023, 9:00pm

Hey @rrembert3

The component is currently only built to join 2 tables as an “out of the box solution”, but the write to H2 in memory approach should be able to be customized for additional tables. I am going to tag @takbb in case he has some pointers for you on the modification side.

Here is an example of a triple table join via database. The H2 in memory approach could be utilized in the same manor although it may require some adjustments to the syntax.

takbb · March 10, 2023, 11:29pm

Hi @rrembert3 and @iCFO ,

The “Join Custom Condition” component was really there to provide a facility for doing more complex or non-equi joins (such as “between” or using wildcards)which couldn’t be handled easily by the standard Joiner node. To keep things simple, I just implemented it for joining 2 tables.

However… I had a little play and have reworked it into two “spin-off” components:

and

NB They no longer have config for the column filters , as this isn’t really practical for a larger number of tables within the limitations of what can be done with a component rather than a real node, and also there is no longer provision for choosing to keep “left unmatched rows” as this again becomes a bit complicated where more than two tables are involved. Column filtering would have to be performed using a Column Filter node after the component.

Within those limitations though, you may find that these have something to offer.

If you take a look “inside” at both the “3 way” versus the “4 way”, you’ll see you could make a “5 way” or more (if you really need to!)

This is the “3 way”

This is the “4 way”

I suspect you can guess what a “5 way” would look like

Each input table to be joined is “encapsulated” in a sub-component “JoinedTableComponent”. This inner component contains a single config item which is its name. So they are named T1, T2, T3, T4… and if you added another you’d call it T5.

Assuming it works the way I intended (and be warned I haven’t done much testing!) , it should allow you to provide join predicates similar to the way they are done in the “Join Custom Condition” node, referencing each table by the table name (e.g. T1, T2 and so on).

An example demo workflow is provided here:

I hope it is of interest, and possibly useful, or at least a starting point for further development!

system · June 8, 2023, 11:29pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

takbb · June 13, 2023, 6:09am

Hi @iCFO , as I know you use the Join Custom Condition, I’d be interested in your views, and if you’d be in a position to “beta test” a rewritten upgrade to it …

For large datasets this should be significantly faster than its non-indexed predecessor and it also shouldn’t generally suffer from overall problems of column data type compatability with the H2 database. It now only tries to upload the columns to H2 that it finds in the conditions, and not all of the other columns. It also tries to index the columns used in the condition. If the theory works it represents a massive performance improvement. It is important to note that for indexing to work, tables must be referenced in the conditions using the lower-case t1 and t2 names, and column names must be included in double quotes. See the help doc on the component. Be good to know how you (or anybody else reading this) get on. TIA.

How much of a performance gain is “massive”?

Well… I joined a 250k row table with a 150k row table using a simple equality join on one id column from each table.

The standard joiner node took 1.5 seconds to complete the job.

The new indexed version of the component took 3 seconds.

The old non-indexed component took 1.5 hours!

My original python based pandaSQL component took 4 hours.

It seems to be quite good…

iCFO · June 15, 2023, 12:37pm

Hey @takbb

I will be starting a very complex audit late next week. I will be using your new indexed joiner in dozens of times in series to pull it off. Thanks for the upgrade! I will let you know how it goes.

takbb · June 15, 2023, 12:39pm

Good luck @iCFO. Obviously let me know if you discover any issues, and I’ll do my best to resolve them. Learning all the time!