Add a "Advanced Joiner" base node

takbb · June 13, 2023, 6:09am

Hi @iCFO , as I know you use the Join Custom Condition, I’d be interested in your views, and if you’d be in a position to “beta test” a rewritten upgrade to it …

For large datasets this should be significantly faster than its non-indexed predecessor and it also shouldn’t generally suffer from overall problems of column data type compatability with the H2 database. It now only tries to upload the columns to H2 that it finds in the conditions, and not all of the other columns. It also tries to index the columns used in the condition. If the theory works it represents a massive performance improvement. It is important to note that for indexing to work, tables must be referenced in the conditions using the lower-case t1 and t2 names, and column names must be included in double quotes. See the help doc on the component. Be good to know how you (or anybody else reading this) get on. TIA.

How much of a performance gain is “massive”?

Well… I joined a 250k row table with a 150k row table using a simple equality join on one id column from each table.

The standard joiner node took 1.5 seconds to complete the job.

The new indexed version of the component took 3 seconds.

The old non-indexed component took 1.5 hours!

My original python based pandaSQL component took 4 hours.

It seems to be quite good…