Date Scaffolding and Non-Equi Joins at DB Level

qdmt · May 19, 2024, 3:16pm

This might be a question to @takbb, have been using the “Join Custom Condition - indexed - takbb” component with good results over the past year after your recommendation in this thread.

I now have the opportunity to push the logic to the DB (H2) level. Wondering if I can get any pointers or things to watch out for. I imagine a straight SQL query should do the trick, but the component itself is a maze within, so I’m wondering if I’m missing something in the complexity of the challenge.

takbb · May 19, 2024, 3:21pm

Hi @qdmt , glad to hear the component has been working for you.

Most of the internals of the component are there just to make it work generically “as a component” and if you were writing a single specific solution by hand, you wouldn’t need most of what is there. And yes it probably is rather difficult to follow.

The main thing I would start with as a pointer is the significant performance improvement that came about by using indexes in the database.

I’m happy to give guidance but do you have specific questions that I can help with?

takbb · May 20, 2024, 10:12am

Hi @qdmt, A thought occurred to me… if you are wanting to push the logic totally to H2, then I wonder if this demo workflow is of use.

It has a couple of other components I put together where the aim is to place copies of KNIME tables on an H2 database so that you can then quickly perform SQL queries on the data using all the standard KNIME DB nodes.

The components used here are (slightly) less complicated than the JOINER component you mentioned, because their purpose is to only place one table into the database, instead of two (plus all the joining code).

They still perform the indexing though so that SQL queries can execute faster.

qdmt · July 13, 2024, 10:52pm

Apologies for the tardiness here, but I promise I had this tab open to come back to , and wanted to properly tackle this with a working solution first to come back in case there’s any feedback on the approach.

I think in the end - a SQL executor node seems to the trick, but H2 proved too slow, and pivoted to Postgres. It might still need some optimizing.

This is all in one SQL Executor node:

delete from tbl_staging 
where  "End Date" <  
		((select coalesce(max("Scaffold Date"),current_date) as last_scaffolding_date from public.tbl_scaffolded) - interval '2 day')::date;

delete from  public.tbl_scaffolded s
where exists 
(
	select 1 from  tbl_staging m
	where s."KeyID" = m."KeyID"
);

INSERT INTO public.tbl_scaffolded
SELECT 
"KeyID" ,
"Start Date"::date AS "Start Date",
"End Date"::date AS "End Date",
ds.scaffold_date AS "Scaffold Date"
FROM tbl_staging cd
JOIN  public.tb_date_lookup ds on ds.scaffold_date <= CURRENT_DATE and  ds.scaffold_date BETWEEN cd."Start Date" AND cd."End Date";

takbb · July 14, 2024, 8:28am

Hi @qdmt , I don’t think there is much I can add as I don’t know anything more about your task but I can see you have a staging data table from which you want to insert/replace rows since the last staging date into a main (“scaffolded”) table.

Did you describe your task elsewhere on the forum? (wondering now if this thread has been broken out from another thread with more detail about what you were wanting to achieve. My earlier response was very general as I didn’t have any other information about what you were wanting to do. ie I just thought you were wanting to put some data in an h2 database and then do some joins on it).

What you are doing looks fine to me, as here you are just using KNIME here as a vehicle for executing the sql. I’m guessing that maybe KNIME is used earlier to populate the staging table first and then the remainder of the logic is performed by the DB SQL Executor node as shown above. That should be an efficient solution if you are happy with writing and maintaining the sql directly.
Thanks for sharing.

qdmt · July 15, 2024, 1:05am

Hopefully not broken out, this is the only active thread on the topic (from me at least), asides from the older one already linked in the first post of this thread.

So also sharing in hopes of just circling back with a SQL approach to a non-equi join. I tend to use a staging table so that the join is happening db <> db instead of KNIME table <> db.

With this, it takes about 7min to process ~45M rows. Happy for now, but still think a non-equi join should be available natively in the Joiner node

system · July 22, 2024, 1:07am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.