Get rolling dataset from SQL-db

mabm · June 26, 2019, 10:55am

Hi
I can connect to our MS SQL database and load a set of data from the last hour. No problem so far.
If I run the workflow a second time, lets say 10 minutes later, Knime should only get the last 10 minutes and add the date to dataset loaded before and of course drop the first 10 minutes. The goal is to have a rolling dataset of the last hour without loading the whole set.

Right now that doesn’t make much sense, because we can only test in the Analytics Platform, but we should get the Server soon an then the workflow have to run every minute. Loading data from the last hour takes about 49sec. I don’t have to mention that the last 11 seconds for the rest of the processing are very short, if you load the whole bunch of data every time.

Does anybody have an idea how to do this?
Thanks for any tip.

Regards
Martin

ipazin · June 27, 2019, 10:08am

Hi Martin,

interesting idea. I assume you have timestamp with every record?

You definitely need a cut off timestamp in each workflow run. That would be a timestamp from database before your query. Further on idea is to only load data from last hour first time and any other loading is only data from last x minutes. That data will be added to already loaded data. Now you have to filter out all records where: cut off timestamp (the new one) - record timestamp > 60 minutes

Seems a bit tricky to me but doable.

Br,
Ivan

system · December 26, 2019, 10:08pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.

tobias.koetter · April 8, 2020, 8:50am

Hi mabm,
I would store the maximum timestamp from the database records that where loaded in the current workflow execution. Up on the next execution you use a sql query that returns all records that have a timestamp greater then the stored max time stamp from the last time of execution. This ensures two things:

You do not load any duplicates since you have the fixed timestamp. Even better would be a db column with a sequence number since depending on the speed the records get inserted and the granularity of the timestamp you still might get duplicates.
If the workflow didn’t execute for what ever reason up on the next execution it will fetch all rows that haven’t been processed so far.

The max timestamp or sequence number could be stored in the workflow using the KNIME protocol and used in the DB Row Filter node via a Flow Variable. Let us know if you need an example for this.
Bye
Tobias