Running workflow to only append with new data

Hi All,

I am trying to find a way to use my current workflow and have it only update with new data when run, while still keeping the old data. The data set is going to become very large, so I would like to do this to cut down on run time and ensure we keep all previous information that has loaded. Please let me know how to do this without having to rerun the original data pull.

Thanks,
Molly

Hi @MollyR , the only way to do this would be if you have an incremental unique identifier, like an incremental ID or a timestamp, so you can check what the last record in your “old data” is, and retrieve anything beyond that point from your data source.

Can you share some sample data? If not possible because of data sensitivity, can you at least share the structure of the data (column names and types)?

Thank you for replying so quickly. The data is sensitive, so unfortunately I cannot share it, but here are the column names. I do have a date column, so if you could continue with that train of thought it would be great. I hid the columns that either have the same value repeated all the way down or are sensitive.

Additionally, the path will be unique for each row of data too, if that is an easier way forward.

Hi @MollyR , no problem, although it makes it harder when we can’t see the data.

Is your “Work Order Number” an incremental number? For example, if you currently have 12345 as your latest Work Order Number, will the next order be greater than 12345?

And what about the Date of… and Time of…, are they the date and time of the Work Order?

Hi @MollyR , while path is unique, what’s really needed here is that the value is incremental. Uniqueness is not really required. It’s nice to have, but it’s the incremental part that’s really critical. That’s what will allow you to determine from what point to retrieve the data.

@bruno29a the work order is not always incremental. The time is repetitive because the data is being pulled from multiple places at nearly the same time. I think date would be our best way forward; it is the date of inspection. There are many rows with the same date because more than one row of data is inspected and recorded on a single day. Our date includes year, month, and day.

Hi @MollyR , yes it’s bound to happen with date where you will have multiple rows with the same date (or timestamp).

With incremental values, when you have uniqueness, you can easily do > (greater than) your last value. When you do NOT have uniqueness, you have to do >= (greater or equal to) your last value, because new data with the same date could have come in after your last ingestion.

So, this is what I would suggest:

  1. Get MAX() of the date and time of your “old data” to get the latest date and time
  2. Retrieve records from your data source where the date and time are >= the MAX date and time from #1 (compare the combined date+time, since a later date can come with an earlier time)
  3. You will have some dupes because of the >=, so apply some dedupe, but only for the rows at the MAX date and time value - duplicates are possible only in that range

Try this and see how it goes.
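If it helps to see the idea outside of KNIME, here is a minimal Python sketch of that logic (the row structure and the key() function are placeholder assumptions; in the workflow these steps map to nodes, not code):

# Sketch only: old_rows/source_rows are lists of dicts, key() returns the
# incremental value (e.g. the date string) for a row.
def get_delta(old_rows, source_rows, key):
    last = max(key(r) for r in old_rows)                      # step 1: MAX of the old data
    candidates = [r for r in source_rows if key(r) >= last]   # step 2: >= because of ties
    seen = {tuple(sorted(r.items())) for r in old_rows if key(r) == last}
    delta = []
    for r in candidates:                                      # step 3: dedupe only at the boundary value
        fingerprint = tuple(sorted(r.items()))
        if key(r) == last and fingerprint in seen:
            continue
        delta.append(r)
    return old_rows + delta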

@bruno29a To get the MAX() of the date, would I use a string manipulation? I am relatively new to KNIME.

Hi @MollyR , there are a few ways to do this. Can you show me an example of date and time that you have so I can see the format that you have?

Date: 20210913 (year month day)
Time: 010906 (hour minute second)

@MollyR you could use a database, like a local H2 DB, and a primary key to identify cases you have already inserted. The question then would be what would constitute such an identifier.

The Reference Row Filter could also accomplish that, if you define what constitutes a row you want to exclude, should you do that in KNIME.

It might be best if you come up with sample data and construct a workflow that demonstrates the whole of your challenge without giving away secret information.


Hi @MollyR , sorry for the delayed reply in between, a bit busy today.

Based on the format you gave, the dates and times are stored as strings then. If your datasource has these columns as type date and type time, then I would suggest converting your data to date and time. However, I have a strong feeling that they are stored the same way in your datasource, so that’s the assumption I’m making for my solution.

A MAX() on a string column is based on alphabetical ordering, and with the formats you provided (zero-padded year-month-day and hour-minute-second), alphabetical order matches chronological order, so this is perfect.
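For example, plain string comparison already gives the right answer on these zero-padded formats (a quick Python illustration, not part of the workflow):

dates = ["20210815", "20210913", "20201231"]
print(max(dates))                      # 20210913 -> also the chronologically latest date
print(sorted(dates, reverse=True)[0])  # same result via "sort descending, take the first row"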

For the sake of my solution, this is what I have as “old data”:
[screenshot: the sample old data table]

And my datasource is like this:
[screenshot: the sample datasource table]

One way to get the MAX of your string date and string time columns is to sort descending on these 2 columns and then take the first row. To do this, I first create a “key” column, which is the concatenation of the Date and Time columns, called COMBINED_DATETIME. I then column-filter down to only this key column, to save memory and make the sorting faster. I sort descending and keep only the first row. I then rename the column to Max_DateTime before converting it to a variable, so that it creates a variable with the name Max_DateTime:
[screenshot: node chain that builds the Max_DateTime variable]

And here’s my variable:
[screenshot: the Max_DateTime flow variable]
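In case it helps to see it spelled out, the node chain above does roughly the equivalent of this (the column names and values here are just illustrative):

rows = [
    {"Date of": "20210912", "Time of": "235959"},
    {"Date of": "20210913", "Time of": "010906"},
]
keys = [r["Date of"] + r["Time of"] for r in rows]   # the COMBINED_DATETIME key column
Max_DateTime = sorted(keys, reverse=True)[0]         # sort descending, keep the first row
print(Max_DateTime)                                  # "20210913010906" -> exposed as the flow variable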

Now, I can retrieve my data based on that variable:

SELECT "Path", "Program Name", "Date of", "Time of", "Work Order Number"
     , CONCAT("Date of", "Time of") AS Combined_DateTime
FROM "PUBLIC"."datasource"
WHERE CONCAT("Date of", "Time of") >= $${SMax_DateTime}$$

And this will give me these records:
[screenshot: records returned by the query]

If you look at the first screenshot, where we have the “old data”, we currently have 6 records, and we’re getting 6 new records from our query; adding these 2 sets together would give us 12 records. However, as the second screenshot of the datasource shows, there should be only 11 records. This is expected because, as I explained, there can be duplicates for the exact timestamp that is our starting point of re-ingesting.

We just need to tackle those duplicates.

To do this, I separate the data into 3 parts:

  1. Data < timestamp of starting point
  2. Data = timestamp of starting point (where the duplicates will happen, if any)
  3. Data > timestamp of starting point

I just need to dedupe dataset #2. Once it’s deduped, I just concatenate all 3 datasets.
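As a rough sketch of that split-and-dedupe step (again in Python with placeholder rows; in the workflow this is all done with nodes, the code is only to show the idea):

# start_key is the MAX combined date+time of the old data; key() builds that
# combined value for a row. Only part 2 can contain duplicates.
def merge_without_dupes(rows, start_key, key):
    before = [r for r in rows if key(r) < start_key]    # part 1
    at     = [r for r in rows if key(r) == start_key]   # part 2: dedupe this one
    after  = [r for r in rows if key(r) > start_key]    # part 3
    deduped, seen = [], set()
    for r in at:
        fingerprint = tuple(sorted(r.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            deduped.append(r)
    return before + deduped + after                     # concatenate all 3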

We put all these together, and we get this:

Note: In my example, I’m creating a datasource in an H2 virtual (in-memory) instance. In your case, it can be a different datasource.

Final results:
[screenshot: final results]

Here’s the workflow: Get delta data based on date.knwf (42.4 KB)

EDIT: You can see in the results screenshot, based on the Row ID column, that the first 6 rows (Row0 to Row5) are from the original data, and the rest of the Row IDs, the ones ending in *_dup, are the additions. We effectively added only the new records instead of reprocessing the whole table.

@bruno29a So I definitely did not do a good job of clarifying, sorry about that. The column names that I sent you are the end product after manipulating the data. The initial data is a file path with the format C:\Department\Equipment\Location\Inspection Type\File Type\Program name. Then I run it through a variable loop to get the full data with a much longer file path that begins as just mentioned but continues to include Date, Time, Work Order, Work Center, etc., to get the column list I sent earlier.

Here is the overall look at my workflow.

And here is the variable loop.

That’s OK @MollyR , no worries, and thank you for clarifying :slight_smile:

So, if I understood correctly, your datasource is the files located in Path, correct? If you are reading from files, I don’t think you can retrieve only the delta (delta is what we call the data you are trying to append). I think you don’t have any choice but to load the whole file, as opposed to reading from a db table where you can filter the data with WHERE conditions before retrieving it.

With a file, you have to load the whole file and then apply row filters.

So, you can apply the same logic I used, but instead of reading from the datasource using SQL, you would have to read your file and then apply a row filter based on the date.
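To show what I mean, here is roughly what that looks like in Python with pandas (the file name and column names are placeholders; in KNIME you would use a reader node followed by a Row Filter):

import pandas as pd

# Placeholder file/column names. The whole file still has to be read,
# but only rows at or after the last loaded date go through the heavy processing.
df = pd.read_csv("inspections.csv", dtype={"Date of": str, "Time of": str})
last_loaded_date = "20210913"                    # MAX date from the old data
delta = df[df["Date of"] >= last_loaded_date]    # keep only the (possibly) new rows
# ... run the variable loop / manipulations on `delta` only ...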

If the row filter is in or after the variable loop, to include the data, won’t the workflow take just as long to load?

You kind of don’t have a choice if you are reading from a file. You can’t query a file, as a file does not run on any engine, unlike a db table where the database system does the querying for you.

Hi @MollyR , so to summarize:
If you are reading from a file, you don’t have much choice but to read the whole file. However, if you are running some manipulation on the data, you can first filter the data based on the date, as we discussed, which should leave you with the delta in its “raw” state.

At this point, the dataset that you need to manipulate should be much smaller. You can then process this smaller dataset instead of reprocessing data that was already processed. That’s where you gain.

Sounds good. Thank you for your help.
