In KNIME, create a workflow that filters and joins only the last 3 months of Table A and the last 10 days of Table B, removes overlapping records from the historical table, appends the new data, and overwrites the historical table

I have a table that is the result of an SQL query (a join between Table A and Table B).
I want to keep a full history of this table, but only update:

  • the last 3 months of Table A
  • and the last 10 days of Table B
    each time I run the workflow.

So far, I’ve written the full historical table once and created a separate workflow for updates. In that workflow, I apply WHERE clauses to filter the two tables and write a temporary table.
But I’m not sure what to do next to update the historical table properly.
Could you please help me?
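For what it's worth, the update logic being described (drop the historical rows that fall inside the refresh window, then append the freshly extracted rows) can be sketched in plain Python. The column names and the 3-month window below are assumptions based on the description above, not the actual schema:

```python
from datetime import date, timedelta

# Assumed refresh window: the question mentions "the last 3 months",
# approximated here as 90 days before a fixed reference date.
TODAY = date(2023, 7, 15)
WINDOW_START = TODAY - timedelta(days=90)

def refresh_history(history, fresh_rows, window_start):
    """Keep only historical rows outside the refresh window,
    then append the freshly extracted rows."""
    kept = [r for r in history if r["sale_date"] < window_start]
    return kept + fresh_rows

# Hypothetical sample rows
history = [
    {"sale_date": date(2023, 1, 5), "amount": 10.0},   # outside window: kept
    {"sale_date": date(2023, 7, 1), "amount": 99.0},   # inside window: replaced
]
fresh = [
    {"sale_date": date(2023, 7, 1), "amount": 100.0},  # corrected figure
    {"sale_date": date(2023, 7, 2), "amount": 50.0},
]

updated = refresh_history(history, fresh, WINDOW_START)
```

In KNIME terms this would roughly correspond to a Row Filter (or Reference Row Filter) on the historical table, a Concatenate with the new extract, and a writer node that overwrites the target table.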

Hi @doha_El_khattab,

if you could share some sample data, that would greatly help in providing any level of support. In case it contains sensitive data, you can easily anonymize it.

Best
Mike


Here is the complete SQL query that creates this table.
Thank you in advance!
SELECT
    COALESCE(a.col1, b.col1) AS col1,
    COALESCE(a.col2, b.col2) AS col2,
    COALESCE(a.col3, b.col3) AS col3,

    SUM(COALESCE(a.metric1, 0)) AS metric1,
    SUM(COALESCE(a.metric2, 0)) AS metric2,

    SUM(COALESCE(b.metric3, 0)) AS metric3,
    SUM(COALESCE(b.metric4, 0)) AS metric4,
    MAX(a.date1) AS date1,
    MAX(a.date2) AS date2
FROM (
    SELECT
        col1, col2, col3, …
        SUM(metric1) AS metric1,
        SUM(metric2) AS metric2,
        MAX(date1) AS date1
    FROM schema.table_a
    GROUP BY col1, col2, col3, …
) a
FULL JOIN (
    SELECT
        col1, col2, col3, …
        SUM(metric3) AS metric3,
        SUM(metric4) AS metric4,
        MAX(date3) AS date2
    FROM schema.table_b
    GROUP BY col1, col2, col3, …
) b
ON
    a.col1 = b.col1 AND
    a.col2 = b.col2 AND
    …
GROUP BY
    COALESCE(a.col1, b.col1),
    COALESCE(a.col2, b.col2),
    …
I’ve created the historical table with partitions on year and month; I think I should add day as well…

@doha_El_khattab the question is which SQL database you are using. Maybe you can provide some sample data and demonstrate what you already have.

Concerning updates you can take a look at these examples:


Thank you so much for your help, I will look at the documents that you sent me.
I use an Impala database.
The sample data:
- Table A (Fraud & Unpaid Transactions)
- Table B (Authorization Metrics)
The output looks like this:

sale_date issuing_bank acquiring_bank issuing_payment_method acquiring_payment_method number_of_fraud_cases amount_of_fraud number_of_unpaid_transactions amount_of_unpaid_transactions fraud_extraction_date unpaid_extraction_date number_of_authorizations amount_of_authorizations authorization_extraction_date
2023-07-01 BankA BankX Visa Visa 2.0 100.0 1.0 30.0 2023-07-10 2023-07-10 10.0 500.0 2023-07-10
2023-07-02 BankB BankY MasterCard MasterCard 1.0 50.0 0.0 0.0 2023-07-11 2023-07-11 0.0 0.0 None
2023-07-03 BankC BankZ Amex Amex 0.0 0.0 0.0 0.0 None None 5.0 250.0 2023-07-12

So we are into Big Data territory. Typically you will not be able to just update data in a Hive/Impala table, but that very much depends on the underlying storage format: CSV, Parquet, ORC, Kudu, or something else. With Kudu you might be able to perform a traditional UPDATE.

Otherwise you will have to come up with a strategy to append the data (best to use partitions) and then either delete the older data or use a query with a WHERE clause (often using window functions) to retrieve only the latest data you want to keep.
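As a rough illustration of the partition-based approach on Impala: INSERT OVERWRITE with dynamic partitioning replaces only the partitions that receive rows from the SELECT, leaving older partitions untouched. Table and column names here are assumptions based on this thread, not your actual schema:

```sql
-- Sketch, assuming the historical table is partitioned by year/month/day
-- and the workflow has already written the refresh window to a temp table.
INSERT OVERWRITE TABLE schema.history
PARTITION (year, month, day)
SELECT
    col1, col2, col3, …
    metric1, metric2, metric3, metric4,
    YEAR(sale_date)  AS year,
    MONTH(sale_date) AS month,
    DAY(sale_date)   AS day
FROM schema.temp_update          -- temporary table from the update workflow
WHERE sale_date >= NOW() - INTERVAL 3 MONTHS;
```

Only the year/month/day partitions that appear in the SELECT result get rewritten, which is why partitioning by day (as mentioned above) makes the overwrite window more precise.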

Maybe it would be best to develop a minimal sample or adapt one from the hub to demonstrate what you want to do. You can create a local big data environment so you can see if your code will work.

This is a demonstration of how, for example, a reporting system can work on a big data cluster:


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.