I have a table that is the result of an SQL query (a join between Table A and Table B).
I want to keep a full history of this table, but only update:
the last 3 months of Table A
and the last 10 days of Table B
each time I run the workflow.
So far, I’ve written the full historical table once, and created a new workflow for updates. In that workflow, I apply WHERE clauses to filter the two tables and write a temporary table.
But I’m not sure what to do next to update the historical table properly.
Could you please help me?
If you could share some sample data, that would greatly help in providing any level of support. In case it contains sensitive data, you can easily anonymize it.
Here is the complete SQL query that creates this table. Thank you in advance!
SELECT
    COALESCE(a.col1, b.col1) AS col1,
    COALESCE(a.col2, b.col2) AS col2,
    COALESCE(a.col3, b.col3) AS col3,
    …
    SUM(COALESCE(a.metric1, 0)) AS metric1,
    SUM(COALESCE(a.metric2, 0)) AS metric2,
    …
    SUM(COALESCE(b.metric3, 0)) AS metric3,
    SUM(COALESCE(b.metric4, 0)) AS metric4,
    MAX(a.date1) AS date1,
    MAX(a.date2) AS date2
FROM (
    SELECT
        col1, col2, col3, …
        SUM(metric1) AS metric1,
        SUM(metric2) AS metric2,
        …
        MAX(date1) AS date1
    FROM schema.table_a
    GROUP BY col1, col2, col3, …
) a
FULL JOIN (
    SELECT
        col1, col2, col3, …
        SUM(metric3) AS metric3,
        SUM(metric4) AS metric4,
        MAX(date3) AS date2
    FROM schema.table_b
    GROUP BY col1, col2, col3, …
) b
ON
    a.col1 = b.col1 AND
    a.col2 = b.col2 AND
    …
GROUP BY
    COALESCE(a.col1, b.col1),
    COALESCE(a.col2, b.col2),
    …
Thank you so much for your help, I will look at the documents that you sent me.
I use an Impala database.
The sample data:
- Table A (Fraud & Unpaid Transactions)
- Table B (Authorization Metrics)
The output looks like this:
So we are in Big Data territory. Typically you will not be able to simply update data in a Hive/Impala table, but that very much depends on the underlying storage format: CSV, Parquet, ORC, Kudu, or something else. With Kudu you may be able to perform traditional updates.
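If the historical table does happen to be Kudu-backed, the refresh can be expressed directly as an upsert from the temporary table you already build with the WHERE-filtered join. A minimal sketch; `history_table`, `tmp_recent`, and the column list are placeholders for your actual schema:

```sql
-- Hypothetical names: history_table is the Kudu-backed historical table,
-- tmp_recent holds the freshly computed last-3-months / last-10-days join.
-- UPSERT inserts new rows and overwrites existing rows that share the
-- table's primary key, so re-running the workflow is safe.
UPSERT INTO history_table
SELECT col1, col2, col3,
       metric1, metric2, metric3, metric4,
       date1, date2
FROM tmp_recent;
```

This only works on Kudu tables, since they are the Impala storage engine with a primary key and row-level updates.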
Otherwise you will have to come up with a strategy to append the data (best using partitions) and then either delete the older data or use a query with a WHERE clause (often involving window functions) to retrieve only the latest version of the data you want.
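For the file-based formats (Parquet, ORC, CSV) the usual pattern is to partition the historical table by a date column and overwrite only the recent partitions on each run. A minimal sketch, assuming a partition column `part_date` and placeholder names `history_table` / `tmp_recent` for your historical and temporary tables:

```sql
-- Assumes history_table is partitioned by part_date (e.g. derived from date1).
-- With dynamic partitioning, only the partitions that actually appear in
-- tmp_recent are replaced; all older partitions are left untouched.
INSERT OVERWRITE history_table PARTITION (part_date)
SELECT col1, col2, col3,
       metric1, metric2, metric3, metric4,
       date1, date2,
       part_date           -- partition column must come last in the SELECT
FROM tmp_recent;
```

Alternatively, if you append instead of overwriting, you can keep duplicates in the table and pick the newest version at read time, e.g. with `ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY load_time DESC)` filtered to row number 1 (here `load_time` is a hypothetical column recording when each row was written).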
Maybe it would be best to develop a minimal sample, or adapt one from the hub, to demonstrate what you want to do. You can create a local big data environment so you can check whether your code will work.
This is a demonstration of how, for example, a reporting system can work on a big data cluster: