This sounds like a job for the KNIME Server, scheduled to run every few minutes depending on how long the processing takes. You could create a variable with the current and previous day, list the current files in HDFS, and then process them.
You could then either store the names of the already processed files in a table, or check against the target (table/database) whether a file has already been processed, and skip it. I have built such processes using a Reference Row Filter to exclude files that are already done.
The process would ‘self-heal’: if it crashes, the next time it starts it would pick up the outstanding jobs.
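To make the checkpoint idea concrete, here is a minimal sketch in plain Python. It assumes the record of processed files is a simple text file (in KNIME the same role would be played by a database table plus a Reference Row Filter node); the function names and checkpoint format are hypothetical.

```python
import os

def load_processed(checkpoint_path):
    """Return the set of file names already processed (empty on first run)."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return {line.strip() for line in f if line.strip()}

def process_new_files(incoming, checkpoint_path, process):
    """Process only files not seen before; record each one on success."""
    done = load_processed(checkpoint_path)
    for name in incoming:
        if name in done:
            continue  # already handled on a previous run -> skip
        process(name)
        # Record immediately, so a crash mid-run leaves the earlier files
        # marked as done; the next run "self-heals" by processing only
        # the remainder.
        with open(checkpoint_path, "a") as f:
            f.write(name + "\n")
```

Because each file is recorded right after it is processed, re-running the whole thing after a crash is safe: already-recorded files are skipped and only the outstanding ones are handled.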
Another idea could be to use external tables in Hive. I am not 100% sure how they would handle new incoming data, though; you would have to try it on your machine. You could check out these examples:
Thanks for the pointers. I agree that checkpointing to a separate table solves part of the problem.
Also, I should have elaborated on my requirement a bit more.
The workflow should be triggered every 5 minutes. As soon as it reaches the last node, it should wait 5 minutes and then automatically start again from the beginning, assuming each execution completes in under 5 minutes.
What is the way to do this? It is essentially an infinite loop.
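Outside of KNIME Server scheduling, one way to sketch this loop is a small driver script: run the workflow to completion, then sleep 5 minutes, then start over. The `max_runs` parameter below exists only so the loop can be tested (pass `None` to loop forever), and the batch command in the comment is an assumption; check your KNIME installation for the exact batch-mode flags.

```python
import subprocess
import time

def run_loop(task, interval_seconds=5 * 60, max_runs=None):
    """Run task, wait interval_seconds after it finishes, repeat."""
    runs = 0
    while max_runs is None or runs < max_runs:
        task()  # blocks until the workflow run reaches its last node
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        time.sleep(interval_seconds)  # wait 5 minutes, then start again

# Hypothetical usage, launching a workflow in KNIME batch mode:
# run_loop(lambda: subprocess.run(
#     ["knime", "-nosplash", "-application",
#      "org.knime.product.KNIME_BATCH_APPLICATION",
#      "-workflowDir=/path/to/workflow"]))
```

Note this waits 5 minutes measured from the *end* of each run, as described above; a cron-style scheduler would instead fire every 5 minutes regardless of when the previous run finished.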