KNIME continuous reading of data in HDFS

Hi Team,

I am looking for a solution to meet the following requirements:

  1. Data arrives in an HDFS folder in near real time. A new folder is created each day, i.e., one folder per day.
  2. The KNIME workflow has to continuously listen to this folder (the current day’s folder) and process the data in downstream nodes.
  3. The workflow is continuous and never ending. How can I make the workflow never ending?
  4. What happens if the workflow fails? How should checkpoints be handled? Files that have already been read should not be re-processed.

Could someone from the KNIME community help with this?

Thanks
Sudheer

This sounds like a job for the KNIME Server, with the workflow scheduled to run every few minutes, depending on how long the processing takes. You could create a variable with the current and previous day, list the current files in HDFS, and then process them.

You could then either store the names of the already processed files in a table, or check against the target (table/database) whether a file has been processed, and skip those files. I have implemented such processes with the Reference Row Filter, excluding files that are already done.

The process would ‘self-heal’ if it crashes: the next time it starts, it would pick up the outstanding jobs.
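
To make the checkpoint idea concrete, here is a minimal sketch in plain Python of the logic described above. It is not KNIME code and not a KNIME API; in a workflow the same steps would be done with nodes (for example a file listing plus the Reference Row Filter). The function names, the `base/YYYY-MM-DD` folder layout and the text-file ledger are assumptions made only for illustration.

```python
import datetime as dt
from pathlib import Path
from typing import Callable


def day_folders(base: str, today: dt.date) -> list[str]:
    """Folders for the previous and current day (assumed layout: base/YYYY-MM-DD)."""
    yesterday = today - dt.timedelta(days=1)
    return [f"{base}/{d.isoformat()}" for d in (yesterday, today)]


def load_processed(ledger: Path) -> set[str]:
    """Read the names of already processed files from the ledger, if it exists."""
    if not ledger.exists():
        return set()
    return {line.strip() for line in ledger.read_text().splitlines() if line.strip()}


def pending_files(listed: list[str], processed: set[str]) -> list[str]:
    """Keep only files that are not yet in the ledger (the 'reference filter' step)."""
    return [name for name in listed if name not in processed]


def run_once(listed: list[str], ledger: Path, process: Callable[[str], None]) -> list[str]:
    """Process every pending file and record each one only after it succeeded."""
    handled = []
    for name in pending_files(listed, load_processed(ledger)):
        process(name)  # if this raises, the file is NOT recorded and is retried next run
        with ledger.open("a") as fh:
            fh.write(name + "\n")
        handled.append(name)
    return handled
```

Because a file name is written to the ledger only after `process` returns, a crash in the middle of a run leaves the unfinished files unrecorded, and the next scheduled run picks them up again. Including the previous day's folder covers files that arrived shortly before midnight.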

Another idea could be to use external tables in Hive, for example. I am not 100% sure how they would handle newly arriving data, so you would have to try it on your own system. You could check out these examples:

Hi @mlauber71 ,

Thanks for the pointers. I agree with checkpointing to a separate table, and it solves part of the problem.
Also, I should have elaborated on my requirement a bit more.
The workflow should be triggered every 5 minutes. As soon as it reaches the last node, it should wait 5 minutes and then automatically start again from the beginning, assuming the workflow execution completes in under 5 minutes.
What is the way to do this? This is kind of an infinite loop.

I think you could try to use a loop where the end condition is never met and combine it with a wait node.

I think you will have to try out several settings. You should also consider the environment, i.e. whether this runs just on a local machine or on a server.
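
As a rough illustration of the "loop whose end condition is never met, plus a wait" idea, here is a small Python sketch. It only shows the control flow and is not how KNIME implements loops; `run_forever`, `job`, `wait_seconds` and `max_cycles` are hypothetical names, and `max_cycles` exists only so the loop can be stopped when trying it out.

```python
import time
from typing import Callable, Optional


def run_forever(
    job: Callable[[], None],
    wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: Optional[int] = None,
) -> int:
    """Run job, wait, and repeat. With max_cycles=None the loop never ends."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        try:
            job()
        except Exception as exc:
            # A failed cycle is reported but does not stop the loop.
            print(f"cycle {cycles} failed: {exc!r}")
        cycles += 1
        sleep(wait_seconds)  # the wait starts after the job has finished
    return cycles
```

Calling `run_forever(process_new_files)` with some `process_new_files` function of your own would run it, wait 5 minutes, and start over. Since the wait begins only once the job has finished, a run that takes longer than expected simply delays the next cycle instead of overlapping with it.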
