Only update new files

Hello

I have a workflow that starts with a “List Files” node that reads a folder with over 1600 .xml files to create a table. When I first built the workflow this was not a problem, because I only had a few XML files.

As time went on, the number of files grew, thus making the workflow a lot slower to complete.

I was hoping that someone could share a way to only process the “new xml files” that are added so that the workflow could be a lot faster. I have no idea if what I am asking is possible but I decided to ask anyways.

Thank you in advance

It’s almost certainly possible, but how to do it depends on how you will know that the files are ‘new’. Are you looking simply at a file creation date-based approach, or is there some other way of looking up which files are already processed (e.g. from a database, a list of output files, etc.)?

Also, how are you reading them? If you are using a Table Row To Variable Loop Start -> XML Reader combination after your List Files node, then almost certainly it will be much quicker to use the Load Text-Based Files node (from the Vernalis community contribution; see https://hub.knime.com/Vernalis/extensions/com.vernalis.knime.feature/latest/com.vernalis.nodes.io.txt.LoadTxtNodeFactory or https://nodepit.com/node/com.vernalis.nodes.io.txt.LoadTxtNodeFactory) followed by a String to XML node.

Steve

Thank you Steve. I will check it out!

I can think of two ways

  • Check the date and time of the XML files and only use the ones with a date greater than your last run of the workflow
  • Create a table that stores all the file names you have imported and only import the new ones
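
The first option (date-based filtering) can be sketched in plain Python, outside KNIME. This is only an illustration under stated assumptions: the function name `files_newer_than`, the `*.xml` pattern, and the use of the file modification time as the "date" are all choices made for the example, not something taken from the workflow in question.

```python
import os
import tempfile
import time
from pathlib import Path


def files_newer_than(folder, last_run_epoch, pattern="*.xml"):
    """Return files matching `pattern` whose modification time is after `last_run_epoch`.

    `last_run_epoch` is seconds since the Unix epoch (hypothetical: wherever
    the time of the previous run was stored).
    """
    return sorted(
        p for p in Path(folder).glob(pattern)
        if p.stat().st_mtime > last_run_epoch
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old_file = Path(tmp) / "old.xml"
        new_file = Path(tmp) / "new.xml"
        old_file.write_text("<a/>")
        new_file.write_text("<b/>")

        # Pretend the last run happened one hour ago and that old.xml
        # was last modified shortly before that run.
        last_run = time.time() - 3600
        os.utime(old_file, (last_run - 60, last_run - 60))

        print([p.name for p in files_newer_than(tmp, last_run)])
```

Note that a file copied into the folder with an old modification date would be missed by this kind of check, which is one reason to consider the second option.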
Hi @stevens_albert and welcome back to the KNIME community forum,

Regarding @mlauber71’s second suggestion, you can export the output of the List Files node to a file and read it back each time you run the workflow. Then use the Reference Row Filter node to exclude the files that are already listed in the exported file.
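
For anyone who wants to see the logic of that approach spelled out, here is a rough plain-Python sketch of the same idea (keep a list of already processed file names, then exclude them on the next run). The names `select_unprocessed`, `mark_processed`, and `processed_files.txt` are hypothetical and chosen for the example; in KNIME the equivalent steps would be done with nodes rather than code.

```python
import tempfile
from pathlib import Path


def select_unprocessed(folder, manifest_path, pattern="*.xml"):
    """Return names of files in `folder` that are not yet listed in the manifest."""
    manifest = Path(manifest_path)
    if manifest.exists():
        seen = set(manifest.read_text(encoding="utf-8").splitlines())
    else:
        seen = set()  # first run: nothing has been processed yet
    current = sorted(p.name for p in Path(folder).glob(pattern))
    return [name for name in current if name not in seen]


def mark_processed(manifest_path, names):
    """Append the given file names to the manifest, one per line."""
    with open(manifest_path, "a", encoding="utf-8") as fh:
        for name in names:
            fh.write(name + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir, tempfile.TemporaryDirectory() as state_dir:
        manifest = Path(state_dir) / "processed_files.txt"

        (Path(data_dir) / "a.xml").write_text("<a/>")
        first = select_unprocessed(data_dir, manifest)
        mark_processed(manifest, first)

        (Path(data_dir) / "b.xml").write_text("<b/>")
        second = select_unprocessed(data_dir, manifest)
        print(first, second)
```

The manifest is only updated after the files have been selected, so a file is never marked as processed before it has been picked up at least once.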

Thank you very much for your reply. I will try this and let you know!

How can I get the date of my last run?

Thanks

Thank you! I will try that and let you know!

You could use the:

A few more examples of how to handle date and time variables. You might convert your time of execution to a number like:
201910041753

and store it with your data, then later simply use a Rule Engine node to filter out cases with older timestamps.
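
To make the number format concrete: 201910041753 reads as year, month, day, hour, minute (2019-10-04 17:53). Below is a small plain-Python sketch of producing such a number and filtering on it. The helper names `run_stamp` and `rows_after` and the sample rows are made up for the example; in the workflow itself the comparison would be done by the Rule Engine node.

```python
from datetime import datetime


def run_stamp(moment):
    """Encode a datetime as an integer of the form yyyyMMddHHmm."""
    return int(moment.strftime("%Y%m%d%H%M"))


def rows_after(rows, last_stamp):
    """Keep only rows whose stored stamp is newer than `last_stamp`."""
    return [row for row in rows if row["stamp"] > last_stamp]


if __name__ == "__main__":
    # Hypothetical table: each row carries the stamp it was stored with.
    rows = [
        {"file": "a.xml", "stamp": run_stamp(datetime(2019, 10, 4, 17, 53))},
        {"file": "b.xml", "stamp": run_stamp(datetime(2019, 10, 5, 9, 0))},
    ]
    last_run = 201910041753
    print([row["file"] for row in rows_after(rows, last_run)])
```

Because every field is zero-padded to a fixed width, comparing these numbers gives the same result as comparing the underlying dates, which is what makes a simple "greater than" rule work.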
