Does KNIME intelligently recognize and process only new information in a CSV file within a workflow, or does it process all rows regardless of their modification status? I’m looking for insights on how KNIME handles incremental loading and processing of data in a CSV file, specifically in scenarios where there is no timestamp column available for tracking modifications. Any suggestions or best practices for achieving efficient processing of only new or modified data in KNIME would be greatly appreciated!
How does KNIME handle processing a CSV file in a workflow? Does it recognize old information and optimize the loading to process only new information?
Hello @bremels, and welcome to the KNIME community.
You can have a look at the following workflow for insights. The case is built with Excel files, but the mechanics should be similar when using the CSV nodes.
Let me know if further clarifications are needed.
I am re-reading your post, and my workflow may not be that helpful after all. If you are working with a file that gets rewritten, that is not the same case as the one in the workflow I shared; however, some of its techniques can be used in the same way (hashing / outer join / compare).
If that is your case (a CSV without a UID column), I would reconsider your data acquisition process.
So it is less about whether KNIME can intelligently recognize differences, and more about the processing you want to apply and how to approach it with KNIME.
Therefore, depending on your data structure and on the creativity of the CSV's users, a valid method to compare tables would be to concatenate the whole row (Column Aggregator) and hash the concatenated text, for both of the CSV snapshots you are comparing. The only hashing algorithm accessible in KNIME that I have in mind is MD5; you can find it in the String Manipulation node. If the complexity of the data requires more robust hashing (SHA-256…), you would probably need to code it in R/Python.
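To make the row-hashing idea concrete, here is a minimal Python sketch of the "concatenate the row, hash it, compare the two snapshots" approach, using SHA-256 from the standard library. It is only an illustration under assumptions: the function names (`row_hashes`, `new_or_changed_rows`) and the sample data are hypothetical, both files are assumed to have a header row and the same column order, and wiring this into a KNIME Python Script node is left out.

```python
import csv
import hashlib
import io


def row_hashes(csv_text, delimiter=","):
    """Return {SHA-256 hex digest: row} for every data row (header skipped)."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    next(reader, None)  # assumption: the first line is a header
    hashes = {}
    for row in reader:
        # Join cells with a separator unlikely to occur in the data, so that
        # ("ab", "c") and ("a", "bc") do not produce the same hash input.
        joined = "\x1f".join(row)
        digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        hashes[digest] = row
    return hashes


def new_or_changed_rows(old_csv_text, new_csv_text):
    """Rows of the new snapshot whose hash is absent from the old snapshot."""
    old = row_hashes(old_csv_text)
    new = row_hashes(new_csv_text)
    return [row for digest, row in new.items() if digest not in old]


# Hypothetical sample snapshots of the same CSV at two points in time.
old_snapshot = "name,city\nAna,Lisbon\nBen,Oslo\n"
new_snapshot = "name,city\nAna,Lisbon\nBen,Bergen\nCara,Rome\n"

changed = new_or_changed_rows(old_snapshot, new_snapshot)
print(changed)
```

Without a UID column, a modified row and a brand-new row look the same to this comparison: both simply have a hash that did not exist before. Exact duplicate rows within one snapshot also collapse to a single hash, so this sketch would need adjusting if duplicates matter in your data.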