Processing hundreds of millions of records

A few remarks since it is difficult to judge from the screenshots:

  • if you have to process that many files and lines, it could make sense to work in chunks and have some method in place to log the current status so that you can restart at a certain point, e.g. write results to disk after every 50 of the 900 processed files or so. If you do not have a very powerful infrastructure, compartmentalization might be the way to go
  • saving these intermediate steps can also serve as a kind of progress bar: write the current status, and maybe some timing statistics, to a separate CSV file. Such tasks get easier if you have an idea of how long they might take (or can give an estimate to someone who is waiting for results)
  • make sure the power you have is used in an optimal way (as Iris said): avoid unnecessary steps and keep the steps manageable for the resources you have. Streaming might help; you will have to experiment with the size of the chunks
  • see if all the nodes can run in memory; that might speed things up
  • see if you can assign more RAM and maybe a higher number of threads working in parallel
  • if the formats might change between files, it could be good to import everything as strings first (and convert the numbers later)
  • accessing a large number of files (of the same structure) simultaneously also sounds like a job for a Big Data environment (e.g. an external table in Hive or Impala). If you really must, you could think about setting up such an environment on an AWS cluster or something similar
  • also make sure you follow the advice on tuning your KNIME environment for performance (https://www.knime.com/blog/optimizing-knime-workflows-for-performance, Large data tables missing?)
  • if your graphical KNIME interface gets stuck, you might think about running KNIME in batch mode (combined with the logging and the ability to restart the process, see above)
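
To make the chunking, status log and "strings first" ideas a bit more concrete, here is a rough sketch in plain Python. It is not KNIME-specific and not a KNIME API; in a workflow the same logic would live in a loop over the file list. All names (process_in_chunks, read_as_strings, status.csv, result_00000.csv) and the default chunk size of 50 are assumptions made up for illustration.

```python
import csv
import time
from pathlib import Path


def process_in_chunks(files, process_file, out_dir, chunk_size=50):
    """Process files chunk by chunk; log each finished chunk so a rerun can resume.

    `process_file` is a hypothetical callback returning a list of rows per file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    status_path = out_dir / "status.csv"

    # Collect the chunk numbers that an earlier run already completed.
    done = set()
    if status_path.exists():
        with status_path.open(newline="") as fh:
            done = {int(row["chunk"]) for row in csv.DictReader(fh)}

    chunks = [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]
    for index, chunk in enumerate(chunks):
        if index in done:
            continue  # finished in a previous run, skip it
        start = time.time()
        rows = []
        for path in chunk:
            rows.extend(process_file(path))

        # Write the chunk result first, then record the chunk as done.
        with (out_dir / f"result_{index:05d}.csv").open("w", newline="") as fh:
            csv.writer(fh).writerows(rows)

        is_new = not status_path.exists()
        with status_path.open("a", newline="") as fh:
            writer = csv.writer(fh)
            if is_new:
                writer.writerow(["chunk", "files", "rows", "seconds"])
            seconds = round(time.time() - start, 3)
            writer.writerow([index, len(chunk), len(rows), seconds])
    return len(chunks)


def read_as_strings(path):
    """Read a CSV file and keep every field as a string (no type conversion)."""
    with open(path, newline="") as fh:
        return list(csv.reader(fh))
```

Calling process_in_chunks(files, read_as_strings, "out") a second time after a crash skips every chunk already listed in status.csv, and the seconds column gives a rough basis for estimating the remaining time. Converting the string columns to numbers would then be a separate, later step.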