I am trying to process a very large database of 300,000 BLOB files. The query that pulls from the Oracle database takes too long to bring the whole table in at once, so I've been trying to build a chunk-processing workflow that handles around 1,000 files at a time: converting each BLOB to PDF, then to text, and uploading the text as a CLOB back into a new table in the database. Even a single batch of 1,000 files takes around 1-2 hours. By building this chunk processor I hope it can run continuously until completion.

The main issue I am struggling with right now is updating my DB Query Reader node with variables to control the batch size. One option is integer flow variables that define the ROWNUM range for each batch; for instance, each loop iteration would change the SQL from "WHERE rownum <= 1000" to "WHERE rownum <= 2000 AND rownum > 1000", increasing by 1,000 per iteration. Another method I was considering is to control the batches by time and process files based on their creation date; for instance, a Date&Time node could create variables that advance month by month, processing all the files created within each monthly period.
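For the month-by-month idea, the loop essentially needs a pair of date flow variables (a window start and the start of the next month) per iteration. A minimal Python sketch of generating those windows, assuming a 2019-2023 range and hypothetical variable names `batch_start`/`batch_end`:

```python
from datetime import date

def month_ranges(start: date, end: date):
    """Yield (first_day, first_day_of_next_month) pairs covering start..end.

    Each pair could feed two flow variables (e.g. batch_start / batch_end,
    hypothetical names) that the loop injects into the DB Query Reader.
    """
    current = date(start.year, start.month, 1)
    while current <= end:
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        yield current, nxt
        current = nxt

# 2019-2023 yields 60 monthly windows.
windows = list(month_ranges(date(2019, 1, 1), date(2023, 12, 31)))
```

Using a half-open window (`created >= batch_start AND created < batch_end`) avoids double-counting files created exactly at a month boundary.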
I just need help understanding how to get these flow variables into the DB Query Reader so that I can control how many files are queried, because pulling all 300,000 files from 2019-2023 takes far too long to even bring into the table.
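One caveat with the ROWNUM approach: in Oracle, ROWNUM is assigned as rows pass the filter, so a bare `WHERE rownum > 1000` never matches anything. On Oracle 12c and later, `OFFSET ... FETCH` with a stable `ORDER BY` is the reliable way to page. A sketch of the query each iteration would build (table and column names are placeholders; in KNIME the offset would come from an integer flow variable substituted into the statement):

```python
def batch_query(batch_index: int, batch_size: int = 1000) -> str:
    """Build the paging SQL for one loop iteration.

    A plain "WHERE rownum > 1000" returns no rows in Oracle, because ROWNUM
    is assigned after filtering; OFFSET/FETCH (Oracle 12c+) plus a stable
    ORDER BY pages reliably. blob_table / file_id are placeholder names.
    """
    offset = batch_index * batch_size
    return (
        "SELECT file_id, file_blob FROM blob_table "
        "ORDER BY file_id "
        f"OFFSET {offset} ROWS FETCH NEXT {batch_size} ROWS ONLY"
    )
```

Inside a KNIME loop, the same effect should be achievable by exposing `batch_index * batch_size` as an integer flow variable and referencing it in the DB Query Reader's SQL.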
@terbe one idea indeed could be to do this in chunks and control that via flow variables, though the task does sound large. Could you identify what the bottleneck might be: the network speed, or the processing on your machine? One option could be parallelization, if the database connection allows multiple concurrent accesses.
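To make the parallelization idea concrete, here is a minimal sketch of fetching several non-overlapping batches concurrently. `fetch_batch` is a stand-in for the real database round trip (e.g. via JDBC or python-oracledb); here it just returns the row range it would request:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_batch(offset: int, size: int = 1000):
    """Placeholder for the real DB round trip; returns the requested range."""
    return list(range(offset, offset + size))

# Four workers pulling distinct, non-overlapping batches concurrently.
# Only safe if the database permits multiple sessions at once.
offsets = [0, 1000, 2000, 3000]
with ThreadPoolExecutor(max_workers=4) as pool:
    batches = list(pool.map(fetch_batch, offsets))
```

Because each worker requests a disjoint offset range, the results can simply be concatenated afterwards without deduplication.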
You should also think about a mechanism to store which entries you have already processed, so you can resume from a known point if the task fails.
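One simple way to track progress (a status table in the database would work just as well) is a small checkpoint file of completed batch indices. A sketch, with the file name chosen arbitrarily for illustration:

```python
import json
from pathlib import Path

CHECKPOINT = Path("processed_batches.json")  # hypothetical location

def load_done() -> set:
    """Return the set of batch indices already completed."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def mark_done(batch_index: int) -> None:
    """Record one finished batch so a restart can skip it."""
    done = load_done()
    done.add(batch_index)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

# On restart, skip anything already recorded:
# for i in range(total_batches):
#     if i in load_done():
#         continue
#     process(i)   # hypothetical per-batch work
#     mark_done(i)
```

Writing the checkpoint only after a batch has been fully uploaded back to the database keeps a crash from marking half-finished work as done.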