I am trying to build a workflow that reads lots of big xlsx files from a folder. The reading times are not acceptable.
One file can have around 20-25 sheets with about 3000 rows each. So far I have 50 files, and the count grows every day.
I tried with this workflow, without success:
I have to work over VPN, which makes the processing slower than normal.
There would be no problem with that; I could convert to .csv. The files only contain text and numbers.
But I think I need the loop. I am not fully sure that every file contains the same number of sheets.
As far as I know, I am the only KNIME user I can reach within the company, so I would have to ask for it.
As a matter of fact, I would like to try teaching this software, but projects like this usually die fast.
Maybe I can implement this. Thank you!
Downloading improves the speed but it is not enough.
I have to make other imports daily with files that can be bigger than any single sheet within these files. It is the combination of several sheets inside each file that makes the loading painfully slow, I think. The reader has to read 20 sheets per file.
Could I improve the speed by extracting each sheet from every file, converting it to .csv, and then reading every resulting file (which would contain only one sheet)?
I.e., which strategy is faster: reading every sheet inside each file, or reading one sheet per file? If I go with the latter, I would end up with around 20 files per original file.
With the number of files I manage right now, that would be almost 1,000 files with 3,000 rows each.
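In case it helps, here is a minimal sketch of that sheet-to-CSV split in Python, assuming pandas with the openpyxl engine is installed; the function name and the `stem_sheet.csv` naming scheme are just illustrative choices, not anything KNIME-specific:

```python
import pandas as pd
from pathlib import Path

def split_workbook_to_csv(xlsx_path, out_dir):
    """Write one CSV file per sheet of an .xlsx workbook; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # sheet_name=None makes read_excel return a dict {sheet name: DataFrame}
    sheets = pd.read_excel(xlsx_path, sheet_name=None)
    written = []
    for sheet_name, df in sheets.items():
        # NOTE: sheet names are assumed to be filesystem-safe here;
        # real workbooks may need sanitizing first
        target = out_dir / f"{Path(xlsx_path).stem}_{sheet_name}.csv"
        df.to_csv(target, index=False)
        written.append(target)
    return written
```

Running this once per source file would give you the "one sheet per file" layout to benchmark against reading the multi-sheet workbooks directly.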
I have been through this problem before too, and one of the bottlenecks I experienced was not only reading the Excel sheets one by one but especially the -Read Excel Sheet Names- node.
As far as I understood, the java library used by the -Read Excel Sheet Names- node reads the whole Excel document just to figure out the names of the Excel Sheets, which eventually turns out to be quite inefficient.
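For what it's worth, since an .xlsx file is really a zip archive, the sheet names can be pulled from xl/workbook.xml with the Python standard library alone, without parsing any cell data; a minimal sketch (outside KNIME, or via a Python scripting node):

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the SpreadsheetML workbook part
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def sheet_names(xlsx_path):
    """Return the sheet names of an .xlsx file without reading any cell data."""
    # An .xlsx file is a zip archive; the sheet list lives in xl/workbook.xml,
    # so only that small XML part needs to be decompressed and parsed.
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return [s.get("name") for s in root.iter(f"{NS}sheet")]
```

This stays fast no matter how big the workbook is, because the workbook part is tiny compared with the sheet data.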
Have you measured the time every node takes to execute in your example? Is it also the case for you that the -Read Excel Sheet Names- node takes as long as the execution of the loop?
Yep, the Read Excel Sheet Names node takes a little time, but since it is outside the loop it is not a big deal. Maybe 60 seconds. In contrast, the Excel Reader takes around six minutes per load, i.e., 5 hours when I get to the 50-file mark.
This is the reason why I wanted a simple loop. As minimal as possible. Only reading inside the loop.
I think I will try making one csv per sheet and downloading to local.
I'll keep you posted.