I am trying to build a workflow that reads lots of big xlsx files from a folder. The reading times are not ok.
One file can have around 20-25 sheets with ~3000 rows each. By now I have 50 files, and more are added each day.
I tried with this workflow, without success:
Problems:
Mediocre pc.
I have to work over a VPN, which makes the processing slower than normal.
There would not be any problem with this: I could convert to .csv, since the files only contain text and numbers.
But I think I need the loop, because I am not fully sure that every file contains the same number of sheets.
Do you have the option of working with a remote connection to a computer located within the company network and running KNIME on that computer? I guess this is the fastest option.
I’m working with this setup without any performance issues.
Good idea!
As far as I know, I am the only person using KNIME that I can reach within the company, so I would have to ask for it.
As a matter of fact, I would like to try to teach this software, but projects like this usually die fast.
Maybe I can implement this. Thank you!
Yes. You might also try to ZIP your file, download it, and then import it. Depending on the version of Excel, there might already be internal compression.
The question would be where the bottleneck is: the import, or the VPN connection? You might have to try.
Downloading improves the speed but it is not enough.
I have to make other daily imports with files that can be bigger than any single sheet within these files. It is the combination of several sheets inside each file that makes the loading painfully slow, I think. The reader has to read 20 sheets per file.
Could I improve the speed by extracting each sheet from every file, converting it to .csv, and then reading every file (each of which would contain only one sheet)?
I.e.: which is the fastest strategy, reading every sheet inside the files or reading one sheet per file? Note that, if I implement the latter option, I could end up with around 20 files per original file.
With the number of files I manage right now, that would be almost 1000 files with 3000 rows each.
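Outside of KNIME, the one-CSV-per-sheet split can be scripted. A minimal sketch, assuming Python with the openpyxl library is available; the function and file names here are made up for illustration. `read_only=True` streams rows instead of loading the whole workbook into memory, which matters with many sheets:

```python
import csv
from pathlib import Path

from openpyxl import load_workbook  # assumed installed: pip install openpyxl

def split_xlsx_to_csvs(xlsx_path, out_dir):
    """Write one CSV file per sheet of the given workbook; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    wb = load_workbook(xlsx_path, read_only=True)  # streaming mode, low memory
    written = []
    for name in wb.sheetnames:
        out_path = out_dir / f"{Path(xlsx_path).stem}_{name}.csv"
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for row in wb[name].iter_rows(values_only=True):
                writer.writerow(row)
        written.append(out_path)
    wb.close()
    return written
```

Run once over the folder, this would turn 50 files of ~20 sheets into ~1000 small CSVs, which a CSV reader node can then pick up without any xlsx parsing overhead.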
I have been through this problem before too, and one of the bottlenecks I experienced was not only reading the Excel sheets one by one, but especially the -Read Excel Sheet Names- node.
As far as I understood, the Java library used by the -Read Excel Sheet Names- node reads the whole Excel document just to figure out the names of the Excel sheets, which turns out to be quite inefficient.
Have you measured the time every node takes to execute in your example? Is it also the case for you that the -Read Excel Sheet Names- node takes as long as the execution of the loop?
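For context on why reading the whole document for sheet names is wasteful: an xlsx file is a ZIP archive, and the sheet names live in one small part, `xl/workbook.xml`. A stdlib-only Python sketch that lists them without touching any sheet data (the namespace URI is the standard SpreadsheetML one; this is an illustration, not what the KNIME node does internally):

```python
import zipfile
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by xlsx workbook parts.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def list_sheet_names(xlsx_path):
    """Read only xl/workbook.xml from the archive and return the sheet names."""
    with zipfile.ZipFile(xlsx_path) as z:
        root = ET.fromstring(z.read("xl/workbook.xml"))
    return [s.get("name") for s in root.iter(f"{MAIN_NS}sheet")]
```

Because only one small XML member is decompressed, this stays fast no matter how large the sheets themselves are.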
Yep, the Read Excel Sheet Names node takes a little time, but since it is outside the loop it is not a big deal: maybe 60 seconds. In contrast, the Excel Reader takes around six minutes per load, i.e. 5 hours once I get to the 50-file mark.
This is the reason why I wanted a simple loop, as minimal as possible: only reading inside the loop.
I think I will try making one csv per sheet and downloading to local.
I'll keep you posted.
2nd: I tell the Excel Reader, via the loop, what I need it to read. I find this step to be highly important: it has to read “file”, not “files in folder”, via the flow variable.
After that, the rest of the workflow is prepping. As you can see, the tedious part is the setup. I have to make one per sheet, which is slow and manual, but once I have it ready, it is so much faster.
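The loop shape described here (each iteration handed exactly one file, with the sheets iterated inside) can be sketched outside KNIME as follows; a rough sketch, assuming openpyxl is installed and all workbooks sit in one folder, with made-up names throughout:

```python
from pathlib import Path

from openpyxl import load_workbook  # assumed installed: pip install openpyxl

def read_folder(folder):
    """Mimic the KNIME loop: one file per iteration, then each sheet of that file.

    Returns a dict keyed by (file name, sheet name) so files with differing
    sheet counts are handled naturally, as in the flow-variable setup.
    """
    tables = {}
    for xlsx_path in sorted(Path(folder).glob("*.xlsx")):  # one "file" per pass
        wb = load_workbook(xlsx_path, read_only=True)
        for name in wb.sheetnames:  # inner loop over that file's sheets only
            rows = [list(r) for r in wb[name].iter_rows(values_only=True)]
            tables[(xlsx_path.name, name)] = rows
        wb.close()
    return tables
```

The point mirrors the flow-variable configuration: the reader never scans the whole folder per iteration, so each pass does the minimum work for one file.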