KNIME 4.5.2 Excel Reader extremely Slow when reading from other sheet different than the first

I’ve the same issue since the upgrade to the latest Knime version 4.5.2. I also deal heavely with Excel Sheets and the execution time just exploded. Processing before was in the minutes (dependend on the hardware I used 3 up to 6 minutes), now it’s on my fastest desktop 45 minutes! This kills me and I found no way (option) to speed this up. I invested a lot of time and logic to minimize file access (to speed up the process) by using the multiple tab option of the “new” Excel reader and writer. This worked fines until last week. I did not went as far as my “fellow sufferer” and did not test if the first read (or write) works fast but what I can approve is, that multi-tab Excel performance is worser than worse. I added (as you proposed) a timer info to one of the Excel writer components, the result is attached
Timer_info.txt (10.2 KB)

2 Likes

Hi @marcogertsch,
thanks for the additional information. So far I could not reproduce the issue. Can you share your KNIME workflow? If that is not possible, can you share the node settings of the excel nodes?
Also, did you change to the knime.ini file in any way (like setting a custom xmx value)?

2 Likes

Hello, here is the sample workflow, Im not sending full workflow, just the section that reads the excel files,…

the same workflow has three excel files as samples, just execute it, its mapped with variants, you will notice that the speed of reading its extremelly slow, even more thinking that these excels have less than 1000 lines each.

Slow reading test.knar (2.4 MB)

if I roll back to 4.4.X the workflow runs just perfect. as @marcogertsch stated nothing has changed… Im also sending an screenshot if my KNIME.ini so you can take a look, I didnt execute any change on it.

1 Like

Just to support my buddy. I measured the performance of the Excel writer within my current workflow. The node has the target to write 6 Excel files with the same structure but different data but the same 10 tabs. As mentioned before, over christmas I redesigned the workflow in so it collects and prepares all tabs and writes them then in one rush. This helped my overcome of sync issues with my OneDrive (wirting to a file just beeing synchronized failes). The Excel writer shows the following progress:
00:23 > 00%
00:52 > 02%
01:15 > 04%
01:34 > 12%
01:49 > 16%
02:03 > 51%
02:36 > 54%
02:50 > 55%
03:05 > 75%
03:11 > 100%
So under the bottomline the process to write those 6 files endures approximately 18 Minutes. Each of those files is 700 kB so no big deal. In the early days, this part of the process lasted less than a minute.

2 Likes

I want to share my workflow with you, but the limit here is 4MB and the files is 22MB. Can I send it to you by mail. As the workflow contains customer data this option would also be prefered, as I don’t want to share this public. So can you provide me an alternative data exchange option? Many Thanks!

1 Like

Thank you @GQRanalytics and @marcogertsch for providing the workflows. While I could not reproduce the increased execution time between 4.5.2 and 4.4.4, the execution time of the Excel Reader and Writer nodes is nonetheless fairly high (especially because the files are relatively small). I therefore opened a ticket with our development team and will update this topic when I get feedback.

4 Likes

thanks! I will be waiting, my customer is running with 4.4.x version, and I have a pending change to upgrade their workflow.

hellow @joergwas do you know how many time will take to have a workarround?

No, at the moment I cannot give a concrete timeline for this. I will update this post when I have news.

Hello, do you happen to have a clue on this? I just installed the 4.6.0 and the issue remains, It only runs well in 4.3 version… Im kinda frustrated here, can you please help me out?

Have you tried running them in sequence and use the garbage collector?

1 Like

Hello Daniel, the problem start with just configuring the excel node, as soom as I use the option of sheet 2/3 etc the reader acts reaaally slow…but if I leave first sheet it works just perfect… the garbage collector wont help here ;(

@GQRanalytics I tried a few things and indeed the node takes quite some time to load. My impression is this is due to

  • empty formats running down the sheets
  • cells are protected
  • simultaneous access to the same file

My guess is that the current version of Excel Reader has a problem with protected cells and their content. Not sure why that is.

What you could do is try and use R/RStudio or Python to import the data (not very elegant I know). I think Pandas has a generic Excel reader and the latest KNIME version can have Python (with Pandas) bundled - so this might be an interim solution without having to install additional software.

1 Like

@GQRanalytics if you absolutely must you could indeed employ Python and OpenPyxl (with pandas) to import your data pretty fast. Unfortunately OpenPyxl is not (yet?) in the standard environment of KNIME 4.6 native python integration so my example still uses Conda Environment propagation.

aside: @carstenhaubold this would be another case for having OpenPyxl on board :wink:

kn_example_python_excel_import_knime.knwf (796.4 KB)

This approach might be (well) over engineered for your task but once Python integration has been updated there might be additional use cases that might easily be done just by KNIME and its Python node.

3 Likes

Iam currently investigating what is the cause of this, on KNIME side as well as on Excel. What I found out is that if you display all columns (currently a lot of them are hidden) it reads the files very fast. So you could try to unhide all columns which can be quiet time consuming and annoying. Another workaround for the time being is to use XLS files instead of XLSX or use the option “Reevaluate formulas” in the Advanced settings tab if possible (here the file is treated as a XLS File). For the latter approach you should be aware that the whole file needs to be loaded in memory which can cause some problems if the files are big.

Either way, we hope we can we can find a solution as soon as possible!

I think it has something to do with the format of the File(s) which we (KNIME) or the underlying used library does not like. I could not reproduce the issue by creating an own file which hides a lot of columns. Do you use some special formatting or something like that?

Best regards
Lars

5 Likes

Hello @laaaarsi I followed your guidelines, uncheck the skip hidden columns/rows and check the revaluate formula and now its getting up to the normal speed… for now I think this has solved my issue but would be nice to review why other KNIME version works perfect with previous settings :slight_smile: thanks again for the help!

4 Likes

You are welcome! To correct myself, I did not see any speedup when unchecking skip hidden columns/rows within KNIME but unhiding columns within the sheet did.
Yes we are currently working on finding out what have changed!

2 Likes

actually those hidden checks alone dont do the trick, needs also the reevaluate formulas check, and I mean it run a little faster in the previous KNIME Version, but with the current execution time is not representing major delays to my client… thanks again and I will be on top of you future response in case there is another update, thanks again!

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

Hello,
we just release version 4.7 of the KNIME Analytics Platform which removes the performance problem with hidden columns.
For an overview of all the new features have a look at the What’s new page. For all the details go to the changelog.
Bye
Tobias

1 Like