Hello guys, I realized of an issue related to the Excel reader node, this just happens with the KNIME versions 4.5.X
When you read an excel using the “first sheet option” it works just fine, but when you select any other sheet, it turns extremelly slow, as an example I have a project that reads about 100 excel files, when I use the KNIME version 4.4.x it takes about 5 minutes to complete total workflow execution, however with the new version 4.5.X (1 and 2) is taking almost an hour to just read the same files, after the reading the workflow works perfect.
I did some troubleshooting to descart other possible options, such as different PCs, different networks, different file locations, everithing worked slow, but when I tryied to just simply read first sheet it worked fine.
The workflow by it selve is fine, as stated using the same workflow in a previous KNIME version it works great…
Have anybody else noted this? any options to deal with this?
thank you for your input.
I was trying to recreate your issue with 3 excel reader nodes that read the same table from 3 different sheets in a single excel file. One excel reader node is configured with “select first sheet with data” and the other 2 excel reader nodes are configured with “select sheet with name:”. I timed the execution time with the “Timer Info” Node. However, the difference in execution time was only a few hundred milliseconds at most (each table contains 1 mil. rows). Is the setup I used similar to yours? Also, could you please execute the “Timer Info” node (Timer Info – KNIME Hub) as well and share the results?
Additionally, when you updated your KNIME version, did by chance some custom settings get overwritten in the knime.ini file (like the xmx value)?
I’ve the same issue since the upgrade to the latest Knime version 4.5.2. I also deal heavely with Excel Sheets and the execution time just exploded. Processing before was in the minutes (dependend on the hardware I used 3 up to 6 minutes), now it’s on my fastest desktop 45 minutes! This kills me and I found no way (option) to speed this up. I invested a lot of time and logic to minimize file access (to speed up the process) by using the multiple tab option of the “new” Excel reader and writer. This worked fines until last week. I did not went as far as my “fellow sufferer” and did not test if the first read (or write) works fast but what I can approve is, that multi-tab Excel performance is worser than worse. I added (as you proposed) a timer info to one of the Excel writer components, the result is attached
Timer_info.txt (10.2 KB)
thanks for the additional information. So far I could not reproduce the issue. Can you share your KNIME workflow? If that is not possible, can you share the node settings of the excel nodes?
Also, did you change to the knime.ini file in any way (like setting a custom xmx value)?
Hello, here is the sample workflow, Im not sending full workflow, just the section that reads the excel files,…
the same workflow has three excel files as samples, just execute it, its mapped with variants, you will notice that the speed of reading its extremelly slow, even more thinking that these excels have less than 1000 lines each.
Slow reading test.knar (2.4 MB)
if I roll back to 4.4.X the workflow runs just perfect. as @marcogertsch stated nothing has changed… Im also sending an screenshot if my KNIME.ini so you can take a look, I didnt execute any change on it.
Just to support my buddy. I measured the performance of the Excel writer within my current workflow. The node has the target to write 6 Excel files with the same structure but different data but the same 10 tabs. As mentioned before, over christmas I redesigned the workflow in so it collects and prepares all tabs and writes them then in one rush. This helped my overcome of sync issues with my OneDrive (wirting to a file just beeing synchronized failes). The Excel writer shows the following progress:
00:23 > 00%
00:52 > 02%
01:15 > 04%
01:34 > 12%
01:49 > 16%
02:03 > 51%
02:36 > 54%
02:50 > 55%
03:05 > 75%
03:11 > 100%
So under the bottomline the process to write those 6 files endures approximately 18 Minutes. Each of those files is 700 kB so no big deal. In the early days, this part of the process lasted less than a minute.
I want to share my workflow with you, but the limit here is 4MB and the files is 22MB. Can I send it to you by mail. As the workflow contains customer data this option would also be prefered, as I don’t want to share this public. So can you provide me an alternative data exchange option? Many Thanks!
Thank you @GQRanalytics and @marcogertsch for providing the workflows. While I could not reproduce the increased execution time between 4.5.2 and 4.4.4, the execution time of the Excel Reader and Writer nodes is nonetheless fairly high (especially because the files are relatively small). I therefore opened a ticket with our development team and will update this topic when I get feedback.
thanks! I will be waiting, my customer is running with 4.4.x version, and I have a pending change to upgrade their workflow.
hellow @joergwas do you know how many time will take to have a workarround?
No, at the moment I cannot give a concrete timeline for this. I will update this post when I have news.
Hello, do you happen to have a clue on this? I just installed the 4.6.0 and the issue remains, It only runs well in 4.3 version… Im kinda frustrated here, can you please help me out?
Have you tried running them in sequence and use the garbage collector?
Hello Daniel, the problem start with just configuring the excel node, as soom as I use the option of sheet 2/3 etc the reader acts reaaally slow…but if I leave first sheet it works just perfect… the garbage collector wont help here ;(
@GQRanalytics I tried a few things and indeed the node takes quite some time to load. My impression is this is due to
- empty formats running down the sheets
- cells are protected
- simultaneous access to the same file
My guess is that the current version of Excel Reader has a problem with protected cells and their content. Not sure why that is.
What you could do is try and use R/RStudio or Python to import the data (not very elegant I know). I think Pandas has a generic Excel reader and the latest KNIME version can have Python (with Pandas) bundled - so this might be an interim solution without having to install additional software.
@GQRanalytics if you absolutely must you could indeed employ Python and OpenPyxl (with pandas) to import your data pretty fast. Unfortunately OpenPyxl is not (yet?) in the standard environment of KNIME 4.6 native python integration so my example still uses Conda Environment propagation.
aside: @carstenhaubold this would be another case for having OpenPyxl on board
kn_example_python_excel_import_knime.knwf (796.4 KB)
This approach might be (well) over engineered for your task but once Python integration has been updated there might be additional use cases that might easily be done just by KNIME and its Python node.
Iam currently investigating what is the cause of this, on KNIME side as well as on Excel. What I found out is that if you display all columns (currently a lot of them are hidden) it reads the files very fast. So you could try to unhide all columns which can be quiet time consuming and annoying. Another workaround for the time being is to use XLS files instead of XLSX or use the option “Reevaluate formulas” in the Advanced settings tab if possible (here the file is treated as a XLS File). For the latter approach you should be aware that the whole file needs to be loaded in memory which can cause some problems if the files are big.
Either way, we hope we can we can find a solution as soon as possible!
I think it has something to do with the format of the File(s) which we (KNIME) or the underlying used library does not like. I could not reproduce the issue by creating an own file which hides a lot of columns. Do you use some special formatting or something like that?
Hello @laaaarsi I followed your guidelines, uncheck the skip hidden columns/rows and check the revaluate formula and now its getting up to the normal speed… for now I think this has solved my issue but would be nice to review why other KNIME version works perfect with previous settings thanks again for the help!
You are welcome! To correct myself, I did not see any speedup when unchecking skip hidden columns/rows within KNIME but unhiding columns within the sheet did.
Yes we are currently working on finding out what have changed!
actually those hidden checks alone dont do the trick, needs also the reevaluate formulas check, and I mean it run a little faster in the previous KNIME Version, but with the current execution time is not representing major delays to my client… thanks again and I will be on top of you future response in case there is another update, thanks again!