I have a workflow which is not too complex; in total it takes 20-30 seconds to execute. A major issue occurs when I add nodes that write to CSV and Excel files. Generally I don't want to write to CSV, but I had to use it as an option in my case.
I have a table of approximately 260,000 rows and 25 columns. There was never a problem during node execution until I started writing to a new Excel file. Writing this table took 50 minutes and produced an Excel file of 26 MB. I then added a CSV Writer node to see if it was quicker, and it was: it finished in about 10-15 seconds with a file size of 57 MB.
This slow Excel writing feels very weird. I tried increasing -Xmx in the knime.ini file to 12111MB (I have 16 GB RAM) and also raised the number of cells held in memory to 10,000,000, but the speed did not improve at all. When I opened Task Manager in Windows 10 to see if KNIME was using 12 GB of memory, it was only using 1.5 GB. What is going on here, and is there some way to speed up this extremely slow writing?
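For reference, the two settings mentioned are made in knime.ini roughly like this (the values below are examples matching what I tried, not recommendations; the cells-in-memory property name is as I understand it from the KNIME docs):

```
-Xmx12g
-Dorg.knime.container.cellsinmemory=10000000
```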
Currently I feel I could have done the entire operation with Excel formulas or VBA in the time the writing took in this case. If I can't speed it up, I don't see the benefit of continuing to use KNIME.
Knime.ini file content:
It does feel very weird. To my knowledge, writing to a CSV file should indeed be faster than writing to an Excel file, but 50 minutes vs. 15 seconds means something is not right.
A couple of questions: Standard data types? Same location for the CSV and Excel files? Written locally? Antivirus program scanning?
I have tested the Excel Writer node and it took me less than a minute to write 260,000 rows and 24 columns. I have attached the workflow so you can download and test it. The Excel file will be written to the output folder in the workspace directory.
2019_05_14_Writing_Excel_File.knwf (11.9 KB)
As for your modifications to the KNIME configuration file, they are fine but won't help much in this case.
Are you saving as XLS or XLSX? We use the Apache POI libraries for Excel reading and writing; XLS is very memory-heavy - no way around that with those libraries - but I believe we try to use the "streaming" version of the XLSX writer.
Thanks for your response!
I have input and output folders within the workflow folder structure, so they are all close to each other. So far I have only used input files in either CSV or XLSX format, and all of it is saved on my local drive. The company I work for has certainly installed some kind of antivirus scanning, and I am not an admin user on my computer.
I tested your example and the Excel Writer took 23 seconds according to the Timer Info node. I don't know why yours performs so much quicker. What can I look into? My entire workflow is fine apart from that last piece.
I would go for CSV in this case, but it actually messes up all my formats. When I import the CSV into Excel, all numbers become strings (in my country "," is used as the decimal separator, not "."), and the unique article numbers, which might start with "000", all turn into numbers with the leading zeros removed. I guess there is little I can do to format the CSV, or to switch the decimal separator to "," in KNIME.
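As a side note, the locale issue can be worked around at the file level. A minimal Python sketch (hypothetical sample data, not part of the workflow) of writing a semicolon-delimited CSV with decimal-comma numbers, the convention Excel expects in such locales:

```python
import csv
import io

# Hypothetical sample rows: article numbers with leading zeros and
# prices that should use a decimal comma, as in many European locales.
rows = [("000123", 4.5), ("000987", 12.75)]

buf = io.StringIO()
# Semicolon as the delimiter is what Excel expects in locales where
# "," is the decimal separator; QUOTE_ALL quotes every value in the file.
writer = csv.writer(buf, delimiter=";", quoting=csv.QUOTE_ALL)
writer.writerow(("article", "price"))
for article, price in rows:
    # Render the number with a decimal comma so Excel parses it as numeric.
    writer.writerow((article, f"{price:.2f}".replace(".", ",")))

print(buf.getvalue())
```

Note that quoting alone does not stop Excel from stripping leading zeros when the file is opened with a double-click; importing the article column explicitly as Text (via Excel's text import) is what preserves them.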
It could be complex filtering of the input data. Say you filter 260K rows from 3 million rows.
In that case you are still working with the full set of 3 million rows, just by reference. Use a Cache node after this type of filtering.
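A loose Python analogy for the reference-vs-materialized distinction (this illustrates the idea, not KNIME's actual internals):

```python
# A "filtered" result that only references the original data keeps the
# whole source alive in memory.
big = bytearray(3_000_000)          # stand-in for a 3-million-row table
view = memoryview(big)[:260_000]    # filtered view into the big buffer

# Materializing the slice - roughly what the Cache node does for a table -
# yields an independent object of just the filtered size.
cached = bytes(view)

print(len(view), len(cached))
```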
Considering that the example workflow I provided writes just fine, I would try @izaychik63's suggestion if you have such a case. My assumption is that this behavior is connected to your data or your workflow/node configuration. If you are able to share the workflow or data (or only part of the data), we can test it and perhaps get to the bottom of this.
Additionally, you can always check your KNIME log file. There may be some indication there.
You could try some different export formats besides the original KNIME table format. Excel, I think, is quite slow because it has to be specially processed. If you want better performance you might have to use streaming or a different format. Most of the formats I compared would preserve your variable formats.
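To illustrate the "preserve your variable formats" point with a small stdlib-only sketch (pickle stands in here for binary formats like Parquet/ORC or the native KNIME table format; the sample row is hypothetical):

```python
import csv
import io
import pickle

# Hypothetical row mixing a string ID (with leading zeros) and a float.
row = {"article": "000123", "price": 4.5}

# CSV flattens everything to text; types must be guessed again on read.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=row)
writer.writeheader()
writer.writerow(row)
reread = next(csv.DictReader(io.StringIO(buf.getvalue())))

# A binary format round-trips the original column types untouched.
restored = pickle.loads(pickle.dumps(row))

print(type(reread["price"]).__name__, type(restored["price"]).__name__)
```

After the CSV round trip the price comes back as the string "4.5", while the binary round trip returns the original float and keeps the article number's leading zeros.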
R is there as a test of whether it handles your memory issues differently. To speed things up, it obviously would not make sense to first send the data to R and then export it.
Aside: yes, I am aware a true comparison would need the machine to run the exact same tasks during each test. So it is more of an estimate, but maybe still useful.
Expanded @ipazin workflow:
2019_05_14_Writing_Excel_File.knwf (409.2 KB)
Thanks for the support, I really appreciate it! I used the Cache node a bit here and there.
I actually managed to figure out what was slowing down the Excel writing, and I brought it down to 10-15 seconds. I deselected "Autosize Columns" in the Layout section of the configuration, and that magically made it so much quicker. Almost a bit insane. But I am really happy to have found the small detail that caused the slow writing.
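That finding is plausible: auto-sizing implies at least one extra full scan over every cell of every column just to find the widest entry, and (as I understand it) Apache POI additionally measures rendered text width per cell. A rough Python sketch of the scan cost, on made-up data:

```python
# Sketch of why "Autosize Columns" is expensive: one extra full pass over
# every cell of every column just to find the widest entry per column.
rows = [[str(r * c) for c in range(1, 26)] for r in range(1, 1001)]

def autosize_widths(table):
    """Return the maximum string length per column - an O(rows x cols) scan."""
    n_cols = len(table[0])
    return [max(len(row[col]) for row in table) for col in range(n_cols)]

widths = autosize_widths(rows)
# For the 260,000 x 25 table from this thread, this scan alone would touch
# 6.5 million cell values before a single byte is written to disk.
print(widths[:3])
```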
However, I would also be interested to learn about the other export formats that work and preserve variable formats. Do you know if there is a site with more detailed info on the different writing options and how to use them? I'm also thinking of the comment made by @mlauber71 about different approaches.
Excel is a very tricky beast and something I've become very wary of. It can, and will, silently alter your data to fit its idea of what it should be. This has tripped up a few serious scientific studies; check out these references:
Erlwood and Continental have Excel formatting nodes.
Surprised auto-sizing took that long, but glad you figured it out.