Excel Writer and Sheet Appender nodes VERY slow to write

sirish · November 1, 2018, 4:14pm

I have noticed that recently the Excel Writer and Excel Sheet Appender nodes write extremely slowly compared with nodes like the CSV writer. For example, writing 30k rows x 15 columns takes a few seconds using CSV writer, but over 15 minutes using either Excel node.

I tried using the CSV node to write to a file with an .XLSX extension, which goes very fast, but doesn’t open as it says the file is corrupted.

Any ideas on what could be causing this?

Thank you.

aborg · November 1, 2018, 11:50pm

I think the Excel Writer node is not capable of streaming writing (practically write row-by-row), so it needs to create the whole xls(x) file model in-memory. If the memory assigned to KNIME is close to the memory used during xls(x) writing, the process will be very slow. You can try assigning more memory.

(csv and xlsx are very different file formats, no surprises that renaming is not enough.)

borjaserrano · September 19, 2019, 11:13am

Old question, but check the “Auto-size columns” option which makes things go very slow

borjaserrano · September 19, 2019, 12:02pm

And just one other thing. It is important that the Excel file you are opening is not big, otherwise it will take some time to load as well as Apache POI, the underlying library that KNIME uses, opens the whole file in memory each time you use the node.

zedleb · April 8, 2020, 10:00am

Hi All,

Reviving this topic with a specific query regarding writing over a VPN to a network.

Due to the current situation and having to work from home, this workflow is now causing considerable difficulties as it tries to write over a VPN to my work network.

For comparison writing the Excel file locally takes ~7seconds.

Now the execution is more like 45-60 minutes!

I have noticed that often the Excel Writer ‘pauses’ at 80% for a long time, I see the same 80% pause when writing locally but obviously it’s for a split-second.

Our network can be painfully slow at a distance but not so bad as to take 60 mins to save a 5mb file.

A Knimer (Ana) asked me to check whether our network is mounted via samba. I’ve had this response so far (translated from French). He said he has forwarded the question to the server admin.

We are pure Windows, unix / linux only use on esx via VMware ESX

Samba is a set of protocols to run unix / linux and windows together.

And what I don’t know is the proprietary communication of ESX (physical server that runs all of the brecey / granville servers)

Any ideas/suggestions to try - most appreciated!
(For info - the autosize columns is unchecked )

Here’s an output of Timer Info node

RowID	92	92	92	1	1	4:142:0:124	org.knime.base.node.preproc.rowkey2.RowKeyNodeModel2
Sorter	14	14	14	1	1	4:142:0:125	org.knime.base.node.preproc.sorter.SorterNodeModel
Row Filter	2	2	2	1	1	4:142:0:130	org.knime.base.node.preproc.filter.row.RowFilterNodeModel
Concatenate	0	0	0	1	1	4:142:0:132	org.knime.base.node.preproc.append.row.AppendedRowsNodeModel
CSV Writer	490180	490180	490180	1	1	4:142:0:135	org.knime.base.node.io.csvwriter.CSVWriterNodeModel
Rule-based Row Filter	91	91	91	1	1	4:142:0:136	org.knime.base.node.rules.engine.RuleEngineFilterNodeModel
Component Input	0	0	1	1	2	4:142:0:139	org.knime.core.node.workflow.virtual.subnode.VirtualSubNodeInputNodeModel
Component Output	0	0	1	1	2	4:142:0:140	org.knime.core.node.workflow.virtual.subnode.VirtualSubNodeOutputNodeModel
Excel Reader (XLS)	12819	12819	12819	1	1	4:142:0:144	org.knime.ext.poi2.node.read4.XLSReaderNodeModel
Table Row to Variable	0	0	0	1	1	4:142:0:145	org.knime.base.node.flowvariable.tablerowtovariable2.TableToVariable2NodeModel
Excel Reader (XLS)	78539	78539	78539	1	1	4:142:0:147	org.knime.ext.poi2.node.read4.XLSReaderNodeModel
Excel Writer (XLS)	4235530	4235530	4235530	1	1	4:142:0:149	org.knime.ext.poi2.node.write3.XLSWriter2NodeModel
Excel Writer (XLS)	10	10	10	1	1	4:142:0:150	org.knime.ext.poi2.node.write3.XLSWriter2NodeModel
Timer Info	7	0	7	0	1	4:142:0:151	org.knime.base.node.util.timerinfo.TimerinfoNodeModel

izaychik63 · April 8, 2020, 12:28pm

First - do not use Autosize Columns option. And try to save locally and copy result later to the location.

zedleb · April 8, 2020, 1:16pm

Thanks Izaychik, I’ll try the Copy/Move Files (didn’t know of that Node)
Autosize is not checked (I read elsewhere that that could slow things down)

zedleb · April 8, 2020, 3:14pm

OK I tried with Copy/Move Files - it was a little faster but not dramatically. And this is with one of the smaller files I need to create too (only 1mb)

I did a manual copy in Explorer and you can see how painfully slow the transfer speed is…so I guess it does come down to a slow network.

Any other suggestions how to overcome this bottleneck?

Perhaps the way I’m attempting to do the workflow is faulty…my aim is to replace the last 8 days of production in a base csv file (this is done on a daily basis with an new extract of the last 8 days - this way we catch changes to the production database that have been backdated within the last week).

The problem is that the daily 8-day extract is in Excel and I had no end of fun with the format of dates between the two file formats…

mlauber71 · April 11, 2020, 5:39am

Could it be an idea to ZIP the files upload them an unzip them? The unzipping would have to take place on the remote system.

Next idea could be too use parquet format which already can be stored compressed and can be reused which is also true of knime tables themselves.

Next idea could be to work with chunks or partitions. Here also parquet could play a role along with the local big data nodes of knime. Maybe not the most straightforward solution but maybe you can think about that.

Then you might have to think about the whole setup again and get an idea when to do what with which chunk of data. If you elaborate more on your setup we could try to brainstorm some.

zedleb · April 15, 2020, 4:47pm

Hi @mlauber71 thanks so much for taking the time to reply.

I tried zipping a file today (in another workflow that needed attention - with a very large csv and ticking the compression option). Unfortunately, trying to unzip it over my VPN has been running for over an hour and reportedly still has 1.5 hours to go! So, though the writing of the compressed csv was much quicker - it doesn’t solve the outstanding problem. In other words even if I zip the outputs I’ve got the network issue.

I have used the parquet format, and used the copy/move node to achieve writing the parquet file to my network (which is not directly possible in the parquet writer). This indeed may be an option.

I agree that I need to revisit the whole setup as the reason I’m trying to save in Excel is because of the difficulties I had with date formats being incompatibly read back in from a saved csv. Maybe parquet will work better as a base file with new Excel read data.

My network limitations at the moment make development extremely challenging as testing anything takes so long I’ve nearly forgotten what I was testing by the time the outputs are ready! I may just have to abandon until I off the VPN and back in the office…

I will keep the thread updated with developments.
Thanks again for your interest and suggestions!

izaychik63 · April 15, 2020, 4:55pm

@zedleb, the slow network may be a result of antivirus. As for .xls file zipping is not suppose to work as .xls file is a zipped .xml file. Possibly you can fix date format in KNIME and create new CSV to be used without problem. Anyway all these are not a KNIME issues.

ana_ved · April 16, 2020, 7:32am

Hi @zedleb

If writing a csv file is faster in our network, I agree with @izaychik63 that fixing the date formats in a way compatible to your csv should be a good solution. Can you describe a bit more how is your date format and why it is not compatible?

zedleb · May 6, 2020, 1:13pm

Hi all

I’ve had a chance to work properly on the workflow and yes - I’ve got it working well at last. Compressing the csv is a great help too.

The date formats were an incredible headache but more because I had to read and write and each time I would get very confused. I resolved this by having a permanent clean import and then a local and network read and write (during the development)

I would like to keep the options to read and write both locally and to/from the network and thought I could put the CSV writer outside the component but don’t know how to put the File Readers (which read the compressed CSVs) outside the component or someway of allowing a selection outside of the component.

I’ll keep researching this - but if anyone has a quick solution it’s most appreciated!

Thanks for the all the support.
I have a separate question about File Reader but I will put that as a new issue

zedleb · May 15, 2020, 7:23am

Thought I’d give a quick update on the speed of the workflow now I had an opportunity to run it directly on the network and not via the VPN.

I have 4 identical workflows 2 of which take production data and 2 which take stoppages (each from one of our sites).

I ran all 4 together and it took 9.16 seconds!

So we could say a bit of an improvement