Parquet Writer - Array size exceeds VM limit

I'm trying to write a KNIME table (11 million rows, 77 columns) to S3, but the Parquet Writer node fails with the message "Execute failed: Requested array size exceeds VM limit". The same happens with the ORC Writer node. I'm running an EC2 instance with 256 MB of RAM and 36 cores.

256 MB of RAM? Is that a typo? Do you mean 256 GB? One suggestion I would have is to work with smaller batch sizes, and make sure the node's memory policy is set to write tables to disk rather than caching them in memory.

Yes, it's GB. I got the memory consumption down to 18% and it still gives the same error, so it seems to be a problem unrelated to memory consumption.

Hi there @DXRX,

does it work as expected with fewer rows?

Both the Parquet Writer and ORC Writer nodes have a Chunk Upload tab when writing data to remote destinations. Can you try decreasing the maximum size of an individual chunk and/or the number of local chunks?

Br,
Ivan

Yes, I got to the bottom of it: it turned out to be a data issue producing a misleading error message. I adjusted the maximum chunk size without any joy, and tried it against both the local big data environment and S3, still no joy.
So I partitioned the data into two parts; the first part went through and the second half failed, which indicated the problem resided in the data itself. I then re-read the data with the Line Reader to identify the row causing the issue: there was a single stray " character in the data, and with the short-lines and rows-spanning-multiple-lines options enabled, it threw the File Reader's parsing out…
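For anyone hitting the same thing, the failure mode described above can be reproduced outside KNIME. This is a hypothetical sketch using Python's stdlib `csv` module (not the File Reader itself, whose parser differs in detail): with quote-aware, multi-line parsing, a single unmatched `"` swallows every subsequent line into one enormous field, which is how one bad character can eventually produce a value large enough to exceed JVM array limits downstream.

```python
import csv
import io

# Hypothetical sample data: row "1" contains a single unmatched double quote.
raw = 'a,b\n1,"oops\n2,fine\n3,also fine\n'

# A quote-aware reader treats everything after the stray quote as one
# quoted field spanning multiple lines, so the remaining records vanish
# into a single oversized value instead of being parsed as separate rows.
rows = list(csv.reader(io.StringIO(raw)))

print(len(rows))      # only 2 logical rows come back: header + one merged row
print(rows[1][1])     # the second field contains all the remaining file content
```

Partitioning the input (as done above) or re-reading it with a plain line-oriented reader are both good ways to bisect down to the offending row.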


Hi there @DXRX,

Nice approach. Glad you figured it out. Will mark it as solved.

Br,
Ivan


What node are you using to write the data? I assume you don't re-read the Parquet files with the Line Reader node. Could you post a workflow with a small data set containing the problematic lines so we can reproduce the problem? It sounds like the error message is very misleading, as you already mentioned.

File formats like CSV do not work well with large data sets or with data that contains special characters such as newlines. The Parquet Writer or ORC Writer might be the better choice in this case.
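To illustrate why line-based text formats struggle here, a small stdlib-only sketch (the column names are made up): a correctly quoted CSV field containing a newline spans multiple physical lines, so any line-oriented tool sees more "rows" than there are logical records. Binary columnar formats like Parquet and ORC store such values losslessly and avoid the ambiguity entirely.

```python
import csv
import io

# Write one record whose "note" field contains an embedded newline.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "note"])
writer.writerow([1, "first line\nsecond line"])
text = buf.getvalue()

# The file has 3 physical lines but only 2 logical records: a line-based
# reader and a CSV-aware reader disagree about how many rows exist.
physical_lines = text.splitlines()
records = list(csv.reader(io.StringIO(text)))

print(len(physical_lines))  # 3 physical lines
print(len(records))         # 2 logical records
```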

Cheers
Sascha


Hi Sascha,
The problem I had was when reading a txt file with the File Reader node and then writing to S3 with the Parquet Writer and ORC Writer nodes. I only used the Line Reader to view the problematic rows and debug.

But yes, it was strange that both the Parquet Writer and ORC Writer nodes gave the same misleading error message.

Thanks!

Hi @DXRX,

You mean that the File Reader node succeeds, but the following Parquet Writer or ORC Writer node fails?

Do you have some test data to reproduce the problem?

Cheers
Sascha
