Large memory consumption issue with "Table to Spark" node

Hello KNIME Support.

I have a question about memory issues with the Table to Spark and Spark to Table nodes.

I am using Spark via Livy in my current project. I want to take 10 million rows of Parquet data, load them into Spark with Table to Spark, then bring them back with Spark to Table and write them out as CSV.

However, even when I set both the Spark driver and executor memory in Livy to sufficient values (100 / 120), I get an Out of Memory error message.


The example workflow above was created just to reproduce the error; in reality there are many preprocessing and analysis nodes in between, including PySpark Script nodes.

What I would like to know is: does Table to Spark inherently consume a lot of memory, and what approach should I use to perform this conversion within the available memory?

Exporting to CSV is a hard requirement; there is no other option. Any help would be appreciated.

Hi @JaeHwanChoi,

The Table to Spark and Spark to Table nodes should not be used to exchange large amounts of data, since they transfer the full data set between the cluster and the KNIME workflow.

You can use the Spark to CSV node to write the CSV inside the cluster, or alternatively the Spark to Parquet node. You can then read the Parquet file back into KNIME, but I assume you would rather use the Spark nodes to process your data in the cluster and only return aggregated data.
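Since you already have PySpark Script nodes in the workflow, the same idea in plain PySpark looks roughly like the sketch below (the HDFS paths are placeholders, not from your workflow; inside a PySpark Script node KNIME already provides the Spark session and the input DataFrame, so only the final write step would be needed there):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Parquet data directly on the cluster (placeholder path)
df = spark.read.parquet("hdfs:///path/to/input_parquet")

# ... your preprocessing / PySpark logic goes here ...

# Write the CSV on the cluster, which is what the Spark to CSV node does;
# the 10 million rows never have to be collected into a KNIME table.
(df.write
   .option("header", True)
   .mode("overwrite")
   .csv("hdfs:///path/to/output_csv"))
```

This way the CSV is written by the Spark executors on the cluster, and only small, aggregated results need to come back through Spark to Table.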

Cheers,
Sascha
