Inquiry about driver memory requirements when using the PySpark Script node

Hello KNIME Support

I would like to ask about the resource requirements of the PySpark Script node.

We currently have a workflow that performs ML analysis on big data. It uses two PySpark Script nodes, one for preprocessing and one for modeling.

We connect via Livy, but when I feed about 3 million rows into the simplest model, Random Forest, the Spark driver uses more than 65 GB of memory (1 executor, 2 cores).

Datasets larger than 3 million rows will require even more memory. Is it normal for Spark to consume this many resources?

Or is memory usage so high because the output of one PySpark Script node is not released when the node finishes, but is kept until the end of the analysis?

Can anyone tell me why the driver uses so much memory, or suggest a way to reduce it?

Is there any reference for the approximate Spark resources required per amount of data?

Any help would be greatly appreciated.

Hi @JaeHwanChoi,

Are you using the PySpark API to create the model? You can find the documentation and an example here: Classification and regression - Spark 3.3.0 Documentation

Without the PySpark API, there is no benefit to running your code inside Spark/Livy.
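
For reference, here is a minimal training sketch using the PySpark ML API from the linked documentation; the input DataFrame `df` and the column names are placeholders:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Assemble the raw numeric columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["col_a", "col_b"], outputCol="features")

# Training runs on the executors; only the fitted model (small metadata)
# comes back to the driver.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

model = Pipeline(stages=[assembler, rf]).fit(df)   # df is a Spark DataFrame
predictions = model.transform(df)                  # predictions stay distributed
```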

Cheers,
Sascha

Hi @sascha.wolke,

Of course, the analysis code uses the PySpark API. The flow is similar to the example above, except that I’m running it on big data.

Do you have any experience running PySpark scripts on data of this size? Also, I would like to know how much driver and executor memory Spark needs; is there any reference for this?

I’m asking because running just 5 million rows seems to use so much Spark driver memory (65 GB+) that, resource-wise, it doesn’t make much sense to use the Spark nodes.

Thanks

Hi @JaeHwanChoi,

Do you have some example code from your PySpark snippet?

This high memory usage in the driver sounds like something is going wrong; the computation and high memory consumption should happen in the executors instead.

What nodes are you using after the PySpark node? Do you export a huge amount of data using the Spark to Table node after your snippet? Maybe double-check by writing your data with the Spark to Parquet node instead.
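
To illustrate the difference (`df` and the output path are placeholders): anything that materializes rows on the driver will blow up its memory, while a distributed write keeps the load on the executors:

```python
# Driver-heavy: both calls copy every row into driver memory. With millions
# of rows, this alone can explain 65 GB+ of driver usage.
pdf = df.toPandas()
rows = df.collect()

# Distributed: each executor writes its own partitions and the driver only
# coordinates the job (roughly what the Spark to Parquet node does).
df.write.mode("overwrite").parquet("hdfs:///tmp/output")  # example path
```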

Cheers,
Sascha

Thanks for the answer, @sascha.wolke! Unfortunately, I can’t share the code in question as it was developed by a partner, but the algorithms we use are Random Forest and GBM models.

I’m guessing that the driver’s high memory usage is caused by importing and exporting large data. However, the input of the PySpark Script node does not use toPandas(), and the imported data is analyzed directly as a DataFrame/RDD.

If so, the problem must be in the output: when I analyze about 5 million rows, the outputs are around 5 million and 3 million rows, exported through two ports. I wrap the data in a Spark DataFrame before exporting it, but I suspect the driver is overloaded by the process of exporting that much data through the output ports of KNIME’s PySpark Script node.

Because the derived data already uses a lot of memory during preprocessing and analysis, Spark to CSV and Spark to Parquet also give memory errors. Previously I used Spark to Table, but it required a lot of memory as well, so I removed it and switched to Spark to CSV as described above.

In other words, my question is: when analyzing large data with the PySpark Script node and exporting large data through its output ports, is driver memory overused because the output has to be handed over as a Spark DataFrame, the way KNIME’s PySpark Script node returns results? And if so, how can I avoid this overuse while still exporting large data?

I would be very grateful for a quick response; the excessive memory usage is causing major usability problems.

Hi @JaeHwanChoi,

I wrap the data in a Spark DataFrame before exporting it

Does this mean you create the Random Forest and GBM models using some Python framework and then export them using the Spark API?
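
For example, a pattern like the following (purely illustrative, using the node’s default dataFrame1/resultDataFrame1 variables and placeholder column names) would put all the load on the driver:

```python
from sklearn.ensemble import RandomForestClassifier

# Entire dataset is copied into driver memory here:
pdf = dataFrame1.toPandas()

feature_cols = ["feature_a", "feature_b"]  # placeholder column names
model = RandomForestClassifier().fit(pdf[feature_cols], pdf["label"])
pdf["prediction"] = model.predict(pdf[feature_cols])

# "Wrapping the data in a spark.dataframe": a second full copy on the driver
# (assuming a SparkSession named spark is available in the node).
resultDataFrame1 = spark.createDataFrame(pdf)
```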

I can’t share the code in question as it was developed by a partner

Maybe the partner knows more about the typical Spark memory consumption of their PySpark code.

In other words, my question is: when analyzing large data with the PySpark Script node and exporting large data through its output ports, is driver memory overused because the output has to be handed over as a Spark DataFrame, the way KNIME’s PySpark Script node returns results? And if so, how can I avoid this overuse while still exporting large data?

This is how Apache Spark works; it is not related to KNIME. You have to make sure your data stays in Spark the whole time. Otherwise, there is no benefit to using Spark.
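
As a minimal sketch, assuming the node’s default dataFrame1/resultDataFrame1 variables and placeholder column names, keeping the data in Spark end to end looks like this:

```python
from pyspark.sql import functions as F

# Every step below is a lazy, distributed transformation; the driver only
# holds the query plan, never the rows.
cleaned = (
    dataFrame1
    .dropna()
    .withColumn("amount", F.col("amount").cast("double"))  # placeholder column
)

# Assigning a Spark DataFrame to the output variable keeps it distributed;
# downstream nodes such as Spark to Parquet consume it without pulling the
# rows into the driver.
resultDataFrame1 = cleaned
```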

There is a Spark Random Forest Learner node in KNIME that does not require any custom PySpark code: Spark Random Forest Learner – KNIME Community Hub

Cheers,
Sascha
