I am currently working on a project for a customer and we are using Spark via Livy in a K8S environment.
Because the deep learning & machine learning models we need are not supported by the Spark nodes, we have to use the PySpark Script node.
I don’t know whether this is a defect in the PySpark Script node or something else.
I preprocess & analyze 30 million rows in the script (about 3 hours) and then write four resulting datasets, each with fewer than 1 million rows, to MinIO from within the script code.
As a temporary test, I tried writing two of the datasets (each under 1 million rows) through the PySpark Script output ports to Spark to Parquet and Spark to ORC nodes, but this also takes a long time.
Is this a defect in the PySpark Script node, or do we need to look into the customer’s current network conditions? I would like to understand what is causing the poor performance.
Furthermore, is there any way to improve the write performance?
I’m sorry if I’m missing any information. Exporting the four datasets (each under 1 million rows) to MinIO via the API from code within the script takes about 10 hours.
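For illustration only, the write step inside the script is roughly like the sketch below. The actual code exports the datasets through the MinIO API; this sketch assumes Spark's s3a connector instead, and the endpoint, bucket, credentials, and the `spark` SparkSession variable are placeholders rather than the real setup:

```python
# Simplified sketch of the write step inside the PySpark script.
# Endpoint, bucket, and credentials are placeholders, not the real configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "http://minio.example.svc.cluster.local:9000")
hadoop_conf.set("fs.s3a.access.key", "ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "SECRET_KEY")
hadoop_conf.set("fs.s3a.path.style.access", "true")

# Each result DataFrame (under 1 million rows) is written out to a MinIO bucket.
result_df.write.mode("overwrite").parquet("s3a://result-bucket/dataset_1")
```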
There might be many reasons why your PySpark script is slow. Can you provide more details?
Can you at least post a screenshot of the workflow if you cannot share the code? As a quick test, you might like to try the following:
Replace the PySpark Script node with a new one and keep the default code in the snippet. This means no custom code at all, only KNIME nodes; then check whether this runs faster.
Replace any Table to Spark or Spark to Table nodes with Parquet to Spark and Spark to Parquet nodes, to read/write the data inside the cluster.
Make sure the input data has enough partitions so it can be read in parallel. Depending on its size, it might be useful to split the input into multiple files.
Use the Spark Repartition node to increase the number of partitions; the right count depends on your PySpark script and the number of executors (see the sketch after this list for how the same idea looks inside the script).
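To illustrate the last two suggestions, here is a minimal sketch of what the inside of the PySpark Script node could look like, assuming the node's default `dataFrame1` / `resultDataFrame1` variables; the partition count is just a placeholder to adapt to your data size and number of executors:

```python
# Check how many partitions the input currently has.
print("input partitions:", dataFrame1.rdd.getNumPartitions())

# Increase the partition count so all executors can work in parallel;
# a common starting point is a few partitions per executor core.
repartitioned = dataFrame1.repartition(64)

# ... preprocessing / model scoring goes here ...

# Hand the result back through the output port, so a downstream
# Spark to Parquet node can write it inside the cluster.
resultDataFrame1 = repartitioned
```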