I am contacting you because I am experiencing an error when using PySpark with the “Create Local Big Data Environment” node on my personal PC.
After creating a Spark context with the “Create Local Big Data Environment” node, connecting it to a PySpark Script node, and configuring a Spark session again inside the script, the following warning appears:
“C:\Users.eclipse\903338280_win32_win32_x86_64\plugins\org.knime.bigdata.spark.local_4.7.0.v202211082334\libs\pyspark.zip\pyspark\sql\context.py:125: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.”
I see from the warning that something is deprecated, but as I understand it, SparkSession is the newer API compared to SparkContext. I also need to configure a session inside the script in order to use Spark-related code, so this step is essential. Since the context is already configured by “Create Local Big Data Environment”, am I getting an error because of a double configuration?
Or is there a problem with the local configuration of the “Create Local Big Data Environment” node?
What steps should I take to resolve the above error? Your help will be greatly appreciated.
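For context, the pattern the warning points to is attaching to the already-running context via getOrCreate() instead of constructing a second session. A minimal sketch, assuming the standard pyspark API is available inside the node (the function name here is illustrative, not a KNIME API):

```python
def get_session():
    # Minimal sketch: attach to the Spark context KNIME already started
    # instead of building a second session. Assumes the standard pyspark
    # package is on the path, as it is inside the PySpark Script node.
    from pyspark.sql import SparkSession
    return SparkSession.builder.getOrCreate()
```

getOrCreate() returns the existing session if one is bound to the running context, which avoids the deprecated SQLContext code path that triggers the FutureWarning.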
I am currently working on integrating with MinIO inside a PySpark Script node.
The deprecation warning above (“Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead”) was fixed by modifying the “org.knime.bigdata.spark.local_4.7.0.v” folder inside KNIME, but now I get the following error:
Traceback (most recent call last):
  File "C:\Users\JaeHwan\AppData\Local\Temp\pythonScript_87ffb01b_cd3f_4c05_bf09_a803072190c3585250129945783620.py", line 79, in <module>
    df = spark.read.csv('s3a://my_bucket/test.csv', header=True)
  File "C:\Users\JaeHwan\.eclipse\903338280_win32_win32_x86_64\plugins\org.knime.bigdata.spark.local_4.7.0.v202211082334\libs\pyspark.zip\pyspark\sql\readwriter.py", line 410, in csv
  File "C:\Users\JaeHwan\.eclipse\903338280_win32_win32_x86_64\plugins\org.knime.bigdata.spark.local_4.7.0.v202211082334\libs\py4j-0.10.9.3-src.zip\py4j\java_gateway.py", line 1321, in __call__
  File "C:\Users\JaeHwan\.eclipse\903338280_win32_win32_x86_64\plugins\org.knime.bigdata.spark.local_4.7.0.v202211082334\libs\pyspark.zip\pyspark\sql\utils.py", line 111, in deco
  File "C:\Users\JaeHwan\.eclipse\903338280_win32_win32_x86_64\plugins\org.knime.bigdata.spark.local_4.7.0.v202211082334\libs\py4j-0.10.9.3-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o56.csv.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
Crucially, when I run the same code through the PySpark module in a Jupyter notebook, it connects to MinIO successfully, but when I move it into the PySpark Script node in KNIME, I still get the error.
In addition, you said that the Spark configuration can be specified in the Advanced tab of the Create Local Big Data Environment node dialog, and I set it as shown below.
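For reference, the s3a settings typically needed for MinIO can be expressed as key/value pairs, which is the shape those custom Spark settings take. The helper below is a sketch; the endpoint and credentials are placeholders, not values from this thread:

```python
def minio_spark_conf(endpoint, access_key, secret_key):
    """Return the s3a properties typically needed to reach MinIO.

    These keys come from the hadoop-aws S3A connector. In plain PySpark each
    pair would be passed via SparkSession.builder.config(key, value); in the
    node dialog each pair is entered as one custom Spark setting.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is usually addressed with path-style URLs, not virtual hosts
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }

# Placeholder values for illustration only:
conf = minio_spark_conf("http://localhost:9000", "minio", "minio123")
```

Note that these settings only take effect if the matching hadoop-aws and AWS SDK jars are on the classpath, which is exactly where the NoSuchMethodError above points.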
The Hadoop version shipped with the Local Big Data Environment is somewhat outdated, and reading from S3 with it is not recommended. You might switch to a real Spark cluster, or use the S3/CSV nodes from KNIME.
You said that connecting to S3 is not recommended because the Hadoop version provided by “Create Local Big Data Environment” is not the latest. Should I understand that there is currently no way to connect to S3 through “Create Local Big Data Environment”?
Also, you mentioned that we can use .spark in the Python Script source node. If we want to apply additional settings to this Spark session, is it possible to customize it in the Advanced tab of “Create Local Big Data Environment”?
I used PySpark code inside a Python Script node, but when I run that workflow on the KNIME Server, does it use Python resources instead of Spark resources?
Also, if I use “ORC to Spark” or “Spark to ORC”, the PySpark Script node only offers a table-type output for the connection. I can’t output a port such as a Pickled Object in addition to a table port in the PySpark Script node the way I can in the Python Script node, right?
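As an aside, one possible workaround (not an official KNIME feature) is to smuggle a picklable Python object through a table port by encoding it into a single string cell and decoding it downstream. A standard-library round-trip sketch:

```python
import base64
import pickle


def object_to_cell(obj):
    """Encode any picklable object as an ASCII string for a table cell."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def cell_to_object(cell):
    """Decode a string cell produced by object_to_cell back into the object."""
    return pickle.loads(base64.b64decode(cell.encode("ascii")))
```

Downstream, a Python Script node can call cell_to_object on that column to recover the original object. Only do this between nodes you control, since unpickling untrusted data is unsafe.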
Note that the Local Big Data Environment node is only meant for playing around with Spark; it is not recommended for production. In production, you should use Create Spark Context (Livy) instead and connect to a real Spark cluster.
On a real cluster, you can use the CSV to Spark node or PySpark code to read from S3/MinIO.
The error is that when I run the code to integrate with MinIO in the PySpark Script connected to the “Create Local Big Data Environment” node, I get the script error message “Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead”.
This didn’t happen in my local Jupyter environment, so I compared the jar files used in my local Spark setup with the jar files in the KNIME plugin, and there was a big difference in versions.
So I suspect it is a version issue with these jar files.
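One way to check such a mismatch is to list the AWS-related jars in the plugin’s libs folder and compare their versions with the local Spark installation. The sketch below assumes jar files follow the usual name-version.jar convention; the libs path comes from the traceback above:

```python
import re
from pathlib import Path


def find_s3_jars(libs_dir):
    """Map each hadoop-aws / aws-java-sdk jar in libs_dir to its version.

    A java.lang.NoSuchMethodError on TransferManager usually means the
    hadoop-aws jar was compiled against a different AWS SDK version than
    the aws-java-sdk jar actually on the classpath.
    """
    pattern = re.compile(r"(hadoop-aws|aws-java-sdk[\w-]*)-(\d[\w.]*)\.jar")
    found = {}
    for jar in Path(libs_dir).glob("*.jar"):
        match = pattern.match(jar.name)
        if match:
            found[match.group(1)] = match.group(2)
    return found
```

Running this on both the KNIME plugin libs folder and the local Spark jars folder makes any version gap visible at a glance.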
KNIME already contains an S3 connector, and this is the way you should read files from S3/MinIO in KNIME. The Local Big Data Environment does not support S3.
“PySpark module in the Jupyter notebook”
How do you run your Jupyter notebooks? Do you have a working Spark cluster? If so, please use the Create Spark Context (Livy) or Create Databricks Environment nodes instead of the Local Big Data Environment node! I guess this might solve all the mentioned problems, and you should be able to read from S3 in a Spark cluster setup.
If you don’t have a cluster, please consider using the normal Python nodes and the S3 connector nodes to read the data. You should never use the Local Big Data Environment in production.
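Following that suggestion, MinIO can also be read without Spark, either with the S3 Connector nodes or directly in a Python node with boto3. A hedged sketch, where the endpoint, credentials, bucket, and key are all placeholders rather than values from this thread:

```python
import io


def minio_client_kwargs(endpoint, access_key, secret_key):
    # Keyword arguments for boto3.client(**kwargs); MinIO speaks the S3 API,
    # so only the endpoint_url differs from a normal AWS setup.
    return dict(
        service_name="s3",
        endpoint_url=endpoint,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


# Usage (requires boto3 and pandas, plus a reachable MinIO server):
# import boto3
# import pandas as pd
# s3 = boto3.client(**minio_client_kwargs("http://localhost:9000",
#                                         "minio", "minio123"))
# obj = s3.get_object(Bucket="my_bucket", Key="test.csv")
# df = pd.read_csv(io.BytesIO(obj["Body"].read()))
```

This keeps the data flow entirely on the Python side, so no Spark context or hadoop-aws jars are involved at all.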