Path error when putting files into MinIO using the MinIO API from PySpark.

Hi KNIME Support.

I am trying to connect Livy and MinIO so I can use the MinIO API within PySpark to export a file to a specified MinIO path.

The file storage path I specified is 's3a://result_bucket/Check_reduce.log', but when I run the PySpark script, I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/spark-3.0.1-bin-hadoop-3.2.0-cloud-scala-2.12/work-dir/s3a:/result_bucket/Check_reduce.log'

It seems the file cannot be found because the path is automatically prefixed with "/opt/spark-3.0.1-bin-hadoop-3.2.0-cloud-scala-2.12/work-dir", which I never specified. Do you have any idea why this prefix is added? And what is the solution?

I’m really in a hurry. A quick answer would be appreciated.

Hi @JaeHwanChoi,

Welcome to the KNIME Community Forum!

Can you post some example code from your PySpark snippet?

Cheers,
Sascha

Hi, @sascha.wolke

Here is the problematic code from the PySpark Script I am using in connection with Livy.

file_path = 's3a://minio_result_bucket/make_file_log.txt'
with open(file_path, "w") as my_file:
    my_file.write("Hello world \n")
    my_file.write("I hope you're doing well today \n")
    my_file.write("This is a text file \n")
    my_file.write("Have a nice time \n")

This is example code: I want to create a file in MinIO and write a few log-like sentences into it.

I've tested a number of things. While it is possible to simply export a finished file, the code does not seem to automatically create a file at the MinIO path.

When I run the code, it says it cannot find the path given in file_path. Of course, I also applied the MinIO connection settings as shown below.

spark.conf.set("spark.hadoop.fs.s3a.endpoint", "url")
spark.conf.set("spark.hadoop.fs.s3a.access.key", "key")
spark.conf.set("spark.hadoop.fs.s3a.secret.key", "key")
spark.conf.set("spark.hadoop.fs.s3a.path.style.access", True)
spark.conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled", True)
spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

Any answers would be appreciated.

Hi @JaeHwanChoi,

Python's built-in open() writes to the local file system and does not understand s3a:// URLs, which is why the path gets resolved relative to the Spark working directory. You have to use Spark to write the file instead, or use the output port and a KNIME node.

df.write.parquet("s3a://minio_result_bucket/some-path.parquet")
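For example, here is a minimal sketch (assuming an existing SparkSession named spark and that your fs.s3a.* settings for the MinIO endpoint are already in place) that writes your example log lines through Spark instead of open():

# Minimal sketch: write the log lines via Spark instead of Python's open().
# Assumes a SparkSession named `spark` and fs.s3a.* settings configured for MinIO.
lines = [
    "Hello world",
    "I hope you're doing well today",
    "This is a text file",
    "Have a nice time",
]
df = spark.createDataFrame([(line,) for line in lines], ["value"])

# Spark writes a directory of part files, not a single file,
# so the path below names a folder inside the bucket.
df.write.mode("overwrite").text("s3a://minio_result_bucket/make_file_log")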

Cheers,
Sascha


Hi, @sascha.wolke

Does this mean I can't use Python's built-in file writing at all?

After all, I can only export via the output port of the PySpark or Python Script node, or export the finished file via code like df.write.parquet("s3a://minio_result_bucket/some-path.parquet"), right?

Thanks

Hi @JaeHwanChoi,

Your PySpark code is usually executed on some executor inside your cluster.

To write files, you can use the output port of the PySpark node and write the files using KNIME nodes. To improve performance, don't use the Spark to Table node; use a writer node like Spark to Parquet to write your files.

If you would like to do it in code, you can use your data frame and the Spark writer: Generic Load/Save Functions - Spark 3.5.0 Documentation
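For example, a minimal sketch of the generic writer (assuming a data frame named df; the formats and s3a paths below are just placeholders):

# Generic save: pick a format and a target path in the bucket.
df.write.format("parquet").mode("overwrite").save("s3a://minio_result_bucket/some-path.parquet")

# Or CSV, if you want a plain-text-like result you can read outside of Spark.
df.write.format("csv").option("header", True).mode("overwrite").save("s3a://minio_result_bucket/some-path-csv")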

Cheers,
Sascha

