Spark to Parquet: write to S3 bucket

When processing data on a Hadoop (HDP 2.6) cluster, I try to write to S3 (e.g. with Spark to Parquet, Spark to ORC or Spark to CSV). KNIME shows that the operation succeeded, but I cannot see any files at the defined destination when running “aws s3 ls” or using the “S3 File Picker” node. Instead, files named “block_{string_of_numbers}” are written to the root of the S3 bucket.

(screenshot: block_files)

Additionally, a file structure is created in another “noname” directory that I did not define, but with the right file name:
(screenshot: file_structure)

These files show up as 1 KB.

What is even more strange, when using “Parquet to Spark” I can read the file from the proper target destination (the one defined in the “Spark to Parquet” node), but as I mentioned, I cannot see it with the “S3 File Picker” node or the “aws s3 ls” command.

Summing up: the name of the file and its target folder in the S3 bucket are different from what is indicated in the nodes.
I’m using KNIME 3.5.3.

According to the Hortonworks docs https://hortonworks.com/tutorial/manage-files-on-hdfs-via-cli-ambari-files-view/section/3/ , the prefix s3a:// enables Hadoop to access S3 storage. However, when configuring the S3 connection, the root URL is defined as s3://access_key@region .
Could this be related to the issue?

Any suggestions would be appreciated

Hi @mmm

ORC and Parquet “files” are usually folders (hence “file” is a bit of a misnomer). This has to do with the parallel reading and writing of DataFrame partitions that Spark does.

On top of that, S3 is not a real file system, but an object store. S3 only knows two things: buckets and objects (inside buckets). Physically, there is no such thing as “folders” inside a bucket. But if an object’s ID contains a forward slash, most S3 browsing tools (including KNIME) will display a folder hierarchy inside the bucket.
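
To make that concrete, here is a minimal sketch that lists the raw object keys in a bucket (assuming the AWS SDK for Java v1; the bucket name is a placeholder). Keys such as “spark_orc/part-…” are plain strings, and the slash is only interpreted as a folder separator by browsing tools:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class ListRawKeys {
        public static void main(String[] args) {
            // Placeholder bucket name -- replace with your own bucket.
            String bucket = "my-example-bucket";

            // Credentials come from the default provider chain
            // (environment variables, ~/.aws/credentials, instance profile, ...).
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // S3 returns a flat list of keys. Any "folders" shown by a browser
            // are just the part of the key before a forward slash.
            for (S3ObjectSummary summary : s3.listObjectsV2(bucket).getObjectSummaries()) {
                System.out.println(summary.getKey()); // e.g. "spark_orc/part-00000-....orc"
            }
        }
    }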

When you use “Spark to ORC” and tell it to write into S3 at “/X/Z”, what actually gets created is:

  1. A bucket called “X” (if you have the permissions and if it does not already exist)
  2. A number of objects inside that bucket with ID “Z/part-somenumber-somerandomstring.orc”.

If you just put “/spark_orc” into the dialog of the Spark to ORC node, then it will try to create a bucket called “spark_orc” and then put objects with ID “part-somenumber-somerandomstring.orc” in there. I guess this is what happened here.
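
For illustration, the write such a node performs is roughly equivalent to the following sketch (assuming Spark 2.x and the s3a filesystem; the bucket “X”, the input path and the key prefix “Z” are hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class WriteOrcToS3 {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("s3a-orc-demo").getOrCreate();

            // Hypothetical input; any DataFrame would do here.
            Dataset<Row> df = spark.read().parquet("s3a://X/some_input");

            // Writing to "s3a://X/Z" does not create a single file called "Z".
            // It creates several objects in bucket "X" whose keys start with "Z/",
            // e.g. "Z/_SUCCESS" and "Z/part-somenumber-somerandomstring.orc".
            df.write().format("orc").save("s3a://X/Z");

            spark.stop();
        }
    }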

Best,
Björn

Hi @bjoern.lohrmann
Thank you for your response.
I think this is not exactly what happened. Let me clarify with some screenshots.

The dialog of the Spark to ORC node looks like this:

The output of that is written to the defined bucket “/…-prod-hortonworks-0/” in the following way:

So the files containing the data (block_somenumberstring) are written to the bucket outside the defined structure:
(screenshot: block_files_ls)

The structure defined in the node dialog contains only empty files:

Summing up, the problem is that the files that contain the data (“block_somenumberstring”) are written with undefined names and are not located in the defined object (“folder”).

Kind regards

Hi @bjoern.lohrmann , I just wanted to ask if you have managed to take a look at this? :slight_smile:
I would appreciate any suggestions
Kind regards

Hi @mmm

Sorry for the long delay. Which Spark version are you using?

I am not sure how to go about this. All we are doing to write the ORC/Parquet files is to pass a URL of the form:

dataFrame.write().format("orc").save("s3://…-prod-hortonworks-0/spark_orc_folder/spark_orc_name")

(We are not passing the access_key@region part into the S3 URL. Things like the access key should already be part of your cluster’s hdfs-site.xml.)
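
For completeness, if those s3a credentials are missing from the cluster configuration, they can also be set programmatically on the Hadoop configuration that Spark uses. A sketch, assuming Spark 2.x and the standard fs.s3a.* property names (the values are placeholders):

    import org.apache.spark.sql.SparkSession;

    public class ConfigureS3aCredentials {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("s3a-config-demo").getOrCreate();

            // Standard Hadoop s3a properties; on a cluster these would normally be
            // set in the Hadoop site configuration. The values are placeholders.
            spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
            spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

            spark.stop();
        }
    }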

If that doesn’t work properly, this could be an issue with how Spark writes to S3.

Best,
Björn
