Spark to Parquet node not working <resolved>

Hi

I have used a couple of Hive to Spark nodes followed by a join, and now I want to store the result in HDFS in Parquet format. But when I execute the Spark to Parquet node I get the following error:

ERROR Spark to Parquet     4:74       org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 11.0 failed 4 times; aborting job
ERROR Spark to Parquet     4:74       org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation: Aborting job.
ERROR Spark to Parquet     4:74       org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Job job_201707111010_0000 aborted.
ERROR Spark to Parquet     4:74       Execute failed: Failed to create output path with name 'hdfs://cluster-01.example.com:8020/user/rghadge/temp/spark_test_parquet'. Reason: Job aborted.


Please advise.

Thanks,

Rahul G.

Hi Rahul,

Since a ticket with a workflow screenshot was opened for this in parallel in our support system, I assume this is the same issue as in the ticket?

If so, I noticed you were using the Spark Joiner node before Spark to Parquet. There is a known issue in the Spark Joiner when "Filter duplicates" is selected under duplicate column handling: downstream nodes can then fail with the error message you are seeing. You can work around the Spark Joiner issue by manually selecting the input columns that you want in the join output, which is roughly what the sketch below does in plain Spark terms.
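For reference, this is approximately what that workaround amounts to in Spark code outside of KNIME: after the join, explicitly select (and, where needed, rename) the columns you want, so no duplicate column names reach the Parquet writer. This is only a sketch, shown with the Spark 2.x SparkSession API (the same idea applies with the older SQLContext API); the table names, join key, and column names are made up for illustration, and only the output path is taken from the error message above.

import org.apache.spark.sql.SparkSession

object JoinAndWriteParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JoinAndWriteParquet")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical inputs; in the KNIME workflow these come from the Hive to Spark nodes.
    val left  = spark.table("db.left_table")
    val right = spark.table("db.right_table")

    // Join on a shared key column.
    val joined = left.join(right, Seq("id"))

    // Keep only explicitly chosen columns and rename clashing ones,
    // so the result carries no duplicate column names.
    val result = joined.select(
      left.col("id"),
      left.col("value").as("left_value"),
      right.col("value").as("right_value")
    )

    // Write the cleaned result to HDFS in Parquet format.
    result.write.parquet("hdfs://cluster-01.example.com:8020/user/rghadge/temp/spark_test_parquet")

    spark.stop()
  }
}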

Hope that helps,

Björn

Hi Bjoern,

Yes, it is the same one. Thank you very much.

-RG
