I am getting the error below while executing the ‘CSV to Spark’ node …
ERROR CSV to Spark 0:3 Execute failed: org.apache.spark.sql.execution.FileSourceScanExec; local class incompatible: stream classdesc serialVersionUID = 1920947604238219635, local class serialVersionUID = -3589590085483687218 (InvalidClassException)
Also attaching log file and workflow screenshots…
knime_log.txt (19.3 KB)
Looking at the knime.log (thanks for providing it), it looks like some Spark-internal data serialization between the Spark driver and executor is failing.
This really only happens if you are using different Java and/or Spark versions between your cluster nodes. These must all be the exact same versions.
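To see why the version mismatch matters, the error message itself can be unpacked: it carries two `serialVersionUID` values, one written into the stream by the sender and one computed from the class loaded locally. The illustrative Python snippet below (not part of KNIME or Spark, just a sketch over the error text from the post) extracts both and shows they disagree, which is exactly what Java deserialization rejects:

```python
import re

# The error text reported by the CSV to Spark node (copied from the post above).
error = (
    "org.apache.spark.sql.execution.FileSourceScanExec; local class incompatible: "
    "stream classdesc serialVersionUID = 1920947604238219635, "
    "local class serialVersionUID = -3589590085483687218"
)

# Pull both serialVersionUID values out of the message. A mismatch means the
# sending JVM (driver) and receiving JVM (executor) loaded the class from
# different Spark builds, so Java deserialization refuses the object.
stream_uid, local_uid = map(int, re.findall(r"serialVersionUID = (-?\d+)", error))
print(stream_uid != local_uid)  # True → the Spark jars on the two JVMs differ
```

If both UIDs were equal, deserialization would succeed; the fix is to make every node of the cluster (and the Livy session) use the identical Spark build.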
Could that be the case?
I am using Spark 2.4, Livy 0.7, and the latest KNIME version (4.2.1). If these are not stable versions of Spark and Livy for KNIME 4.2.1, please let me know the correct ones.
Spark 2.4 and Livy 0.7 should work fine with KNIME 4.2.1. What type and version of cluster are you using? (CDH/HDP/AWS EMR/Azure…)
Could you first load the data into KNIME and see what types are there and what the type mapping says? Also check whether there are strange values such as NaN (not a number).
There are only string and integer types in my data, and no missing values…
still the same error
Have you tried converting the integers to doubles (I am aware this sounds counterintuitive) and then uploading the data? Maybe start with the numbers only.
And could you check what type your integers get converted to?
I have converted all the integer columns to doubles and tried it; the same error still persists…
Did you use the node:
You could set the LOG level to DEBUG and try again; maybe we could get an idea from that. If you could share a sample of the data that is failing, that might also help.
Are you talking about using the ‘Table to Spark’ node instead of the ‘CSV to Spark’ node?
Yes, I am. This could give you more control over the formats. BTW: how did you convert the integers to doubles anyway?
With the help of the Column Rename node.
Column Rename is notorious for messing up formats; you might want to use another converter. But I am not sure this is about integers or doubles.
You might have to map integers to bigint in a big data environment.
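The int-vs-bigint point can be made concrete. Hive/Impala `int` is a 32-bit signed integer while `bigint` is 64-bit, so any column whose values exceed the 32-bit range needs `bigint` on the big-data side. An illustrative Python sketch (the helper name is made up, not a KNIME or Hive API):

```python
# Hive/Impala "int" is a 32-bit signed integer, while "bigint" is 64-bit.
# A column with values beyond the 32-bit range cannot safely map to "int".
INT_MIN, INT_MAX = -2**31, 2**31 - 1
BIGINT_MIN, BIGINT_MAX = -2**63, 2**63 - 1

def fits_hive_type(value: int) -> str:
    """Pick the narrowest Hive integer type that can hold the value (illustrative helper)."""
    if INT_MIN <= value <= INT_MAX:
        return "int"
    if BIGINT_MIN <= value <= BIGINT_MAX:
        return "bigint"
    return "decimal"  # beyond 64 bits

print(fits_hive_type(42))             # int
print(fits_hive_type(3_000_000_000))  # bigint
```

A value like 3,000,000,000 (e.g. a large ID) already overflows `int`, which is one reason the type mapping step deserves a close look.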
OK, I’ll try that too. But for now, can you tell me how to connect the ‘Table to Spark’ node with the ‘HDFS Connection’ node and the ‘Spark Context’ node simultaneously, which is possible with the ‘CSV to Spark’ node as shown in my workflow… because the CSV file I want to use is on HDFS.
Then you could try telling Hive or Impala to treat it as an external table and load that into Spark. You could declare the formats as string, int, bigint, or double.
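A minimal sketch of what such an external-table DDL could look like, built here as a plain Python string so the pieces are visible. The table name, column names, and HDFS path are all hypothetical placeholders; you would submit the resulting statement to Hive/Impala (e.g. via a DB SQL node) with your own values:

```python
# Hypothetical example: declaring an HDFS CSV file as an external Hive table
# with explicit column types, so Spark can read it with a stable schema.
# Table name, columns, and LOCATION below are made up for illustration.
columns = {"name": "string", "age": "int", "user_id": "bigint", "score": "double"}

col_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
ddl = (
    "CREATE EXTERNAL TABLE my_csv_data (\n"
    f"  {col_defs}\n"
    ")\n"
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
    "STORED AS TEXTFILE\n"
    "LOCATION '/user/knime/data/'"
)
print(ddl)
```

Because the table is declared EXTERNAL, Hive only records the schema and points at the existing HDFS directory; dropping the table later would not delete the CSV file.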
(The green block on top)
Has to be the part beneath the green block.