How to process big data with the Table to Spark node

I am new to using the Spark big data extension. I have a large JSON file and want to transfer this KNIME table of JSON into a Spark DataFrame.
I have already created a Spark job server connection like this:
But when I input my JSON table to the "Table to Spark" node, I get a "Java heap space" error, since this node needs to serialize the JSON data table to a temporary file.

I know there is a way to increase KNIME's memory limit, but that is not a real solution if I have even larger data to process.
So I would like to ask: is there any way in KNIME that allows users to load big data into a Spark DataFrame?

Hi @DerekJin,
The “Table to Spark” node should only be used for testing purposes; it cannot handle big tables.
The way to go is to upload the JSON file into your cluster file system (e.g. HDFS); you can do this with the Upload node.
Afterwards you can use the HDFS path to read the file into Spark with the “JSON to Spark” node.

Best regards, Mareike


Thank you for your advice, that really helps.


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.