How to read/write data in HDFS using the "KNIME Spark Executor" extension?

The description of the "KNIME Spark Executor" extension clearly states that it can read/write data in HDFS, yet I haven't been able to figure out how.

So far, I can only access HDFS using the HDFS connector from the "Big Data Connectors" extension.

Any help? Thanks in advance.

Hi,

currently, the only way to do this is via the "Spark Java Snippet (Source)" node. If you add the node, open its configuration dialog, and select the Templates tab, there is a template to read a plaintext file from HDFS. If you select the template and click "Apply", you can edit the HDFS path as well as the template code. Note that this template does not yet parse the file into columns (it uses each whole row as a single column value).
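
For reference, the core of what that template does boils down to a plain Spark textFile call. This is just a standalone sketch of the idea, not the exact snippet code; the HDFS host, port, and file path are placeholders you would replace with your cluster's values:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsTextRead {
    public static void main(String[] args) {
        // Placeholder HDFS URL: replace the namenode host, port, and path with your own.
        String path = "hdfs://namenode:8020/tmp/example.txt";

        SparkConf conf = new SparkConf().setAppName("HdfsTextRead");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each RDD element is one whole line of the file; nothing is parsed
        // into columns yet, which is what the template does as well.
        JavaRDD<String> lines = sc.textFile(path);
        System.out.println("Line count: " + lines.count());

        sc.stop();
    }
}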

What kind of file are you trying to read?

I apologize for the inconvenience. We are currently looking into adding a proper Spark file reader node that handles parsing for some standard cases, such as CSV or Parquet files.

Best,

Björn Lohrmann

Thanks for your reply, bjoern.lohrmann.

Currently we are trying to read CSV files from HDFS and turn them into KNIME tables, just like the CSV Reader node does.

I will take a look at the Spark Java Snippet you suggested.

Hi,

I think for your case (CSV files) we even have an example workflow that you can use as a starting point. You can find it in the Node Guide:

https://www.knime.org/nodeguide/big-data/spark-executor/modularized-spark-scripting

All node guides are also directly available from the example server in KNIME Analytics Platform.
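
If you just want to see the basic idea before opening the workflow: the column-splitting step is essentially a map over the lines of the file. Here is a standalone sketch extending the text read above (again, the path is a placeholder, and note that a naive split does not handle quoted fields that contain commas; a real CSV parser is needed for those):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsCsvRead {
    public static void main(String[] args) {
        // Placeholder path; adjust to your CSV file in HDFS.
        String path = "hdfs://namenode:8020/tmp/example.csv";

        SparkConf conf = new SparkConf().setAppName("HdfsCsvRead");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the file line by line, then split each line into columns.
        // split(",", -1) keeps trailing empty fields; quoted fields with
        // embedded commas would still need a proper CSV parser.
        JavaRDD<List<String>> rows = sc.textFile(path)
                .map(line -> Arrays.asList(line.split(",", -1)));

        rows.take(5).forEach(System.out::println);
        sc.stop();
    }
}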

Best,

Björn

Great! I will check it out. Thanks!