Reading AVRO files in KNIME

Hi,

This is a continuation of the issue I raised here: AVRO file reader?
I’m afraid I still can’t load local AVRO files using the provided solution. The Avro to Spark node throws the following error:

2020-10-28 14:01:12,427 : ERROR : KNIME-Worker-12-Avro to Spark 0:3 : : Node : Avro to Spark : 0:3 : Execute failed: An error occured. For details see View > Open KNIME log.
java.lang.NullPointerException
at org.knime.bigdata.spark2_4.api.TypeConverters.getConverter(TypeConverters.java:121)
at org.knime.bigdata.spark2_4.api.TypeConverters.convertSpec(TypeConverters.java:162)
at org.knime.bigdata.spark2_4.jobs.genericdatasource.GenericDataSource2SparkJob.runJob(GenericDataSource2SparkJob.java:82)
at org.knime.bigdata.spark2_4.jobs.genericdatasource.GenericDataSource2SparkJob.runJob(GenericDataSource2SparkJob.java:1)
at org.knime.bigdata.spark.local.wrapper.LocalSparkWrapperImpl.runJob(LocalSparkWrapperImpl.java:127)
at org.knime.bigdata.spark.local.context.LocalSparkJobController.lambda$1(LocalSparkJobController.java:92)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

I’m not sure I’ll be able to post the AVRO I’m trying to process here, because it’s semi-sensitive and is around 300MB. I might be able to post PyCharm’s interpretation of the schema portion.

As an aside, I still think it would be easier if we had a Read AVRO node that just uses the relevant Apache libraries, rather than mucking about with Spark.

Cheers,

Richard

Hi @rsherhod -

We do have a couple of tickets open about this feature request for reading local AVRO files (AP-8056; BD-573). I’ll add a +1 from you on those tickets. Sorry for the trouble.


Do you have a sample file I could try?

@Daniel_Weikert - I’ll try to find one that doesn’t contain proprietary data.

@ScottF - That would be really useful. Presumably such a node would use the schema portion of the AVRO file to define a table spec, and the file's contents would then be written out into columns? It would also be useful if it could just output the contents as a JSON column. Most JSON data I encounter would produce way more columns than I'd ever need, which is why I rarely use the JSON to Table node. Instead I use the excellent JSON Path node to be selective about what I extract. The same would be true for AVRO.
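For reference, a node like that could sit directly on top of the Apache Avro Java API: the file header carries the schema (which could drive the table spec), and each record can be serialized as JSON for a JSON Path workflow. A minimal sketch of the idea, assuming the org.apache.avro jar is on the classpath; the file path and class name are illustrative only, not KNIME API:

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroDump {
    public static void main(String[] args) throws IOException {
        File avroFile = new File("/tmp/sample.avro"); // illustrative path

        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(avroFile, datumReader)) {

            // The schema is embedded in the file header -- this is what a
            // Read AVRO node could use to build the KNIME table spec.
            Schema schema = fileReader.getSchema();
            System.out.println(schema.toString(true));

            // GenericRecord#toString() renders each record as JSON, so a node
            // could also expose every row as a single JSON column.
            while (fileReader.hasNext()) {
                GenericRecord record = fileReader.next();
                System.out.println(record.toString());
            }
        }
    }
}
```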

Hi @rsherhod,

You can try the Spark DataFrame Java Snippet (Source) node and run the following code (replace /tmp/sample.avro with the real path):

// Load the Avro file via the Databricks Avro data source
final Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("/tmp/sample.avro");
// Log the inferred schema so it appears in the KNIME console
logWarn("Schema:\n" + df.schema().treeString());
return df;
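If that format name is not found in your environment, note that since Spark 2.4 Avro support also ships as Spark's own external spark-avro module under the shorter format name "avro". A hedged alternative, assuming that module is on the classpath:

```java
// Same snippet, using Spark's built-in external Avro module instead
final Dataset<Row> df = spark.read().format("avro").load("/tmp/sample.avro");
logWarn("Schema:\n" + df.schema().treeString());
return df;
```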

It looks like your AVRO file contains data types that are not supported by KNIME. Some sample data or the schema would help. The schema should be logged to the KNIME console if you run the code with the Local Big Data Environment.

Cheers
Sacha

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.