Reading AVRO files in KNIME

Hi,

This is a continuation of the issue I raised here: AVRO file reader?
I’m afraid I still can’t load local AVRO files using the provided solution. The Avro to Spark node throws the following error:

2020-10-28 14:01:12,427 : ERROR : KNIME-Worker-12-Avro to Spark 0:3 : : Node : Avro to Spark : 0:3 : Execute failed: An error occured. For details see View > Open KNIME log.
java.lang.NullPointerException
at org.knime.bigdata.spark2_4.api.TypeConverters.getConverter(TypeConverters.java:121)
at org.knime.bigdata.spark2_4.api.TypeConverters.convertSpec(TypeConverters.java:162)
at org.knime.bigdata.spark2_4.jobs.genericdatasource.GenericDataSource2SparkJob.runJob(GenericDataSource2SparkJob.java:82)
at org.knime.bigdata.spark2_4.jobs.genericdatasource.GenericDataSource2SparkJob.runJob(GenericDataSource2SparkJob.java:1)
at org.knime.bigdata.spark.local.wrapper.LocalSparkWrapperImpl.runJob(LocalSparkWrapperImpl.java:127)
at org.knime.bigdata.spark.local.context.LocalSparkJobController.lambda$1(LocalSparkJobController.java:92)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

I’m not sure I’ll be able to post the AVRO I’m trying to process here, because it’s semi-sensitive and is around 300MB. I might be able to post PyCharm’s interpretation of the schema portion.

As an aside, I still think it would be easier if we had a Read AVRO node that just uses the relevant Apache libraries, rather than mucking about with Spark.

Cheers,

Richard

Hi @rsherhod -

We do have a couple of tickets open about this feature request for reading local AVRO files (AP-8056; BD-573). I’ll add a +1 from you on those tickets. Sorry for the trouble.


Do you have a sample file I could try?

@Daniel_Weikert - I’ll try to find one that doesn’t contain proprietary data.

@ScottF - That would be really useful. Presumably such a node would use the schema portion of the AVRO file to define a table spec, and the file's contents would then be written out into columns? It would also be useful if it could just output the contents as a JSON column. Most JSON data I encounter would produce way more columns than I'd ever need, which is why I rarely use the JSON to Table node. Instead I use the excellent JSON Path node to be selective about what I extract. The same would be true for AVRO.
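For reference, a node like that could sit directly on top of the Apache Avro Java API: the file header carries the schema (which could drive the table spec), and each record can be serialized as JSON for a JSON Path workflow. A minimal sketch of the idea, assuming the org.apache.avro jar is on the classpath; the file path and class name are illustrative only, not KNIME API:

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroDump {
    public static void main(String[] args) throws IOException {
        File avroFile = new File("/tmp/sample.avro"); // illustrative path

        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(avroFile, datumReader)) {

            // The schema is embedded in the file header -- this is what a
            // Read AVRO node could use to build the KNIME table spec.
            Schema schema = fileReader.getSchema();
            System.out.println(schema.toString(true));

            // GenericRecord#toString() renders each record as JSON, so a node
            // could also expose every row as a single JSON column.
            while (fileReader.hasNext()) {
                GenericRecord record = fileReader.next();
                System.out.println(record.toString());
            }
        }
    }
}
```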

Hi @rsherhod,

You can try the Spark DataFrame Java Snippet (Source) node and run the following code (replace /tmp/sample.avro with the real path):

// Load the Avro file via the Databricks Avro data source
final Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("/tmp/sample.avro");
// Log the inferred schema so it appears in the KNIME console
logWarn("Schema:\n" + df.schema().treeString());
return df;
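If that format name is not found in your environment, note that since Spark 2.4 Avro support also ships as Spark's own external spark-avro module under the shorter format name "avro". A hedged alternative, assuming that module is on the classpath:

```java
// Same snippet, using Spark's built-in external Avro module instead
final Dataset<Row> df = spark.read().format("avro").load("/tmp/sample.avro");
logWarn("Schema:\n" + df.schema().treeString());
return df;
```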

It looks like your AVRO file contains data types that are not supported by KNIME. Some sample data or the schema would help. The schema should be logged to the KNIME console if you run the code with the Local Big Data Environment.

Cheers
Sacha

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.