KNIME Parquet Reader adds an empty list to the first row if the list column is missing its value

A little frustration I just ran into: I wrote a Parquet file containing a List (Collection of String) column whose first several rows were missing a value, read that file back in, and then tried to write it again, all with the standard v4.7.2 Parquet Reader/Writer nodes. After the round-trip, the first row holds an empty array as the value for the list column instead of a missing cell. Trying to write the Parquet file again then fails with this stack trace:

2021-06-02 16:38:16,481 : ERROR : KNIME-Worker-25-Parquet Writer 0:86:124 : : Node : Parquet Writer : 0:86:124 : Execute failed: Data type mapping exception
org.knime.bigdata.fileformats.utility.BigDataFileFormatException: Data type mapping exception
at org.knime.bigdata.fileformats.parquet.writer.DataRowWriteSupport.write(DataRowWriteSupport.java:153)
at org.knime.bigdata.fileformats.parquet.writer.DataRowWriteSupport.write(DataRowWriteSupport.java:1)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:292)
at org.knime.bigdata.fileformats.parquet.writer.ParquetKNIMEWriter.writeRow(ParquetKNIMEWriter.java:172)
at org.knime.bigdata.fileformats.node.writer.FileFormatWriterNodeModel.writeToFile(FileFormatWriterNodeModel.java:404)
at org.knime.bigdata.fileformats.node.writer.FileFormatWriterNodeModel.writeRowInput(FileFormatWriterNodeModel.java:356)
at org.knime.bigdata.fileformats.node.writer.FileFormatWriterNodeModel.execute(FileFormatWriterNodeModel.java:275)
at org.knime.bigdata.fileformats.node.writer.FileFormatWriterNodeModel.execute(FileFormatWriterNodeModel.java:1)
at org.knime.core.node.NodeModel.executeModel(NodeModel.java:576)
at org.knime.core.node.Node.invokeFullyNodeModelExecute(Node.java:1236)
at org.knime.core.node.Node.execute(Node.java:1016)
at org.knime.core.node.workflow.NativeNodeContainer.performExecuteNode(NativeNodeContainer.java:558)
at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:95)
at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:201)
at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:117)
at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:334)
at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:210)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:246)

The write only succeeds once I remove the empty array value from the list column and replace it with null (a missing value).

Separately, it sure would be nice if the Parquet nodes supported Set columns in addition to List columns. That doesn't seem like a big stretch. Perhaps even allow struct mappings to and from JSON?

Hi @bfrutchey -

I spent a little time trying to recreate the problem, but so far I’ve been unable to. Do you have a toy workflow you could share that I could pass on to the dev team?

See attached: parquet_test.knwf (16.5 KB)

Hi @bfrutchey -

Thanks for providing the workflow. I did a little more testing, and while I was able to recreate the problem using the deprecated nodes in your workflow, the updated nodes seem to work OK - so as far as I can tell the problem has been fixed in the latest version of KNIME AP 4.3.3.

[Screenshot: KNIME Analytics Platform, 2021-06-07]


You guys are awesome! Thanks for digging in. More reasons to upgrade.


This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.