Table Writer: Add option to compress data

Hi,

While comparing compression efficiency on a large data set I am currently processing, I noticed that the Parquet Writer supports compression and saves substantially more space than the Table Writer.


It would be nice to have compression natively supported in the Table Writer node, as it is in the CSV Writer with gzip.

Best
Mike

To my knowledge, our table files for the row-based backend (the default) are Snappy compressed. I think the reason you see such a difference is that Parquet is columnar and thus naturally groups similar values together, making them trivial to compress (e.g. using run-length encoding of the same cell value). In contrast, the row-based table format stores one row after another, so the cells of the same column are spread out over the table file, which makes it harder for the compression algorithm to compress.
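As a rough illustration of that point (not KNIME code, just a toy sketch that uses zlib as a stand-in for Snappy), compressing the same cells laid out row-wise versus column-wise shows how much the layout matters when one column is repetitive:

```python
import zlib

# Toy data: one column with unique values, one column that repeats a value.
n_rows = 10_000
col_a = [str(i) for i in range(n_rows)]   # unique value per row
col_b = ["constant"] * n_rows             # highly repetitive column

# Row-based layout: the cells of each row are stored next to each other.
row_based = "".join(a + "," + b + "\n" for a, b in zip(col_a, col_b)).encode()

# Columnar layout: all values of a column are stored together.
columnar = ("\n".join(col_a) + "\n" + "\n".join(col_b)).encode()

# zlib is only a stand-in for Snappy here; the relative difference is the point.
print("row-based layout :", len(zlib.compress(row_based)), "bytes")
print("columnar layout  :", len(zlib.compress(columnar)), "bytes")
```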

You can rename the .table file to .zip and decompress it. Inside you will find a data.bin file that contains the row data (it cannot be decompressed with OS tools, since it is the KNIME table with Snappy compression).
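If you just want to peek inside without renaming the file, something like the following should work (the path is a placeholder, and the exact entry names inside the archive may differ between KNIME versions):

```python
import zipfile

# Placeholder path: point this at one of your own .table files.
table_file = "output.table"

# A .table file is a ZIP archive, so its contents can be listed directly.
with zipfile.ZipFile(table_file) as zf:
    for info in zf.infolist():
        print(f"{info.filename:40s} {info.file_size:>12d} bytes")
```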


You could try writing your data with the columnar backend and see if that improves the compression ratio. For that, I think you need to set the whole workflow to use that backend.


Interesting to know 🙂

I will check that as well once processing is finished, including the columnar backend approach.