Does the Table Writer write unnecessary data?

AngusVeitch · January 9, 2019, 3:42am

When exporting data tables in KNIME’s native format with the Table Writer, I often find that the exported file is larger than I expect. Furthermore, I have found that removing rows from the table prior to writing makes no difference to the size of the exported file. This suggests that the node is writing data that has already been removed from the table. Using the Cache node prior to writing makes no difference either.

In most or all cases where I have observed this behaviour, the exported table contains a column of text documents in KNIME’s native document format. I don’t know if this affects the observed behaviour in any way, but I have found that converting the documents to plain text and then back to documents results in the sampled output file being a more appropriate size.

In a current example, a table of 700 documents results in an output file of about 160MB. Using the row sampler to reduce the table to 100 rows still produces a 160MB file. However, converting those documents to strings and back again (using the Document Data Extractor and then the Strings to Document nodes) produces a file of only 24MB.

If the Table Writer is supposed to work this way, then is there any way to force it to purge the unwanted data? In the case of documents, I don’t want to convert them to strings and back because doing so often corrupts the word tokenisation (a separate problem that has been discussed elsewhere).

And if the Table Writer is not supposed to work this way, is this a known bug?

Thanks for any help you can offer.
-Angus

wiswedel · January 9, 2019, 9:36am

Text documents in a workflow are kept in “file stores”, whereby the number of documents kept in each file store is controlled via a configuration option in “File” -> “Preferences” -> “KNIME” -> “Textprocessing” -> “Storage”. If you reduce that number, the exported file will be more compact.
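
To illustrate why sampling rows may not shrink the file, here is a toy Python model. It is not KNIME's actual implementation. It assumes that documents are grouped into file stores in row order, and that the writer exports in full every file store referenced by at least one remaining row. The function name `documents_written` and its parameters are hypothetical.

```python
def documents_written(kept_rows, total_docs, docs_per_store):
    """Toy model (assumption, not KNIME code): count the documents that end
    up in the export if every file store touched by a kept row is written
    in full."""
    # Index of the file store that holds each kept row.
    touched_stores = {row // docs_per_store for row in kept_rows}

    written = 0
    for store in touched_stores:
        first_row = store * docs_per_store
        # The last store may be only partially filled.
        written += min(docs_per_store, total_docs - first_row)
    return written


# 700 documents, sampled down to 100 rows (every 7th row).
sample = range(0, 700, 7)

for chunk in (1000, 5, 1):
    print(chunk, documents_written(sample, total_docs=700, docs_per_store=chunk))
```

Under these assumptions, a large number of documents per file store means the 100-row sample still drags all 700 documents into the export, while one document per file store writes only the 100 that remain.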

AngusVeitch · February 7, 2019, 5:21am

Thanks - this does seem to have helped!

It’s just a bit inconvenient that I will have to change the setting each time I want to save a smaller version of the output, but we can’t have everything, I suppose!