Best practice on how to handle large data

Hello forum,

I have a general question on how to handle large data.  Here is the scenario:

Node "A" produces result items with each size in the range of 10 MB.  Node "B" consumes such items.

Should I dump the item content into string or binary cells, or write it to disk and just put the resulting file name into the output table?

I have collected some pros and cons so far, but maybe I have missed something essential.

PRO dumping the content directly into the table

  • No need to create temporary files, and, even more importantly, no need to clean them up!  (Who is responsible for the cleanup?  Do I even know how many subsequent nodes will still need the temporary files?)
  • When saving the workflow, all data in the table is saved as well.  I don't have to worry about packing the temporary files together with the workflow.

CONTRA

  • Tools inside the node, unless written natively in the node itself, normally act on file input and output.  If I have the chance to modify the command-line interface of the underlying tool, I could add a stream input channel, but that is definitely extra effort.  If the tool is a "black box" that I can't modify, I even have to build a table-content-to-tool-input-file and tool-output-file-to-table-content workaround loop around the tool (see the sketch below this list)!
  • When writing to / reading from a file, I can compress the content myself.
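
To make the "workaround loop" from the first contra point concrete, here is a rough Java sketch of what the node would have to do around a black-box tool.  The tool name "blackboxtool" and its options are of course just placeholders:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public final class BlackBoxToolWrapper {

        /**
         * Writes the cell content to a temporary input file, runs the external
         * tool on it and reads the result back into memory.
         */
        public static byte[] runTool(final byte[] itemContent)
                throws IOException, InterruptedException {
            Path inFile = Files.createTempFile("knime-item-", ".in");
            Path outFile = Files.createTempFile("knime-item-", ".out");
            try {
                Files.write(inFile, itemContent);

                // "blackboxtool" and its arguments stand in for the real command line.
                Process p = new ProcessBuilder("blackboxtool",
                        "--input", inFile.toString(),
                        "--output", outFile.toString())
                        .inheritIO()
                        .start();
                if (p.waitFor() != 0) {
                    throw new IOException("Tool exited with code " + p.exitValue());
                }
                return Files.readAllBytes(outFile);
            } finally {
                // The wrapper cleans up its own temporary files.
                Files.deleteIfExists(inFile);
                Files.deleteIfExists(outFile);
            }
        }
    }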

Regarding the "compression" contra argument: isn't this already done automatically in KNIME?
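
And if the compression is not done for me and I go the file route, doing it by hand is at least simple.  A minimal sketch with the standard Java GZIP streams, purely for illustration:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public final class CompressedItemIO {

        /** Writes the item content to a GZIP-compressed file. */
        public static void write(final Path file, final byte[] content) throws IOException {
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
                out.write(content);
            }
        }

        /** Reads the item content back from a GZIP-compressed file. */
        public static byte[] read(final Path file) throws IOException {
            try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
                return in.readAllBytes();
            }
        }
    }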

Any thoughts and hints are welcome!

Best regards, Frank

Hi Frank,

I think that both approaches are feasible in principle. In my view, it depends mainly on two factors:

1. Do you want to process your items inside KNIME at some point, using some of the existing nodes?

2. Do you have a chance to store / handle / process your items more efficiently thanks to some special background knowledge?

To the best of my knowledge, KNIME stores its data in a ZIP-compressed format and also offers support for large items by means of, e.g., a BlobDataCell implementation. The implementations within KNIME are typically quite good, but they have a general-purpose character and might be inefficient in some special cases.
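
Just as a very rough illustration of the BlobDataCell route - the class and method names below are from memory of the org.knime.core.data API, and the way the serializer is registered differs between KNIME versions, so please double-check against the current API documentation:

    import java.io.IOException;
    import java.util.Arrays;

    import org.knime.core.data.DataCell;
    import org.knime.core.data.DataCellDataInput;
    import org.knime.core.data.DataCellDataOutput;
    import org.knime.core.data.DataCellSerializer;
    import org.knime.core.data.container.BlobDataCell;

    public final class LargeItemBlobCell extends BlobDataCell {

        // Assumption: KNIME looks up this flag on the concrete class and skips
        // its own compression if it is false - useful for already-compressed payloads.
        public static final boolean USE_COMPRESSION = false;

        private final byte[] m_content;

        public LargeItemBlobCell(final byte[] content) {
            m_content = content;
        }

        public byte[] getContent() {
            return m_content;
        }

        // A real implementation would also implement a suitable DataValue interface.

        @Override
        public String toString() {
            return "large item (" + m_content.length + " bytes)";
        }

        @Override
        protected boolean equalsDataCell(final DataCell dc) {
            return Arrays.equals(m_content, ((LargeItemBlobCell)dc).m_content);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(m_content);
        }

        /** Serializer writing the length followed by the raw bytes. */
        public static final class Serializer implements DataCellSerializer<LargeItemBlobCell> {
            @Override
            public void serialize(final LargeItemBlobCell cell, final DataCellDataOutput output)
                    throws IOException {
                output.writeInt(cell.m_content.length);
                output.write(cell.m_content);
            }

            @Override
            public LargeItemBlobCell deserialize(final DataCellDataInput input) throws IOException {
                byte[] content = new byte[input.readInt()];
                input.readFully(content);
                return new LargeItemBlobCell(content);
            }
        }
    }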

There was a presentation by Genentech at a KNIME UGM a couple of years ago. They basically used KNIME as a GUI to build Unix pipe-based command-line calls without ever importing the data into KNIME.

Hope this helps,

Nils

Hi Nils,

Thanks for your reply.  I found these very comprehensible slides by Man-Ling Lee from Genentech: https://www.knime.org/files/ugm2013_talks/2013-03o_knimeugm_commandlinenodes_v3_forknime_genentech_final.pdf - I think you meant this talk, describing a Unix pipeline editor/factory.

You mentioned a very good point for this thread: is the data produced by the nodes "interesting" for other standard KNIME nodes?

My primary goal for the nodes I have in mind - the reason I want to use KNIME here at all - is to benefit from KNIME's modularity.  I had not thought about subsequent nodes that might be able to process the produced data as well.

Your post triggers a new idea: I could go for a hybrid solution - writing the big data to disk, putting the file name into the output table and additionally generating column(s) for possible subsequent nodes.  In my specific case the node produces a molecule fragment space in a proprietary format, but the molecule fragments inside can be exported into the output table, so one can, for example, view a 2D depiction of these fragments.
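
Roughly what I have in mind for such a hybrid output row, as a sketch only (the class names are taken from the KNIME API as I remember it, and the extra columns - a fragment count and an exported structure string - are placeholders for whatever the real node would extract):

    import org.knime.core.data.DataCell;
    import org.knime.core.data.DataRow;
    import org.knime.core.data.RowKey;
    import org.knime.core.data.def.DefaultRow;
    import org.knime.core.data.def.IntCell;
    import org.knime.core.data.def.StringCell;

    public final class HybridRowFactory {

        /**
         * Builds one output row: the path to the fragment space file on disk
         * plus a couple of "viewable" columns for downstream nodes.
         */
        public static DataRow createRow(final int rowIndex, final String fragmentSpacePath,
                final int fragmentCount, final String exportedStructure) {
            DataCell pathCell = new StringCell(fragmentSpacePath);
            DataCell countCell = new IntCell(fragmentCount);
            DataCell structureCell = new StringCell(exportedStructure);
            return new DefaultRow(new RowKey("Row" + rowIndex), pathCell, countCell, structureCell);
        }
    }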

Nevertheless, I will also look into BlobDataCell and similar options.  Thanks for sharing your thoughts.

Frank