For my chemoinformatics projects I need to process some very large SDF files, up to and over 20 GB.
Before throwing those at KNIME, I thought it would be a good idea to first test how KNIME handles such files.
I used a relatively modest test file of 234 MB and a workflow that reads the file, does some calculations, and writes it back out, with the following results:
| What | Result |
| --- | --- |
| 1 big file of 234 MB | 29 min at 600% CPU utilization |
| 1 big file of 234 MB, split into 18 files of about 13 MB each (using a tool I made; see the splitter sketch below the table) and fed into a loop | 6 min 45 s at 150% CPU utilization |
| 1 big file of 234 MB, chunked in blocks of 1001 rows using a Chunk Loop within the workflow | Execute failed: Input table's structure differs from reference (first iteration) table: different column counts 12 vs. 13 |
| 1 big file of 234 MB, chunked in automatic blocks using the Parallel Chunk Start + End nodes in the workflow | ERROR Advanced MolConverter Execute failed: GC overhead limit exceeded |
| 1 big file of 234 MB, chunked in 40 blocks using the Parallel Chunk Start + End nodes in the workflow | ERROR Advanced MolConverter Execute failed: GC overhead limit exceeded, and KNIME crashed |
| 1 big file of 234 MB, chunked in 100 blocks using the Parallel Chunk Start + End nodes in the workflow | ERROR Parallel Chunk End Execute failed: Cell count in row "Row852" is not equal to length of column names array: 11 vs. 12 |
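For reference, here is a minimal sketch of the kind of splitter I mean (not the exact tool; the records-per-file count and output file naming are illustrative). It only assumes the standard SDF convention that each record ends with a `$$$$` delimiter line, so it can stream a 20 GB file without loading it into memory:

```python
def split_sdf(path, records_per_file=10000, prefix="chunk"):
    """Stream a large SDF file and write fixed-size chunks of whole records."""
    out = None
    file_idx = 0
    record_count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        for line in src:
            if out is None:
                # Start a new output chunk lazily, on the first line that needs it
                out = open(f"{prefix}_{file_idx:04d}.sdf", "w", encoding="utf-8")
            out.write(line)
            if line.strip() == "$$$$":  # end of one SDF record
                record_count += 1
                if record_count >= records_per_file:
                    out.close()
                    out = None
                    file_idx += 1
                    record_count = 0
    if out is not None:  # flush a final, partially filled chunk
        out.close()

if __name__ == "__main__":
    split_sdf("big_library.sdf", records_per_file=10000)
```

Because records are only ever split on `$$$$` boundaries, every chunk is a valid standalone SDF file.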
This came as quite a surprise to me: all of the ways to chunk the data within KNIME failed, probably due to the flexible nature of SDF files, as illustrated below.
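For example, two hypothetical records in the same SDF file can carry different sets of property tags, so chunks that are parsed separately can yield tables with different column counts:

```
  mol-0001
  (molblock omitted)
> <MW>
310.4

> <LogP>
2.3

$$$$
  mol-0002
  (molblock omitted)
> <MW>
298.1

$$$$
```

A reader that infers its columns per chunk would see two property columns for a chunk containing the first record but only one for a chunk containing the second, which would explain errors like "different column counts 12 vs. 13" above.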
And secondly: the pre-chunked data performed dramatically better, about 4 times faster at roughly a quarter of the CPU usage, and I have no idea how that is possible.
Any insights here?