For my chemoinformatics projects I need to process some very large SDF files, up to and over 20 GB.
Before throwing those at KNIME, I thought it would be a good idea to first test how KNIME handles such files.
I used a relatively modest test file of 234 MB and a workflow that reads the file, does some calculations, and writes it back out, with the following results:
| What | Result |
| --- | --- |
| 1 big file of 234 MB | 29 min at 600% CPU utilization |
| 1 big file of 234 MB, split into 18 files of about 13 MB each (using a tool I made; see the splitter sketch below the table) and fed into a loop | 6 min 45 s at 150% CPU utilization |
| 1 big file of 234 MB, chunked in blocks of 1001 rows using a Chunk Loop within the workflow | Execute failed: Input table's structure differs from reference (first iteration) table: different column counts 12 vs. 13 |
| 1 big file of 234 MB, chunked in automatic blocks using the Parallel Chunk Start + End nodes in the workflow | ERROR Advanced MolConverter Execute failed: GC overhead limit exceeded |
| 1 big file of 234 MB, chunked in 40 blocks using the Parallel Chunk Start + End nodes in the workflow | ERROR Advanced MolConverter Execute failed: GC overhead limit exceeded, and KNIME crashed |
| 1 big file of 234 MB, chunked in 100 blocks using the Parallel Chunk Start + End nodes in the workflow | ERROR Parallel Chunk End Execute failed: Cell count in row "Row852" is not equal to length of column names array: 11 vs. 12 |
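For reference, here is a minimal sketch of the kind of splitter I mean (not the exact tool; the records-per-file count and output file naming are illustrative). It only assumes the standard SDF convention that each record ends with a `$$$$` delimiter line, so it can stream a 20 GB file without loading it into memory:

```python
def split_sdf(path, records_per_file=10000, prefix="chunk"):
    """Stream a large SDF file and write fixed-size chunks of whole records."""
    out = None
    file_idx = 0
    record_count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        for line in src:
            if out is None:
                # Start a new output chunk lazily, on the first line that needs it
                out = open(f"{prefix}_{file_idx:04d}.sdf", "w", encoding="utf-8")
            out.write(line)
            if line.strip() == "$$$$":  # end of one SDF record
                record_count += 1
                if record_count >= records_per_file:
                    out.close()
                    out = None
                    file_idx += 1
                    record_count = 0
    if out is not None:  # flush a final, partially filled chunk
        out.close()

if __name__ == "__main__":
    split_sdf("big_library.sdf", records_per_file=10000)
```

Because records are only ever split on `$$$$` boundaries, every chunk is a valid standalone SDF file.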
This came as quite a surprise to me: all of the ways to chunk the data within KNIME failed, probably due to the flexible nature of SDF files, as illustrated below.
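For example, two hypothetical records in the same SDF file can carry different sets of property tags, so chunks that are parsed separately can yield tables with different column counts:

```
  mol-0001
  (molblock omitted)
> <MW>
310.4

> <LogP>
2.3

$$$$
  mol-0002
  (molblock omitted)
> <MW>
298.1

$$$$
```

A reader that infers its columns per chunk would see two property columns for a chunk containing the first record but only one for a chunk containing the second, which would explain errors like "different column counts 12 vs. 13" above.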
And secondly: the pre-chunked data performed dramatically better, about 4 times faster at roughly a quarter of the CPU usage, and I have no idea how that is possible.
Any insights here?