Test results: processing large files

For my chemoinformatics projects I need to process some very large SDF files, up to and over 20 GB.

Before throwing those at KNIME, I thought it might be a good idea to do some testing of how KNIME handles such files.

I used a relatively modest test file of 234 MB and a workflow that reads the file, does some calculations, and writes it back out, and came to the following results:

| What | Result |
|------|--------|
| 1 big file of 234 MB | 29 minutes at 600% CPU utilization |
| 1 big file of 234 MB, split (using a tool I made; see the sketch below) into 18 files of around 13 MB each, fed into a loop | 6:45 at 150% CPU utilization |
| 1 big file of 234 MB, chunked in blocks of 1001 rows using the Chunk Loop within the workflow | Execute failed: Input table's structure differs from reference (first iteration) table: different column counts 12 vs. 13 |
| 1 big file of 234 MB, chunked in automatic blocks using the Parallel Chunk Start + End nodes in the workflow | ERROR Advanced MolConverter: Execute failed: GC overhead limit exceeded |
| 1 big file of 234 MB, chunked in 40 blocks using the Parallel Chunk Start + End nodes in the workflow | ERROR Advanced MolConverter: Execute failed: GC overhead limit exceeded; KNIME crashed |
| 1 big file of 234 MB, chunked in 100 blocks using the Parallel Chunk Start + End nodes in the workflow | ERROR Parallel Chunk End: Execute failed: Cell count in row "Row852" is not equal to length of column names array: 11 vs. 12 |
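For reference, splitting an SDF file safely means cutting on record boundaries, since each record ends with a line containing only `$$$$`. The sketch below shows the general idea in Python; it is not the actual tool I used (it splits by record count rather than by target file size, and the function name, file names, and chunk size are only illustrations):

```python
def split_sdf(path, records_per_chunk=1000, prefix="chunk"):
    """Split an SDF file into smaller files on $$$$ record boundaries."""
    chunk, count, part = [], 0, 0
    with open(path, "r") as src:
        for line in src:
            chunk.append(line)
            if line.strip() == "$$$$":          # end of one SDF record
                count += 1
                if count == records_per_chunk:  # flush a full chunk
                    part += 1
                    with open(f"{prefix}_{part:03d}.sdf", "w") as out:
                        out.writelines(chunk)
                    chunk, count = [], 0
    if chunk:                                   # write any remainder
        part += 1
        with open(f"{prefix}_{part:03d}.sdf", "w") as out:
            out.writelines(chunk)

split_sdf("big_library.sdf", records_per_chunk=1000)
```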

This came as quite a surprise to me: all the ways to chunk the data within KNIME failed, probably due to the flexible nature of SDF files: records can carry different sets of &lt;property&gt; tags, so different chunks can parse into tables with different column sets, which would explain the column-count errors above.
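One way to check that suspicion is to scan the file and count how many distinct sets of &lt;property&gt; tags the records carry. A quick sketch, assuming RDKit is available and using a placeholder file name:

```python
# Count distinct property-tag sets across records; more than one set
# means row-based chunking can yield tables with differing columns.
from rdkit import Chem

tag_sets = set()
for mol in Chem.SDMolSupplier("big_library.sdf"):
    if mol is None:                      # skip unparsable records
        continue
    tag_sets.add(frozenset(mol.GetPropNames()))

print(f"{len(tag_sets)} distinct property-tag sets found")
```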

And secondly: the pre-chunked data performed dramatically better, roughly four times faster at a quarter of the CPU usage. I have no idea how that is possible.

Any insights here?

Hi Ellert,

Thanks for sharing this! What tool did you use for the calculations ("Advanced MolConverter"???)? Which nodes take longest to execute? And do you see any difference with a plain SDF read + write between the full file and the split files?

Thanks,
  Bernd