Normalizing file size

Dear Knime community,

I’ve downloaded the ZINC database and filtered it based on the properties I wanted. I now have ~1000 files with the compounds I want, but some files have 1 molecule (in SMILES format) and some files have >10 million molecules. I’d like to normalize the files so that each one has 1 million molecules.

I can use the Chunk Loop to write files of 1 million molecules, but this requires that I first read in the entire set of molecules (>200 million) and my computer can’t handle this. I’m trying to develop a workflow that:

  1. Read in files until >= 1 million molecules are available in a “buffer table”.
  2. Split off 1 million molecules and write them to a file; continue writing files of 1 million molecules until there are no longer enough molecules in the buffer table.
  3. Loop back to step 1 and continue until all files are processed (a rough sketch of this logic follows the list).
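Outside of KNIME, the logic I’m after would look roughly like the Python sketch below. The folder names, file pattern, and chunk size are placeholders I made up for illustration, not part of my actual data or workflow:

```python
from pathlib import Path

CHUNK_SIZE = 1_000_000                    # molecules per output file
input_dir = Path("zinc_filtered")         # hypothetical folder with the ~1000 SMILES files
output_dir = Path("zinc_equalized")       # hypothetical output folder
output_dir.mkdir(exist_ok=True)

buffer = []     # the "buffer table": molecules read but not yet written out
out_index = 0   # running counter used to name the output files

for smi_file in sorted(input_dir.glob("*.smi")):
    # Step 1: read the next input file, line by line, into the buffer.
    with smi_file.open() as fh:
        for line in fh:
            if not line.strip():
                continue
            buffer.append(line.rstrip("\n"))

            # Step 2: whenever the buffer holds at least 1 million molecules,
            # split off a chunk and write it to its own file.
            if len(buffer) >= CHUNK_SIZE:
                out_index += 1
                out_path = output_dir / f"molecules_{out_index:04d}.smi"
                out_path.write_text("\n".join(buffer[:CHUNK_SIZE]) + "\n")
                buffer = buffer[CHUNK_SIZE:]

# Step 3 happens via the outer loop; any leftover molecules
# (fewer than 1 million) go into one final, smaller file.
if buffer:
    out_index += 1
    (output_dir / f"molecules_{out_index:04d}.smi").write_text("\n".join(buffer) + "\n")
```

Because the buffer is flushed as soon as it reaches 1 million molecules, memory use stays bounded at roughly one chunk rather than the full >200 million molecules. This is what I’d like to reproduce with KNIME loop nodes.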

I’m using the following workflow to read in multiple files:

I’m having a hard time modifying this workflow to perform the desired loop described above. Would anyone be able to help?

Many thanks,
Jeremy.

Can you upload your workflow here?

Hi elsamuel,

I don’t have a workflow that accomplishes this - I’ve tried many things but can’t seem to make it work. The only way I’ve managed it is by manually loading a small number of files at a time into the attached workflow, which loads in all the data and then uses a Chunk Loop to write out evenly sized files. With very large datasets this is rather tedious. It would be great if the file-loading loop could be modified to load in some data, write it to a file, then continue loading more data. I hope this helps explain the issue.

File_Equalizer_Example.knwf (58.7 KB)

Thanks!

Hi there @jwmason,

welcome to KNIME Community!

In this case the streaming functionality could be very useful. Give it a try and let us know how it goes!

Br,
Ivan
