Normalizing file size

Dear Knime community,

I’ve downloaded the ZINC database and filtered it based on the properties I wanted. I now have ~1000 files with the compounds I want, but some files have 1 molecule (in SMILES format) and some files have >10 million molecules. I’d like to normalize the files so that each one has 1 million molecules.

I can use the Chunk Loop to write files of 1 million molecules, but this requires that I first read in the entire set of molecules (>200 million) and my computer can’t handle this. I’m trying to develop a workflow that:

  1. Read in files until >= 1 million molecules are available in a “buffer table”.
  2. Split off 1 million molecules and write them to a file; continue writing files of 1 million molecules until there are no longer enough molecules in the buffer table.
  3. Loop back to step 1 and continue until all files are processed (a rough sketch of this logic follows the list).
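Outside of KNIME, the logic I’m after would look roughly like the Python sketch below. The folder names, file pattern, and chunk size are placeholders I made up for illustration, not part of my actual data or workflow:

```python
from pathlib import Path

CHUNK_SIZE = 1_000_000                    # molecules per output file
input_dir = Path("zinc_filtered")         # hypothetical folder with the ~1000 SMILES files
output_dir = Path("zinc_equalized")       # hypothetical output folder
output_dir.mkdir(exist_ok=True)

buffer = []     # the "buffer table": molecules read but not yet written out
out_index = 0   # running counter used to name the output files

for smi_file in sorted(input_dir.glob("*.smi")):
    # Step 1: read the next input file, line by line, into the buffer.
    with smi_file.open() as fh:
        for line in fh:
            if not line.strip():
                continue
            buffer.append(line.rstrip("\n"))

            # Step 2: whenever the buffer holds at least 1 million molecules,
            # split off a chunk and write it to its own file.
            if len(buffer) >= CHUNK_SIZE:
                out_index += 1
                out_path = output_dir / f"molecules_{out_index:04d}.smi"
                out_path.write_text("\n".join(buffer[:CHUNK_SIZE]) + "\n")
                buffer = buffer[CHUNK_SIZE:]

# Step 3 happens via the outer loop; any leftover molecules
# (fewer than 1 million) go into one final, smaller file.
if buffer:
    out_index += 1
    (output_dir / f"molecules_{out_index:04d}.smi").write_text("\n".join(buffer) + "\n")
```

Because the buffer is flushed as soon as it reaches 1 million molecules, memory use stays bounded at roughly one chunk rather than the full >200 million molecules. This is what I’d like to reproduce with KNIME loop nodes.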

I’m using the following workflow to read in multiple files:

I’m having a hard time modifying this workflow to perform the desired loop described above. Would anyone be able to help?

Many thanks,
Jeremy.

Can you upload your workflow here?

Hi elsamuel,

I don’t have a workflow that accomplishes this - I’ve tried many things but can’t seem to make it work. The only way I’ve managed it is by manually loading a small number of files at a time into the attached workflow, which loads in all the data and then uses a Chunk Loop to write out evenly sized files. With very large datasets this is rather tedious. It would be great if the file-loading loop could be modified to load in some data, write it to a file, then continue loading more data. I hope this helps explain the issue.

File_Equalizer_Example.knwf (58.7 KB)

Thanks!

Hi there @jwmason,

welcome to KNIME Community!

In this case the streaming functionality could be very useful. Give it a try and let us know how it goes!

Br,
Ivan
