Loading very big files with Mol2-Reader

Hi everyone,

I have a quick question about the Mol2 Reader.
I created a really big Mol2 file. See for yourself how big it is:

jackd@develb0x:~$ ls -la /storage/chem/bigfat.mol2
-rw-r--r-- 1 jackd jackd 1493588890 2009-04-02 21:14 /storage/chem/bigfat.mol2

It contains 300,000 molecules.

OK, but when I run the Mol2 Reader to read the beast, the following error is thrown:

ERROR KNIME-Worker-3 Mol2 Reader Execute failed: GC overhead limit exceeded

Is it possible to load files this big into my workflow?
I ask because my nodes should be able to process very big files.
If the reader can’t read files this big, I will have to rethink my approach :slight_smile:

Greetings,
imax.

You are welcome to test whether the Tripos Mol2 reader node behaves better.
In case you want to beta test the Tripos Chemistry extensions for KNIME 2.0 please contact me:
loeprecht@tripos.com

Best regards,
Björn Loeprecht

Imax, could you try setting the “Memory Settings” of the Mol2 Reader to “Write tables to disc” and see whether the GC error still appears? By default, up to 100,000 molecules of at most 8 KB each are held in memory before they are written to disc. If you have 300,000 very big molecules, this may need up to 780 MB of memory. You may therefore also want to increase KNIME’s heap size in knime.ini.
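To make the arithmetic above concrete, here is a minimal Java sketch that recomputes the worst-case buffer size and compares it with the heap the running JVM may use (the limit set by the JVM's -Xmx option). The class and method names are made up for illustration, and "8kb" is assumed to mean 8 × 1024 bytes; this is not code from the Mol2 Reader.

```java
// Back-of-the-envelope check of the figures quoted in the post above.
// HeapEstimate and its methods are hypothetical names, not KNIME API.
class HeapEstimate {

    // Worst-case number of bytes buffered before rows are flushed to disc.
    static long bufferBytes(long bufferedMolecules, long maxBytesPerMolecule) {
        return bufferedMolecules * maxBytesPerMolecule;
    }

    // Converts bytes to mebibytes (1 MiB = 1024 * 1024 bytes).
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Assumption: 100,000 buffered molecules of at most 8 KB each.
        long needed = bufferBytes(100_000, 8 * 1024);
        // Maximum heap the current JVM will attempt to use.
        long maxHeap = Runtime.getRuntime().maxMemory();

        System.out.printf("Buffer worst case: %.0f MiB%n", toMiB(needed));
        System.out.printf("JVM max heap:      %.0f MiB%n", toMiB(maxHeap));
        if (needed > maxHeap) {
            System.out.println("The buffer alone would not fit into the current heap.");
        }
    }
}
```

Running this with the same JVM settings as KNIME gives a rough idea of whether the default buffer could fit at all.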

Hi thor,

I have already set the option to “Write tables to disc”. If I don’t select this option, I get a heap space error instead. So basically I have the choice between a heap space error and a garbage collector error :slight_smile:

Sure, I can increase the heap space.
But I thought the “Write tables to disc” option exists precisely to avoid heap space problems.
If I had a molecule file with 2,000,000 very big molecules, I would need more than 2 GB, which is impossible on a 32-bit machine (even with PAE), AFAIK.

I will keep trying. If I find a good solution, I will let you know.

Greetings,
imax.

Hi imax,

Yes, the “Write tables to disc” method is supposed to avoid this memory problem. Can you check whether the problem still occurs when you use the “Generate row IDs” option? If that fixes it, can you tell us the typical length of the IDs contained in the file (the title follows the “@<TRIPOS>MOLECULE” line)?

The reason I’m asking is that chunks of row IDs (100,000 at a time) are cached in memory to check for duplicates. If the average length of the IDs is large, this will of course consume more memory. The IDs generated by the “Generate row IDs” option are rather small; they will fit into memory easily.
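For readers wondering what chunk-wise duplicate checking can look like, here is a rough Java sketch. It is an assumption about the general technique, not KNIME's actual implementation: the class ChunkedDuplicateCheck is invented for illustration, and the "spilled" chunks are kept in an in-memory list as a stand-in for files on disc. Only the current chunk is held in a hash set; full chunks are sorted and spilled, and a final k-way merge looks for equal neighbours across chunks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch of chunked duplicate checking; not KNIME code.
class ChunkedDuplicateCheck {
    private final int chunkSize;
    // IDs of the chunk currently being filled (the only part that must be in memory).
    private final Set<String> current = new HashSet<>();
    // Stand-in for sorted chunks written to disc.
    private final List<List<String>> spilled = new ArrayList<>();

    ChunkedDuplicateCheck(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    // Adds an ID; a duplicate inside the current chunk is reported immediately.
    void add(String id) {
        if (!current.add(id)) {
            throw new IllegalArgumentException("Duplicate row ID: " + id);
        }
        if (current.size() >= chunkSize) {
            spill();
        }
    }

    // Sorts the current chunk, stores it, and frees the in-memory set.
    private void spill() {
        List<String> sorted = new ArrayList<>(current);
        Collections.sort(sorted);
        spilled.add(sorted);
        current.clear();
    }

    // Merges all sorted chunks and returns the first ID seen twice, or null.
    String findDuplicateAcrossChunks() {
        if (!current.isEmpty()) {
            spill();
        }
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> chunk : spilled) {
            heap.add(new Cursor(chunk.iterator()));
        }
        String previous = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            if (c.head.equals(previous)) {
                return c.head;
            }
            previous = c.head;
            if (c.advance()) {
                heap.add(c);
            }
        }
        return null;
    }

    // Read position within one sorted chunk (chunks are never empty).
    private static final class Cursor {
        final Iterator<String> it;
        String head;

        Cursor(Iterator<String> it) {
            this.it = it;
            this.head = it.next();
        }

        boolean advance() {
            if (it.hasNext()) {
                head = it.next();
                return true;
            }
            return false;
        }
    }
}
```

With a scheme like this, memory use is driven by the chunk size times the size of each stored ID, which is why the length of the IDs matters.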

Thanks for your help.
Bernd

Hi wiswedel,
Thanks for your help.
But it doesn’t work :frowning:
If I enable “Generate row IDs”, the GC error still occurs, but quietly! The node’s progress bar keeps bouncing from left to right, but there is no activity.
And this error message appears in the console:

ERROR KNIME-Worker-0 ThreadPool An exception occurred while executing a runnable.
Apr 7, 2009 10:00:16 AM sun.awt.X11.XToolkit processException
WARNING: Exception on Toolkit thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Long.valueOf(Long.java:550)
at sun.awt.X11.XToolkit.windowToXWindow(XToolkit.java:348)
at sun.awt.X11.XToolkit.run(XToolkit.java:564)
at sun.awt.X11.XToolkit.run(XToolkit.java:519)
at java.lang.Thread.run(Thread.java:619)

The Mol2 file I created just has generic molecule names like:

{Molecule_1, Molecule_2, …, Molecule_300000}

Greetings,
imax.

Hi wiswedel again,

I tried it again with a fresh workflow and it seems to work!
Thank you very much :slight_smile: Now I can continue developing.

So the solution is to enable “Generate row IDs”.

Thanks again.
Greetings,
imax.

This is good news, though names like “Molecule_241343” shouldn’t cause problems. Their length is at most 15 characters, i.e. 30 bytes in the Java world. If you have 300,000 of them, that makes about 9 MB of required memory (but then again, we keep at most 100,000 of them in memory, i.e. 3 MB). Even if you double that to accommodate the usual Java object overhead, you are significantly below 2 GB.
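The estimate above can be checked with a few lines of Java. This is only a sketch under the assumption stated in the post, namely 2 bytes per character (UTF-16), and it counts the character payload only, without object headers; the class name IdMemoryEstimate is made up. Newer JVMs with compact strings may store such ASCII names in 1 byte per character, which would make the numbers even smaller.

```java
// Recomputes the ID memory estimate for names Molecule_1 .. Molecule_N.
// Assumption: 2 bytes per char, payload only; names are hypothetical.
class IdMemoryEstimate {
    static final int BYTES_PER_CHAR = 2;

    // Upper bound: every ID is treated as having the maximum length.
    static long upperBoundBytes(long count, int maxChars) {
        return count * maxChars * BYTES_PER_CHAR;
    }

    // Exact character payload of "Molecule_1" .. "Molecule_<count>".
    static long exactBytes(int count) {
        long chars = 0;
        for (int i = 1; i <= count; i++) {
            chars += "Molecule_".length() + String.valueOf(i).length();
        }
        return chars * BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        System.out.println("Upper bound, 300,000 IDs: " + upperBoundBytes(300_000, 15) + " bytes");
        System.out.println("Upper bound, 100,000 IDs: " + upperBoundBytes(100_000, 15) + " bytes");
        System.out.println("Exact payload, 300,000 IDs: " + exactBytes(300_000) + " bytes");
    }
}
```

Either way the total stays in the single-digit megabyte range, which supports the point that the names themselves cannot explain the out-of-memory error.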

I suspect that the reader reads more than just “Molecule_xxx” as the key. I have seen similar problems before, where a string seems to be small but is actually just a view on a much larger string (which is then also kept in memory). I have just committed changes for the next bug fix release that make sure the string is as small as possible (at least for the purpose of duplicate checking).
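As background on the "view on a much larger string" effect: in Java versions before 7u6, String.substring returned a string that shared the character array of the original, so a 15-character title cut out of a large text block could keep the whole block alive. A defensive copy avoids that. The sketch below illustrates the idea only; the class and method names are hypothetical, and it is not the change that was committed to KNIME. On current JVMs substring already copies, so the extra copy is harmless but no longer needed.

```java
// Illustrates keeping only a compact copy of a key cut out of a large string.
// CompactKey and compactCopy are hypothetical names, not KNIME API.
class CompactKey {

    // Returns a new String holding only the characters of the argument.
    // On old JVMs this detaches the key from a shared, larger char array.
    static String compactCopy(String possiblyShared) {
        return new String(possiblyShared);
    }

    public static void main(String[] args) {
        // Build a large text block whose first line is the molecule title.
        StringBuilder big = new StringBuilder("Molecule_1\n");
        for (int i = 0; i < 1000; i++) {
            big.append("atom record ").append(i).append('\n');
        }
        String block = big.toString();

        // The title is a substring of the large block.
        String title = block.substring(0, block.indexOf('\n'));
        // Store the compact copy, not the substring, in long-lived caches.
        String key = compactCopy(title);

        System.out.println(key + " (" + key.length() + " chars, block has "
                + block.length() + " chars)");
    }
}
```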

Ignoring all these technical details … could you run this test another time once KNIME v2.0.2 is out (it comes out today or tomorrow)?

Thanks for your help in isolating this problem.
Bernd

Hi wiswedel,

I just tested the new version, and it works without “Generate row IDs”!

Thank you very much.
Greetings,
imax.