Maximum Size of ECFP4 Fingerprint

Hi, 

I used the RDKit Fingerprint node to generate ECFP4 fingerprints (circular fingerprints using the Morgan algorithm) in different sizes. I've found that the minimum size that can be generated is 32 (2^5). What is the maximum fingerprint size that the node can generate?

I've tried sizes up to 65536 (2^16) and checked the number of set bits using the Fingerprint Properties node. However, both the number of set bits and the size differ from the values I obtained manually with my own Java code.

Could somebody confirm the maximum size? I couldn't find it anywhere in the documentation.

Thank you.

LM.

Hi,

From the algorithm's perspective there should not be any maximum size for the Morgan fingerprint. There is only a technical limit: according to the RDKit API (in C++), the number of bits is specified as an integer, so the limit is the maximum value of that data type.
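As a rough illustration of that technical limit, the sketch below assumes the bit count is held in a 32-bit signed integer (an assumption; the post only says "maximum integer") and estimates how much memory a dense bit vector of a given size would need. The helper name dense_size_bytes is hypothetical and not part of RDKit or KNIME.

```python
# Assumption: the number of bits is stored in a 32-bit signed integer.
INT32_MAX = 2**31 - 1


def dense_size_bytes(num_bits):
    """Bytes needed to store num_bits as a dense bit vector (8 bits per byte)."""
    return (num_bits + 7) // 8


# Sizes mentioned in this thread: 2^5, 2^16 and 2^18.
for exponent in (5, 16, 18):
    n = 2**exponent
    print(f"2^{exponent} = {n} bits -> {dense_size_bytes(n)} bytes per fingerprint")

print(f"assumed upper bound: {INT32_MAX} bits -> {dense_size_bytes(INT32_MAX)} bytes")
```

Under this assumption the practical ceiling is likely to be memory per fingerprint rather than the integer type itself.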

I am not sure what you mean by your "Java code" statement. Maybe you are calling a different API than the RDKit Node in KNIME? I am calling the following Java API in the KNIME node:

org.RDKit.RDKFuncs.getMorganFingerprintAsBitVect(mol, radius, numBits)

Documentation can be found here: http://www.rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMorganFingerprintAsBitVect

Kind regards,
Manuel

Hi,

This would be a bit easier to help with if you could let us know why you're interested in generating the very long fingerprints.

If you're concerned about bit collisions caused by the small fingerprints, it might be worth looking at these blog posts, where I spent some time looking at exactly this topic:

http://rdkit.blogspot.ch/2016/02/colliding-bits-iii.html

http://rdkit.blogspot.ch/2014/02/colliding-bits.html
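To make the idea of a bit collision concrete, here is a minimal sketch that does not use RDKit. It assumes that folding simply takes each feature identifier modulo the fingerprint size, which is a simplification, and the feature identifiers are random stand-ins rather than real atom-environment hashes. The function name count_collisions is hypothetical.

```python
import random


def count_collisions(feature_ids, n_bits):
    """Number of distinct features lost when folding into n_bits positions."""
    distinct = set(feature_ids)
    folded = {fid % n_bits for fid in distinct}
    return len(distinct) - len(folded)


# Hypothetical data: 60 random 32-bit identifiers standing in for the
# features of one molecule (not real RDKit output).
rng = random.Random(0)
features = [rng.getrandbits(32) for _ in range(60)]

for exponent in (5, 8, 11, 14, 18):
    n_bits = 2**exponent
    print(n_bits, count_collisions(features, n_bits))
```

With 60 features and only 32 positions, collisions are unavoidable; as the size grows they become progressively rarer.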

-greg

 

Hi Manuel, 

Thank you for your reply. Yes, I was calling a different API. Your information is highly appreciated.

Regards,

LM

Hi Greg,

Thank you for your reply. I am interested in studying how similarity behaves across fingerprints of different dimensionality, so I'm looking for a way to choose a meaningful set of fingerprint sizes.

I've read your posts about bit collisions and a paper that aimed at the same thing. Here's the link. Both works are brilliant and gave me ideas.

Just to share with you: I've tested larger fingerprint sizes, and in my tests I had to go up to 262k (2^18) bits to ensure there are no collisions. To me, this is interesting!
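For readers wondering how fingerprint size can change a similarity value, the following is a small sketch under stated assumptions: folding is modelled as a plain modulo, and the feature identifiers for the two molecules are invented for illustration (they are not RDKit output). The helpers fold and tanimoto are hypothetical names.

```python
def fold(feature_ids, n_bits):
    """Map feature identifiers to bit positions by modulo (simplified folding)."""
    return {fid % n_bits for fid in feature_ids}


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two sets of on-bit positions."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


# Hypothetical feature identifiers for two molecules.
mol_a = {3, 35, 70, 1000, 5000}
mol_b = {3, 67, 70, 2024, 9001}

print("unfolded", tanimoto(mol_a, mol_b))
for n_bits in (32, 1024, 2**18):
    print(n_bits, tanimoto(fold(mol_a, n_bits), fold(mol_b, n_bits)))
```

In this contrived case the short fingerprints overstate the similarity because unrelated features land on the same bit, while the 2^18 size reproduces the unfolded value.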

Regards,

LM 

Hi, I have an additional question on this.

I need to process the RDKit fingerprints (BitVector type) in Java, using operations such as AND, OR and XOR. I wanted to use the 'Fingerprints Expander' node to process the fingerprints, since KNIME's Java nodes can only handle a BitVector as a String variable.

I can execute the node if I use a BitVector size of 8192 (2^13) or below. However, with a BitVector size of 16,384 (2^14) I get the following error: Execute failed: For input string: "."

I've asked about the error in the Erl Wood forum (since the expander node is theirs). However, do you know what is actually happening? Do you know of any other way to process the bit vectors?

Regards,

LM

I assume the problem is that the expander is trying to create a table with 16K columns in KNIME. That's not going to be efficient.

If you really want to work with such large fingerprints, you may want to use the Python nodes. Assuming that you have a Python installation with access to the RDKit, you can construct an RDKit bit vector object like this:

from rdkit import DataStructs
for fps in input_table['mfp2']:
    fp = DataStructs.CreateFromBitString(fps)

You can do Python's usual bit operations on this:

fp1 & fp2

fp1 | fp2

etc.
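If the RDKit is not available in the Python environment, the same kind of operation can be sketched on the bit strings directly by converting them to Python integers, which have arbitrary precision and so should cope with 16,384-bit strings. This is an alternative technique, not the RDKit approach above; the helper names and the two 8-bit example strings are made up for illustration.

```python
def bits_to_int(bit_string):
    """Parse a string of '0'/'1' characters into an integer."""
    return int(bit_string, 2)


def int_to_bits(value, length):
    """Format an integer back into a zero-padded bit string of the given length."""
    return format(value, "b").zfill(length)


def popcount(value):
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


# Two hypothetical 8-bit fingerprints in bit-string form.
fp1 = "10110010"
fp2 = "00111000"
a, b = bits_to_int(fp1), bits_to_int(fp2)

print(int_to_bits(a & b, len(fp1)))  # AND
print(int_to_bits(a | b, len(fp1)))  # OR
print(int_to_bits(a ^ b, len(fp1)))  # XOR
print(popcount(a & b) / popcount(a | b))  # Tanimoto similarity
```

The results stay as strings of the original length, so they can be written back to a KNIME String column.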

-greg

Hi Greg,

I've got Python installed and am now trying the Python nodes (I'm new to Python, though, so please bear with me).

I understand that I have to construct the RDKit bit vector object, and I tried your code above. However, I got this error:

Traceback (most recent call last):
  File "C:\Program Files\KNIME\plugins\org.knime.python_3.1.2.v201603040957\py\PythonKernel.py", line 282, in execute
    exec(source_code, _exec_env, _exec_env)
  File "<string>", line 7, in <module>
ArgumentError: Python argument types in
    rdkit.DataStructs.cDataStructs.CreateFromBitString(unicode)
did not match C++ signature:
    CreateFromBitString(class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >)

 

What could be wrong with my Python installation? Thank you.

LM

Hi,

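It's likely not your Python installation. It looks like a unicode problem: the traceback shows that CreateFromBitString was passed a unicode object, while the C++ signature expects a plain string.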

Try calling CreateFromBitString(str(value)) instead of CreateFromBitString(value).

That might help,

-greg