Issues with fingerprints for substructure searching

beginner · December 5, 2019, 1:05pm

I’ve run into 2 issues with fingerprints and substructure searching. I’m using KNIME 4.0.2 with the recently released new RDKit Nodes version. I only started experimenting today, so the issues are probably unrelated to versions.

The only fingerprint that seems to work with the defaults is the RDKit fingerprint.

Pattern Fingerprint:

AFAIK this is used by the RDKit cartridge? I’m not sure what I’m doing wrong, but with my data set (very small molecules) I get a terrible screen-out rate: way too many matches. I tried 1024 and 2048 bits. Either I don’t see how this is useful, or the node’s default settings are terrible.

Layered Fingerprint:

It screens out valid matches with the default flags of the KNIME node. Trying to reveal as little as possible: in the wrongly screened-out molecules, a part of my drawn pattern that is not in a ring in the query is in a ring in the molecule. I only noticed this by pure accident.
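For context, a fingerprint screen normally keeps a molecule only if every bit set in the query fingerprint is also set in the molecule’s fingerprint. The following is a minimal sketch of that test with made-up bit patterns (not real RDKit output; `passes_screen` and all values are hypothetical). It shows how a single query bit that the molecule lacks is enough to discard a true match:

```python
def passes_screen(query_fp: int, target_fp: int) -> bool:
    """Return True if all bits set in query_fp are also set in target_fp."""
    return (query_fp & target_fp) == query_fp


# Hypothetical fingerprints, written as small integers for readability.
target_fp = 0b0111          # bits the molecule sets
query_plain = 0b0011        # query bits that the molecule also sets
query_extra = 0b1011        # same query plus one bit the molecule does not set

print(passes_screen(query_plain, target_fp))  # the molecule is kept
print(passes_screen(query_extra, target_fp))  # the molecule is screened out
```

If a layer makes the query set a bit that a genuinely matching molecule does not set, the screen produces exactly this kind of false negative.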

Now, I know very little about C++ and the flags used, but as far as I can tell the node’s default flag value of 65535 means all layers are enabled. Since there are only 6 layers, 63 would also enable all layers, right?

Looking at the layers then explains the screen-out errors:

Layer definitions:

0x01: pure topology
0x02: bond order
0x04: atom types
0x08: presence of rings
0x10: ring sizes
0x20: aromaticity

Anything above 0x04 will break a classic* substructure search, right? So for a valid substructure search fingerprint, I would need to set the flags value to 7 (0000 0111) in the KNIME node? (Doing this for the query only seems to work.)

If my analysis is correct, I strongly suggest adjusting the KNIME node’s documentation, as it says:
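The flag arithmetic can be checked with a small sketch. It only decodes a bitmask against the six layers listed in this post and does not call RDKit; the names `LAYERS` and `enabled_layers` are made up for illustration, and it assumes no layer is defined above 0x20:

```python
# Layer bits as listed in this post (not read from RDKit itself).
LAYERS = {
    0x01: "pure topology",
    0x02: "bond order",
    0x04: "atom types",
    0x08: "presence of rings",
    0x10: "ring sizes",
    0x20: "aromaticity",
}


def enabled_layers(flags: int) -> list:
    """Return the names of the listed layers whose bit is set in flags."""
    return [name for bit, name in LAYERS.items() if flags & bit]


# Sum of all listed layer bits: 0x3F, i.e. 63.
ALL_LISTED = sum(LAYERS)

# Under the assumption above, 65535 (0xFFFF) and 63 select the same layers.
print(enabled_layers(65535) == enabled_layers(ALL_LISTED))

# A flags value of 7 keeps only the first three layers.
print(enabled_layers(7))
```

With a value of 7, only topology, bond order and atom types remain enabled, which matches the binary pattern 0000 0111.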

Layered: An experimental substructure-matching fingerprint

With the default settings, however, it does not actually work correctly for a classic* substructure search. At the very least, the defaults should be changed and the documentation should explain the flag values and which layers they activate. Even better would be the ability to select the desired layers in the dialog, rather than exposing this “C++ flag magic” to end users.

I would have completely missed this if I hadn’t accidentally chosen a test substructure that causes the issue.

* Of course, in some cases the extra layers are actually desired, but for a classic substructure search (subgraph match) they lead to wrong results.

manuelschwarze · March 3, 2020, 11:41am

Hi @beginner (although your name doesn’t seem to be accurate),

Thanks for taking the time last year to present this case so thoroughly.

Pattern Fingerprint:
I checked the KNIME node implementation, but there is no magic when calling the RDKit functionality. The simple settings you experimented with are just passed through:

fp = RDKFuncs.PatternFingerprintMol(mol, settings.getNumBits());

If any strange default settings are applied, as you assume, this would need to be addressed at the RDKit level.

Layered Fingerprints:
I agree with your reasoning about the layered fingerprints. I will update the node documentation for now, and in the longer run we could improve the node to make it easier to combine the right flags, as you suggested.

I do wonder why it is so hard to find documentation for the layer definitions you posted here (without going through the C++ code). I found them only via Google, in one of Greg’s presentations about fingerprints: https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf

It would be interesting to hear what @greglandrum says about these things and what improvements he would suggest. I am also wondering whether additional layer flags will be added in the future.

-Manuel

system · April 21, 2023, 9:09pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.