Issues with fingerprints for substructure searching


I’ve run into two issues with fingerprints and substructure searching. I’m using KNIME 4.0.2 with the recently released new version of the RDKit Nodes. I only started experimenting today, so the issues are probably not specific to these versions.

The only fingerprint that seems to work with the default settings is the RDKit fingerprint.

Pattern Fingerprint:

AFAIK this is the fingerprint used by the RDKit cartridge? I’m not sure what I’m doing wrong, but with my data set (very small molecules) I get a terrible screen-out rate: far too many molecules pass the screen. I tried both 1024 and 2048 bits. Either I don’t understand how this is supposed to be useful, or the node’s default settings are poorly chosen.
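For anyone unfamiliar with the term, here is a minimal sketch of what I mean by screening and screen-out rate. It is plain Python and does not call RDKit; the 8-bit fingerprints are made up for illustration, and the function names are my own. A molecule passes the screen when every bit set in the query fingerprint is also set in the molecule's fingerprint, so a query that sets only a few bits will let most molecules through.

```python
def passes_screen(query_fp: int, mol_fp: int) -> bool:
    """Return True if every bit set in the query is also set in the molecule."""
    return (query_fp & mol_fp) == query_fp


def screen_out_rate(query_fp: int, mol_fps: list) -> float:
    """Fraction of molecules rejected by the fingerprint screen."""
    rejected = sum(1 for fp in mol_fps if not passes_screen(query_fp, fp))
    return rejected / len(mol_fps)


# Hypothetical 8-bit fingerprints, invented for illustration only.
query = 0b00000101
library = [0b00000101, 0b11111111, 0b00001101, 0b00000100]

rate = screen_out_rate(query, library)
print(f"screen-out rate: {rate:.2f}")
```

Only the last molecule lacks one of the query bits, so three of four candidates would still have to go through the full substructure match.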

Layered Fingerprint:

This one screens out valid matches with the default flags of the KNIME node. Trying to reveal as little as possible about my structures: in the wrongly screened-out molecules, a part of my drawn pattern that is not in a ring in the query is part of a ring in the molecule. I only noticed this by pure accident.

Now, I know very little about C++ and the flags used, but as far as I can tell the node’s default flag value of 65535 means all layers are enabled. Since there are only 6 layers, 63 would also enable all of them, right?

Looking at the layers then explains the screen-out errors:

Layer definitions:

  • 0x01: pure topology
  • 0x02: bond order
  • 0x04: atom types
  • 0x08: presence of rings
  • 0x10: ring sizes
  • 0x20: aromaticity

Anything above 0x04 will break a classic* substructure search, right? So for a valid substructure search fingerprint I would need to set the flags value to 7 (0000 0111) in the KNIME node? (Doing this for the query only seems to work.)
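The flag arithmetic can be checked with a few lines of plain Python. This is only a sketch of the bitmask logic based on the six layer definitions listed above; it does not call RDKit, the helper name is my own, and it assumes that bits above 0x20 have no effect, which I have not confirmed.

```python
# Layer bits as listed in the layer definitions above.
LAYERS = {
    0x01: "pure topology",
    0x02: "bond order",
    0x04: "atom types",
    0x08: "presence of rings",
    0x10: "ring sizes",
    0x20: "aromaticity",
}


def enabled_layers(flags: int) -> list:
    """Names of the listed layers whose bit is set in flags."""
    return [name for bit, name in LAYERS.items() if flags & bit]


# Combine only the three layers that do not encode ring or aromaticity information.
classic_flags = 0x01 | 0x02 | 0x04

for flags in (65535, 63, classic_flags):
    print(flags, format(flags, "016b"), enabled_layers(flags))
```

Under that assumption, 65535 and 63 select the same six layers, and 7 keeps only topology, bond order, and atom types.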

If my analysis is correct, I strongly suggest adjusting the KNIME node’s documentation, which currently says:

  • Layered: An experimental substructure-matching fingerprint

With the default settings, it does not actually work correctly for a classic* substructure search. At the very least, the defaults should be changed, and the documentation should explain the flag values and which layers they activate. Even better would be the ability to select the desired layers in the dialog instead of exposing this “C++ flag magic” to end users.

I would have completely missed this if I hadn’t accidentally chosen a test substructure that triggers the issue.

* Of course, in some cases the extra layers are actually desired, but for a classic substructure search (subgraph match) they lead to wrong results.