MultiSubstructure Matcher

Hi,

Is it possible to have a Substructure Matcher node accept multiple query molecules on the second port, so that the node runs a substructure search on the dataset with each query molecule? You would then have the option to output:

1. Union Dataset of the results, which is the total matches returned across all the query molecules (a molecule that matches several queries appears in the output only once, not multiple times).

2. Exclusive Dataset of the results, which is the total matches returned across all the query molecules, minus any molecules which matched in ALL the query molecules.

3. Intersection Dataset of the results, which is the molecules which are matched in ALL the query molecule searches only.
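To make the three options concrete, here is a minimal Python sketch using plain sets of molecule identifiers. It only illustrates the intended semantics and says nothing about how the node is implemented; `combine_hits` and the molecule IDs are made up for the example.

```python
def combine_hits(hits_per_query, mode):
    """Combine per-query hit sets into one result set.

    hits_per_query: list of sets of molecule identifiers, one set per query.
    mode: "union", "intersection", or "exclusive" (union minus intersection).
    """
    if not hits_per_query:
        return set()
    union = set().union(*hits_per_query)
    intersection = set.intersection(*hits_per_query)
    if mode == "union":
        return union
    if mode == "intersection":
        return intersection
    if mode == "exclusive":
        return union - intersection
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical example: molecule IDs matched by three queries.
hits = [{"m1", "m2", "m3"}, {"m2", "m3", "m4"}, {"m3", "m5"}]
print(sorted(combine_hits(hits, "union")))         # ['m1', 'm2', 'm3', 'm4', 'm5']
print(sorted(combine_hits(hits, "intersection")))  # ['m3']
print(sorted(combine_hits(hits, "exclusive")))     # ['m1', 'm2', 'm4', 'm5']
```

Because the results are sets, a molecule matched by several queries appears only once in the union.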

Thanks,

Simon.

Simon,

There are some problems with defining the concept of the MultiSubstructure Matcher node precisely.

First, what is the Exclusive dataset if you have more than two queries? For instance, if the queries are named A, B, and C, should the result contain structures that match A and B but not C? Or should it contain only structures that match "A, not B, not C", "B, not A, not C", "C, not A, not B"?

Second, do you want to allow the matches to overlap (i.e. to be independent from each other)? If yes, then you can do everything you need by combining a few Substructure Matcher nodes, or looping over one node.

In case you need non-overlapping matches, that raises another question -- do you need the minimum number of non-overlapping matches, or the maximum? Say the structure is "NCCO" and the queries are "NC", "CC", and "CO". You can have here a single match ("CC") or two matches ("NC" + "CO"). Which to choose? In any case, it would be quite a problem to implement a good algorithm that gives a deterministic result for this task with a large number of input queries.
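The ambiguity in the "NCCO" example can be shown with a small brute-force Python sketch. The atom index sets below are written by hand for illustration (atoms of "NCCO" numbered 0 to 3) and are not computed by Indigo; the function names are hypothetical.

```python
from itertools import combinations

# Each query match is represented by the set of atom indices it covers in "NCCO".
matches = {
    "NC": frozenset({0, 1}),
    "CC": frozenset({1, 2}),
    "CO": frozenset({2, 3}),
}


def is_disjoint(names):
    """True if no two of the named matches share an atom."""
    return all(matches[a].isdisjoint(matches[b]) for a, b in combinations(names, 2))


def maximal_disjoint_sets():
    """All sets of pairwise non-overlapping matches that cannot be extended."""
    valid = [
        frozenset(c)
        for r in range(1, len(matches) + 1)
        for c in combinations(sorted(matches), r)
        if is_disjoint(c)
    ]
    # Keep only sets that are not a proper subset of another valid set.
    return [s for s in valid if not any(s < t for t in valid)]


for s in maximal_disjoint_sets():
    print(sorted(s))
```

This prints `['CC']` and `['CO', 'NC']`: both are valid non-overlapping answers, of size one and two respectively. The enumeration tries every subset of matches, so it grows exponentially and is only usable for a handful of queries.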

 

Best regards,

Dmitry

Hi,

Thanks for the detailed response. In terms of the exclusive dataset, I meant the Union minus the Intersection, if that makes sense. So all the hits, except the ones which came up in every one of the searches.

In terms of the searches, I was thinking of each being run independently and then they are combined in the background according to whether Union, Intersection, or Exclusive is wanted.

I appreciate you can do the searches with separate Substructure Matcher nodes and then join the results together, or, as you say, by using a loop. However, it then becomes tricky to get the desired outcome, as the list will contain some molecules twice or more where they came up in more than one query. Getting the Intersection or Union requires converting everything to canonical SMILES, using the Reference Row Filter node to subtract one set of molecules from another, and then converting back to Indigo for further analysis. It all becomes a bit messy.

I hope you can see where I am coming from.

Thanks,

Simon.

Hello Simon,

The latest version of the Substructure Matcher node works with multiple queries. There is an option to match all queries, or some fixed number of queries. There is also an option to append a column with the number of queries matched.

For Union Dataset behavior you need to use the default option: "Match at least 1 query".

For Intersection Dataset behavior you need to use the "Match all queries" option.

For Exclusive Dataset behavior you can use "Match at least 1 query", and then filter the result by the number of queries matched: keep the rows where it is less than or equal to the number of queries - 1.
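As a rough illustration of filtering on the appended match-count column, here is a small Python sketch. It assumes Exclusive means union minus intersection (matched by at least one query but not by all of them), as defined earlier in the thread; `classify` and the sample counts are hypothetical and do not use any KNIME or Indigo API.

```python
def classify(match_counts, n_queries):
    """Split molecules into union / intersection / exclusive sets.

    match_counts: dict mapping molecule ID to the number of queries it matched.
    n_queries: total number of query molecules.
    """
    union = {m for m, c in match_counts.items() if c >= 1}
    intersection = {m for m, c in match_counts.items() if c == n_queries}
    # Exclusive: matched at least one query, but not all of them.
    exclusive = {m for m, c in match_counts.items() if 1 <= c <= n_queries - 1}
    return union, intersection, exclusive


# Hypothetical counts for three queries.
counts = {"m1": 1, "m2": 2, "m3": 3, "m4": 0}
union, intersection, exclusive = classify(counts, 3)
print(sorted(union))         # ['m1', 'm2', 'm3']
print(sorted(intersection))  # ['m3']
print(sorted(exclusive))     # ['m1', 'm2']
```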

Exclusive Dataset is not very intuitive, so I decided not to add such an option. If you can propose an intuitive user interface description for all of these cases, then we might add such an option explicitly.

Best regards,
Mikhail

Hi Mikhail,

I'm not sure what's happened, but there now seems to be a bug in the Substructure Matcher node.

If I take a dataset of 300 molecules and use 1 query molecule to do a substructure search with, the node freezes after 6% completion.

If I process the same molecules 15 at a time using the "Chunk Loop Start" and "Loop End" nodes, the Substructure Matcher node works. If I process 20 at a time, the Substructure Matcher node freezes.

When the node freezes it also causes KNIME to freeze up, resulting in the onscreen graphics going strange.

I am unsure what the problem is here; this didn't use to happen.

Simon.

I'm also finding that the Feature Remover node is randomly deleting structures too, leaving a Missing cell.

Something seems to be wrong with the Indigo build!!

I've uninstalled the Indigo nodes and reinstalled them but the problem remains.

Simon.

These are the types of errors I am getting:

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 1 from file "knime_container_20111031_4989205758741791838.bin.gz": CMF loader: cannot decode bond: code 6; Suppressing further warnings.

ERROR Substructure Matcher Execute failed: org.knime.core.data.DataType$MissingCell cannot be cast to com.ggasoftware.indigo.knime.cell.IndigoMolCell

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 1 from file "knime_container_20111031_4989205758741791838.bin.gz": CMF loader: cannot decode bond: code 6; Suppressing further warnings.

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 1 from file "knime_container_20111031_4989205758741791838.bin.gz": CMF loader: cannot decode bond: code 6; Suppressing further warnings.

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 12 from file "knime_container_20111031_3011741161537590404.bin.gz": CMF loader: cannot decode bond: code 6; Suppressing further warnings.

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 12 from file "knime_container_20111031_3011741161537590404.bin.gz": CMF loader: cannot decode bond: code 6; Suppressing further warnings.

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 12 from file "knime_container_20111031_3011741161537590404.bin.gz": CMF loader: cannot decode bond: code 6; Suppressing further warnings.

ERROR Substructure Matcher Execute failed: org.knime.core.data.DataType$MissingCell cannot be cast to com.ggasoftware.indigo.knime.cell.IndigoMolCell

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 12 from file "knime_container_20111031_3011741161537590404.bin.gz": CMF loader: cannot decode bond: code 6; Suppressing further warnings.

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 21 from file "knime_container_20111031_7618437206632658824.bin.gz": CMF loader: cannot decode bond: code 6; Suppressing further warnings.

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 21 from file "knime_container_20111031_7618437206632658824.bin.gz": CMF loader: cannot decode bond: code 6; Suppressing further warnings.

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 38 from file "knime_container_20111031_5765603545419968026.bin.gz": CMF loader: cannot decode bond: code 8; Suppressing further warnings.

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR BufferFromFileIteratorVersion20 Errors while reading row 38 from file "knime_container_20111031_5765603545419968026.bin.gz": CMF loader: cannot decode bond: code 8; Suppressing further warnings.

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

ERROR IndigoMolCell Error while unserializing Indigo object

Hello Simon,

This is really strange. What operating system are you using?

Could you provide a workflow that shows this behaviour?

There is an example workflow, 09907006_indigo-tpsa-example, uploaded on the KNIME Example Flow Server under 099_Community > 07_Indigo > 09907006_indigo-tpsa-example. Could you check it? On my machine it works fine.

Best regards,
Mikhail

The operating system is Windows XP Service Pack 3, 32-bit.

Unfortunately I am unable to provide the workflow due to sensitive structures.

I will try the workflow you mention and let you know.

Simon.

Simon,

I have fixed this issue. Everything should work fine now. You can update the Indigo nodes to version 1.1.0.0001135.

Thank you for the feedback and for the prompt bug report!

Best regards,
Mikhail

Hello Simon,

KNIME has changed its version numbering, and the latest Indigo from the nightly build has version 1.1.0.201110312246. You can check this version. The reported bug is fixed there.

Best regards,
Mikhail

Hi Mikhail,

I have updated the Indigo nodes to the latest build, and unfortunately the Substructure Matcher node is still not working. It seems to hang after 19 rows of substructures.

Any ideas what the problem is?

Thanks,

Simon.

Hello Simon,

Have you tested 09907006_indigo-tpsa-example example?

Could you tell me how many target molecules and query molecules you have? What are the input formats, and what options are used in the Substructure Matcher node?

Best regards,
Mikhail

Hi Mikhail,

Sorry for the delay.

I can confirm I have tried the Indigo-TPSA example and this works perfectly fine. I also connected the "Substructure Matcher" node up to the dataset within this workflow and it works fine.

However, my molecule dataset still causes the Substructure Matcher node to freeze after 19 rows, and it's not specific to the molecule in the 19th row: if I delete the first 20 rows and rerun it, it again stops on the new 19th row! Again, if I run the workflow in chunks of 10 rows at a time, the Substructure Matcher node works fine.

The settings I am using in the Substructure Matcher node are just the defaults: match at least 1 query, normal mode, and no options ticked. I am using just 1 query molecule, and 304 molecules to search against.

I have checked the structures with the Valence Checker node, and nothing is flagged. I have also run the molecules through the Feature Remover node with everything selected for removal and then put them through the Substructure Matcher node, and again it freezes. If I use the Substructure Match Counter node, this works fine.

I have used this same dataset on an earlier version of the Indigo nodes, and it worked perfectly with the Substructure Matcher node. The structures I am feeding in are normal organic molecules with a molecular weight of around 300; some have chirality, others don't. I have tried aromatising and dearomatising them, but the problem remains. I have tried the input structures in both SMILES and SDF format, converting the 304 molecules to Indigo format with the Molecule to Indigo node and the 1 query molecule with the Query Molecule to Indigo node.

Thanks

Simon.

Hello Simon,

I don't want to ask too many questions, but I still cannot reproduce this bug. I have tested it on a virtual machine with 32-bit Windows XP, and it works fine on our test set.

Is this bug query-dependent? Can you reproduce it with another query?

Can you reproduce it with another set of molecules?

Could you try installing the Indigo nodes from the stable releases and check whether you can reproduce it?

Best regards,
Mikhail

Firstly, I can answer that the new stable release also has the same issues.

I will see if I can reproduce the bug on other sets of molecules.

Simon.

Hi Mikhail,

I haven't managed to reproduce the bug on non-sensitive molecules yet.

However, I can tell you that the Substructure Matcher node works fine with Indigo release 1.0.0.0000965 but does not work with Indigo release 1.1.0.20110281309 onwards.

I have done four fresh KNIME installations on two different PCs, and the outcome is always the same.

Thanks,

Simon.

Simon,

Thank you very much for this detailed investigation! After a detailed review I found a modification that causes the freeze, and I was able to reproduce it: you simply need to turn on the "write to disk" strategy on the Memory Management tab.

I have committed the bugfix, and you can check the Indigo nodes from the nightly builds on the morning of November 5. Thank you again! When you report that it works, I will integrate this bugfix into the stable branch.

Best regards,
Mikhail

Hi Mikhail,

Sorry I wasn't able to pin down the bug more specifically for you! But I can tell you that whatever change you made overnight has fixed the problem. The Substructure Matcher node is now working without freezing up.

Thanks for all your hard work in working out where the bug was. It's much appreciated :-)

Thanks,

Simon.