duplicate

Dear all,

How can I remove duplicate structures?

 

Convert the structures to canonical (unique) SMILES, then use the GroupBy node: group by the SMILES column, aggregate all the other columns as "First Value", and choose to keep the original column names.

To get canonical SMILES, you can use the RDKit Canon SMILES node or the Indigo to Molecule node.
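For readers who want to see what "group by SMILES, keep the first value" amounts to outside of KNIME, here is a minimal Python sketch. It assumes the canonical SMILES have already been computed by one of the nodes above; the row layout, the column name canonical_smiles and the function dedupe_first are made up for illustration and are not part of any KNIME or RDKit API.

```python
def dedupe_first(rows, key="canonical_smiles"):
    """Keep the first row seen for each distinct value of `key`.

    This mirrors a group-by on `key` with a "first value" aggregation
    on every other column.
    """
    first_by_key = {}
    for row in rows:
        # setdefault stores the row only if this key has not been seen yet.
        first_by_key.setdefault(row[key], row)
    # Dicts preserve insertion order, so rows come back in order of
    # first appearance.
    return list(first_by_key.values())


# Hypothetical example rows; the SMILES strings are assumed to be
# canonical already.
rows = [
    {"id": "A1", "canonical_smiles": "CCO", "name": "ethanol"},
    {"id": "A2", "canonical_smiles": "c1ccccc1", "name": "benzene"},
    {"id": "A3", "canonical_smiles": "CCO", "name": "ethyl alcohol"},
]

unique_rows = dedupe_first(rows)
print([row["id"] for row in unique_rows])
```

Row A3 is dropped because its canonical SMILES matches A1, and the values kept for that structure are those of the first occurrence.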

Hope that helps,

Simon.

1. Simon, won't GroupBy also work for SDF text columns?

2. Another way would be to loop over all molecules and use exact substructure matching.

3. If you use Open Babel, just enter:

babel inputfilename.sdf outputfilename.sdf --unique cansmiNS

It may be possible to run this with the command line/external execution node.

Task: check an input SDFile for duplicates and create an SDFile containing the new and the duplicate structures from it, if possible with ID numbers.

I have a small SDFile and a huge Oracle Chemaxon database with 15 million compounds. It seems to me procedure 2 would be a sensible approach.
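As a rough illustration of the splitting step, here is a small Python sketch. Note that it does not implement procedure 2 (pairwise exact matching); it swaps in a lookup on a canonical structure key, which is usually cheaper when one side is very large. Everything named here is an assumption for the example: the record fields id and key, the function split_new_and_duplicates, and the idea that known_keys holds the keys already found in the database (for a small SDFile, these could be fetched by querying only the keys present in the file rather than loading all 15 million).

```python
def split_new_and_duplicates(incoming, known_keys):
    """Partition incoming records into (new, duplicates).

    incoming   -- iterable of dicts with hypothetical fields "id" and "key",
                  where "key" is a canonical structure key such as a
                  canonical SMILES string
    known_keys -- iterable of keys that already exist in the database
    """
    seen = set(known_keys)  # set lookup is O(1) on average
    new, duplicates = [], []
    for record in incoming:
        if record["key"] in seen:
            duplicates.append(record)
        else:
            new.append(record)
            # A structure repeated later in the same file is then
            # reported as a duplicate as well (a design choice).
            seen.add(record["key"])
    return new, duplicates


# Hypothetical data: two keys already registered, four incoming records.
database_keys = {"CCO", "c1ccccc1"}
sd_records = [
    {"id": "S1", "key": "CCO"},
    {"id": "S2", "key": "CC(=O)O"},
    {"id": "S3", "key": "CC(=O)O"},
    {"id": "S4", "key": "CCN"},
]

new_records, duplicate_records = split_new_and_duplicates(sd_records, database_keys)
print("new:", [r["id"] for r in new_records])
print("duplicates:", [r["id"] for r in duplicate_records])
```

Because each record keeps its id, both output lists can be written back out with their ID numbers.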

Is there a workflow somewhere that could get me started? How do I connect to the Oracle database? Would this work with the default ODBC driver?

Alex