Using KNIME to filter SD file against database

ChemMitch · October 10, 2014, 6:19pm

Hello,

I have an SD file that I would like to compare against a database (Direct cartridge in Oracle) and find the SD file records that are NOT in the database.

My first incination was a workflow that starts with an SD File Reader and then a Database Looping node to run exact match queries. However, the Database Looping node is generating errors. (Could not determine table spec from database query: ORA-29902: error in executing ODCIIndexStart() routine)

The same query works in SQL Developer so the Database Looping node is generating different SQL than expected.

It occurred to me that there might be a more efficient strategy to check an SD file for database overlap, so I'm posting this question to see what others have done.

Best regards,

Mitch

James_Davidson · October 14, 2014, 8:01am

Hi Mitch,

How big is the database? For small databases, I sometimes just retrieve everything into KNIME, then do any matching locally.

Alternatively, if you already have eg INCHIKEYs generated in your database, then this can be a fast / easy way of allowing cartridgeless exact-match searches (by first generating INCHIKEYs using eg RDKit or Indigo nodes).

But to be honest, I would also default to doing more-or-less what you tried with the database node... The ORA error suggests a problem with an index(?) - but I don't understand why you wouldn't get a similar issue via SQL Developer.

Kind regards

James

Aaron_Hart · October 14, 2014, 11:37am

I was just reading about this today by chance, I belive it is called an upsert (update/insert) or MERGE in SQL. We don't have direct support for this in KNIME, but you could probably create a temporary table in your DB based on your "upsert" data from KNIME and then use a Database SQL Executor node to do the MERGE. For oracle specific docs on merge, see here:

http://docs.oracle.com/cd/E18283_01/timesten.112/e13070/state.htm#insertedID46

Please report back and let us know if this works, if so it might make an interesting addtion to our database nodes in the future.

Regards,

Aaron

Ellert_van_Koperen · October 14, 2014, 10:28pm

Still interesting to understand why you are getting that error.

Can you isolate the data rows that are generating this error, and run them outside a loop?
Or even better, run those queries in a tool like toad to see what the database is giving you back for those queries.

It could be something rather trivial, like a null value that is not handled properly.
If that is the problem it is something you can fix within the sql statement using NVL(columnname, '') to replace the null with an empty string, or by using a view that does such a thing.

ChemMitch · October 17, 2014, 6:08pm

To process my SD file last week, I went a different route to find the overlap (loaded the SD file into a temp table and compared.)

But I'd like to understand what's going on.

So, Here is the SQL within the Database Looping node:

SELECT ID FROM newdb_2d_MOLTABLE WHERE FLEXMATCH(CTAB, '#PLACE_HOLDER_DO_NOT_EDIT#', 'MATCH=ALL') =1

with the selected field as either 'SDF Molecule' or a 'mol_string' column that I populate within a Java Snippet. (Results are the same.)

and here is the SQL (from knime.log) that is getting sent to the database:

SELECT * FROM (SELECT * FROM newdb_2d_MOLTABLE WHERE FLEXMATCH(CTAB, '', 'MATCH=ALL')) tempTable_3555961480094707744 WHERE rownum <= 0

It looks like Knime is not replacing '#PLACE_HOLDER_DO_NOT_EDIT#' with an actual value.

Does anyone know how to make the substitution work?

beginner · October 22, 2014, 12:12pm

To the initial question: I don't see any other way of doing this. There isn't really an efficient way. The best being as said using inchi(key) or canonical smiles. Then it will be very fast if column is indexed. But there is no way around the looping (eg. 1 query per molecule in sd-file) which will then be the limiting factor.

As for the last question I never had a problem with placeholder not being replaced. I've pretty much used it in the same way (different cartridge) and it worked.

In your case I suspect it might be due to using sd-file which will contain line breaks which might break the SQL statement building. My suggestion is to convert to smiles (in knime) and try that for the structure query,

Ellert_van_Koperen · October 23, 2014, 12:05pm

Did you exclude the data being a problem by feeding it a very simple sdf with for example only one record of e.g. methane in it?

Also, note that logging often times skips blobs (probably assuming it is not of any use to log binary data), so make sure that you actually see what is sent, not some 'sanitized' version of it.

ChemMitch · October 24, 2014, 1:07am

I just tried the workflow using a single-record SD with methane and still see the same issue.

When I copy/paste the SQL from the log file into SQL Developer, I see the same message as Knime shows -- this very strongly suggests that the node is not making the substitution correctly.

beginner · October 24, 2014, 10:31am

Have I tried, as I suggested, usign smiles instead? To repeat myself I suspect the line-breaks in the sd-file might break the replacement.

EDIT:

I agree with OP. This is a bug in the database looping node. The placeholder does not get replaced correctly when in a function call.

The solution is to use the TableRow to variable loop start node with the Database Reader node. The Database Reader node works correctly with the excat same SQL except using a flow variable as placeholder.

Fails in Database looping Node:

SELECT
	reg_number 
FROM 
	vw_reg_structures
WHERE
	CsCartridge.MoleculeContains(base_64_cdx, '#PLACE_HOLDER_DO_NOT_EDIT#','','')=1

Works in Database Reader Node:

SELECT
	reg_number
FROM 
	vw_reg_structures
WHERE
	CsCartridge.MoleculeContains(base_64_cdx,'$${SMolecule}$$', '', '')=1

thor · October 24, 2014, 10:54am

In order to determine the table structure while the node is configured (not executed), the placeholder is replaced by an empty string (since no values are known at that time). This is usually no problem if it occurs in a string comparison (col = '') or an IN block (col IN ('')). However it seems that the FLEXMATCH function you are using cannot handle an empty string input.

You can try to use the new "Database Connector" nodes and configure the connection to not "Retrieve metadata in configure".

ChemMitch · October 25, 2014, 12:50am

Thanks everyone! The TableRow to Variable and Database Reader nodes worked!

The reason I didn't want to use SMILES was the loss of information going from MOL to SMILES.

Thor, when I tried a Database Connector node in front of the Database Looping node and made sure 'Retrieve metadata in configure' was unchecked, the Database Looping node still gave an error.

Thanks to all!
Mitch

thor · October 25, 2014, 9:29am

The placeholder does get replaced. It's simple string replacement that works regardless of how the replacement text looks like. There is something else going wrong here. Can you share a workflow?

ChemMitch · October 26, 2014, 3:39am

Here is the form of the workflow that produces the errors.

I'm interested in seeing what you find.

sdf_table_errors.zip

ChemMitch · October 26, 2014, 4:12am

One more question related to the workflow I'm trying to run.

A new form of the workflow has SDF Reader -> Java snippet (simple) -> Java snippet row splitter ->Table Row to variable Loo Start -> Database reader -> Loop end

and this runs without error, however, the the loop starts off running very fast (1000 records in a minute) but then becomes progressively slower. The SD file has about 40,000 records and would be prohibitively slow to run to completion so I'm testing it on a small subset.

Are loops with Database Reader known to be slow?

I'm testing this with a database on my desktop so there should be no network issues.

beginner · October 28, 2014, 6:20am

You are executing 40'000 heavy SQL statements (chemical strucutre searches) . So yes, that's always going to be rather slow. And of course it then depends on your hardware and database configuration. CPU and having an SSD (or not) will have a considerably impact. But so will the amount of memory assigend to the database engine.

Speed could sure be improved but all in all, it will remain a very slow operation.

thor · October 28, 2014, 4:41pm

The Database Looping node didn't respect the "Retrieve metadata during configure" setting from the connector node. This is fixed in 2.11.