Pairwise Tanimoto values

I've just started using KNIME, and I have to commend you on a fantastic piece of code. Nice work!

One of the typical analyses I do is to calculate the pairwise Tanimoto scores for all compounds in a particular SDF file. I can easily enough get the compounds in, compute fingerprints, even convert those fingerprints to a CSV file (with each bit as a separate column), and do the calculation myself outside of KNIME. How do I perform such a process within KNIME? Ideally, I would like a table with two compounds per row and their Tanimoto similarity.

-Kirk

Hi Kirk,

I'm sure the Tanimoto score computation deserves a separate node implementation but until we (or you?) implement it, you could use the Java snippet node.

Assuming that both bit vectors are available as (two separate) string columns ("011010101.."), you could use the following code:

// Replace the two column references with your own column names.
String bit1 = $bitvector_column1$;
String bit2 = $bitvector_column2$;
if (bit1.length() != bit2.length()) {
  throw new RuntimeException("Different length");
}
int unionCount = 0;         // bits set in at least one vector
int intersectionCount = 0;  // bits set in both vectors
for (int i = 0; i < bit1.length(); i++) {
  boolean bit1IsTrue = bit1.charAt(i) == '1';
  boolean bit2IsTrue = bit2.charAt(i) == '1';
  if (bit1IsTrue && bit2IsTrue) {
    intersectionCount++;
  }
  if (bit1IsTrue || bit2IsTrue) {
    unionCount++;
  }
}
double result;
if (unionCount > 0) {
  result = intersectionCount / (double)unionCount;
} else {
  // Two all-zero vectors are treated as identical.
  result = 1.0;
}
result

You will need to change the first two lines to use the correct column names. The return type of the expression is then a double.
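
If you want to check the logic outside KNIME first, here is a minimal standalone sketch of the same calculation. The class and method names (TanimotoExample, tanimoto) are made up for illustration and are not part of KNIME.

```java
// Standalone version of the snippet logic; names are illustrative only.
final class TanimotoExample {

    /** Tanimoto coefficient of two equal-length bit strings such as "0110". */
    static double tanimoto(String bits1, String bits2) {
        if (bits1.length() != bits2.length()) {
            throw new IllegalArgumentException("Different length");
        }
        int intersectionCount = 0; // bits set in both strings
        int unionCount = 0;        // bits set in at least one string
        for (int i = 0; i < bits1.length(); i++) {
            boolean a = bits1.charAt(i) == '1';
            boolean b = bits2.charAt(i) == '1';
            if (a && b) {
                intersectionCount++;
            }
            if (a || b) {
                unionCount++;
            }
        }
        // Same convention as the snippet: two all-zero vectors count as identical.
        return unionCount > 0 ? intersectionCount / (double) unionCount : 1.0;
    }

    public static void main(String[] args) {
        // "0110" and "0101" share one set bit out of three set overall,
        // so the score is one third.
        System.out.println(tanimoto("0110", "0101"));
    }
}
```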

Hope you find it useful.
Bernd

That's a good start (thank you), but I'm looking for something a bit more complex. One issue with the code you supplied is that it assumes the two fingerprints are in the same row. This raises a question: is each row dedicated to a single chemical structure, or can a row hold multiple structures?

-Kirk

Hi Kirk,

I had assumed that you already had the two compounds and their fingerprints in one row. Having read your first post again, I see that you have one structure per row and want to build a cross product of the table with itself (to get all possible pairs). Right?

I don't think any of the currently available nodes allows you to do that. Sorry. You will need to implement this functionality yourself. It shouldn't be a big deal unless you care about performance: the number of pairs grows quadratically with the number of rows, and iterating 1000+ times over a table that resides on disk is prohibitively expensive.
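
To give an idea of the core loop, here is a rough sketch in plain Java. It is not a KNIME node and does not use the KNIME API: it assumes all fingerprints fit in memory as bit strings, and the names (PairwiseTanimoto, Pair, allPairs) are invented for this example. Since the Tanimoto score is symmetric, each unordered pair is visited only once.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch only; none of these names come from KNIME.
final class PairwiseTanimoto {

    /** One output row: the indices of two compounds and their similarity. */
    static final class Pair {
        final int first;
        final int second;
        final double tanimoto;

        Pair(int first, int second, double tanimoto) {
            this.first = first;
            this.second = second;
            this.tanimoto = tanimoto;
        }
    }

    /** Converts a bit string such as "0110" into a BitSet. */
    static BitSet parse(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i);
            }
        }
        return set;
    }

    /** Tanimoto score from bit counts; does not check that lengths match. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();
        both.and(b);
        int intersection = both.cardinality();
        int union = a.cardinality() + b.cardinality() - intersection;
        return union > 0 ? intersection / (double) union : 1.0;
    }

    /** Returns n * (n - 1) / 2 rows, one per unordered pair (i < j). */
    static List<Pair> allPairs(List<String> fingerprints) {
        // Parse once up front so the quadratic loop works on in-memory BitSets.
        List<BitSet> sets = new ArrayList<>();
        for (String fingerprint : fingerprints) {
            sets.add(parse(fingerprint));
        }
        List<Pair> rows = new ArrayList<>();
        for (int i = 0; i < sets.size(); i++) {
            for (int j = i + 1; j < sets.size(); j++) {
                rows.add(new Pair(i, j, tanimoto(sets.get(i), sets.get(j))));
            }
        }
        return rows;
    }
}
```

Holding the fingerprints in memory avoids re-reading the table for every row, but the number of output rows is still quadratic in the number of compounds.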

Best regards
Bernd

Yep, that is exactly what I intended. Typically I evaluate all the pairwise Tanimoto values and keep each pair whose value is above a given threshold (usually ~0.85). The resulting table looks something like this:

Structure #1, Structure #2, Tanimoto, Difference in Activity

This allows me to identify all the compounds that are the most similar in structure but have a large change in activity. Kind of an SAR analysis.
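
A rough sketch of that filtering step in plain Java, outside KNIME, might look like this. The class, field, and method names are invented for illustration. Taking the absolute difference in activity and sorting the largest differences first are assumptions, and the similarity matrix is assumed to have been computed already.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and conventions here are assumptions.
final class SarPairs {

    /** One row of the result table described above. */
    static final class Row {
        final int first;
        final int second;
        final double tanimoto;
        final double activityDifference;

        Row(int first, int second, double tanimoto, double activityDifference) {
            this.first = first;
            this.second = second;
            this.tanimoto = tanimoto;
            this.activityDifference = activityDifference;
        }
    }

    /**
     * Keeps each unordered pair whose similarity is above the threshold.
     * similarity[i][j] holds the Tanimoto score of compounds i and j;
     * activity[i] holds the measured activity of compound i.
     */
    static List<Row> similarPairs(double[][] similarity, double[] activity, double threshold) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < activity.length; i++) {
            for (int j = i + 1; j < activity.length; j++) {
                if (similarity[i][j] > threshold) {
                    double difference = Math.abs(activity[i] - activity[j]);
                    rows.add(new Row(i, j, similarity[i][j], difference));
                }
            }
        }
        // Assumption: the most interesting pairs are those with the largest activity change.
        rows.sort((a, b) -> Double.compare(b.activityDifference, a.activityDifference));
        return rows;
    }
}
```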

As for developing a component myself, I am certainly not averse to that idea. I'm just getting started with KNIME, so it may be a while before I'm fully up to speed for developing my own plugin. One question I have (which I should just check myself - in due time) is whether the table in KNIME allows multiple structures per row. Is this possible?

-Kirk

That sounds doable if you don't mind implementing this cross-product node yourself. And yes, KNIME tables can handle more than one structure per row; the columns are completely independent of each other. The only restriction is that column names must be unique.

Best regards,
Bernd

That is good news, indeed. I hope to start working with KNIME more in the very near future. I'll keep you updated on my progress, and if any components come to fruition, I'll do my best to share them.