Problems with the CDK fingerprint plugin

First of all I would like to thank and congratulate the KNIME team for making such a marvellous and user-friendly piece of software. I am sure it will be of great use to a lot of people.

I have experienced a problem with the calculation of molecular fingerprints using the CDK plugin version 1.2.0 BETA in the latest version of KNIME. When I pass a CDK format molecule to the CDK Fingerprint node, it outputs the fingerprint in an integer column that only displays one number instead of a string of numbers as I had expected. That is, the DataTable displays the column as type I, while the DataTableSpec says the type is BitVecCell and the DataColumnProperties tab says the column is a string. Am I doing something wrong, or is there a workaround? Have any of you tried using the Fingerprint node with success?

Actually the output is correct (at least it's the outcome that we had in mind when we implemented the node). The number that you see there is the cardinality of the generated fingerprint (how many bits are on). The cell itself is of type BitVectorCell, the rendering just doesn't tell you that (but if you look at the icon of the fingerprint column in the header row of the table, you see that this is not an ordinary int). I can go ahead and add two more renderers for this type of column: one displaying the bit string, the other a hex representation.
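
To make the three views of a fingerprint mentioned here concrete (cardinality, bit string, hex), the following is a minimal sketch using plain java.util.BitSet rather than KNIME's BitVectorCell. The class and method names are made up for illustration, and the hex bit order (bit 0 as the most significant bit of the first digit) is an assumption, not necessarily what a KNIME renderer would use.

```java
import java.util.BitSet;

class FingerprintRenderSketch {

    /** Renders the first 'length' bits as a string of 0s and 1s, bit 0 first. */
    static String toBitString(BitSet bits, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(bits.get(i) ? '1' : '0');
        }
        return sb.toString();
    }

    /**
     * Renders the first 'length' bits as hex, four bits per digit.
     * Assumption: bit 0 is the most significant bit of the first digit.
     */
    static String toHexString(BitSet bits, int length) {
        StringBuilder sb = new StringBuilder();
        for (int start = 0; start < length; start += 4) {
            int nibble = 0;
            for (int j = 0; j < 4; j++) {
                nibble <<= 1;
                if (start + j < length && bits.get(start + j)) {
                    nibble |= 1;
                }
            }
            sb.append(Character.forDigit(nibble, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A toy 16-bit fingerprint with bits 3, 6 and 7 switched on
        BitSet fp = new BitSet(16);
        fp.set(3);
        fp.set(6);
        fp.set(7);
        System.out.println(fp.cardinality());    // what the table view shows
        System.out.println(toBitString(fp, 16)); // bit string view
        System.out.println(toHexString(fp, 16)); // hex view
    }
}
```

For the toy fingerprint in main, the cardinality is 3, the bit string is 0001001100000000 and the hex view (under the bit order assumed above) is 1300.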

So where can you use it? We have successfully used the generated fingerprints in conjunction with the Neighborgram node (additional plugin, see here: http://knime.org/download_extensions.html#neighbor): It uses the fingerprints (along with the tanimoto distance) to construct neighborgrams, which may help to identify groups of similar molecules. Some other implementations of learning algorithms/nodes may support fingerprints as descriptors in the future as well. Feel free to contribute.

Quote:
That is, the DataTable displays the column as type I, while the DataTableSpec says the type is BitVecCell and the DataColumnProperties tab says the column is a string

Actually that is a good one: The DataColumnProperties panel should not display that icon. I will report that as a bug, thanks.

Quote:
Have any of you tried using the Fingerprint node with success?

Yes, I did. 8)

Thanks for the reply!
Ok, the usage was a little different from what I had thought. I tried to output the fingerprint column with the CSV Writer and only got the cardinality number, which I took to be a single fingerprint bit ID.
The ARFF Writer node outputs the fingerprint as a continuous bit string without spaces (e.g. 0001001100…). This is closer to what I wanted.
May I suggest a node that does the inverse of the Bitvector Generator node, so that one can transform the bitvector string into a sparse string of IDs and/or a multi-column representation? This would be helpful for use with external tools that accept a sparse feature representation, like SVM-light.

I think I have a solution for that:

Connect a "Rename" node to the outport of the fingerprinter. Configure the "Rename" node to change the type of the fingerprint column to StringValue (select it in the respective combo box) - that will apply a toString() to the underlying bit set to output "101001011...".

If you then want to go further and convert the "101001011..." to something like "0 2 5 7 8" (indices of the bits that are set), you can use the "Java Snippet" node. Configure it to append a new column whose type is "String" and enter the following expression in the text area:

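// The fingerprint column, already converted to a "1010..." string by the Rename node
String toString = $fingerprint$;
StringBuilder out = new StringBuilder();
for (int i = 0; i < toString.length(); i++) {
  // Collect the zero-based index of every bit that is set
  if (toString.charAt(i) == '1') {
    out.append(i);
    out.append(' ');
  }
}
out.toString()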

If I understand you correctly, that is the sort of output you are after?

I do not think a node that converts the "101001011..." strings to a set(!) of columns containing the indices is the way to go because that would mean we need to deal with rows of different length.
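
For use outside of the Java Snippet node, the same conversion can be written as a small standalone class. This is only a sketch: the class and method names are invented here, and the SVM-light style line assumes the usual "label index:value ..." layout with 1-based, ascending feature indices, so please check it against the SVM-light documentation before relying on it.

```java
class SparseFingerprintSketch {

    /** Returns the zero-based indices of all '1' characters, separated by single spaces. */
    static String toSparseIndices(String bitString) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bitString.length(); i++) {
            if (bitString.charAt(i) == '1') {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(i);
            }
        }
        return out.toString();
    }

    /**
     * Builds one sparse line: the label followed by "index:1" for every set bit.
     * Assumption: the target tool expects 1-based indices in ascending order.
     */
    static String toSvmLightLine(String label, String bitString) {
        StringBuilder out = new StringBuilder(label);
        for (int i = 0; i < bitString.length(); i++) {
            if (bitString.charAt(i) == '1') {
                out.append(' ').append(i + 1).append(":1");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSparseIndices("101001011"));
        System.out.println(toSvmLightLine("+1", "101001011"));
    }
}
```

For the example string "101001011" the first call gives "0 2 5 7 8" and the second gives "+1 1:1 3:1 6:1 8:1 9:1". Unlike the snippet expression, this version does not leave a trailing space, which some external parsers are picky about.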

Great! It works and is exactly what I wanted :D. The Java Snippet looks very powerful. I’ll see if I can learn some Java syntax and use it for other stuff as well.

Thanks a lot!

Hi there,

This thread answered about the same questions I had. However, I would like to apply mining>clustering tools to the fingerprints as well. Is this currently possible?

Thanks,

Peter

Peter,

peterem wrote:

This thread answered about the same questions I had. However, I would like to apply mining>clustering tools to the fingerprints as well. Is this currently possible?

The current hierarchical clustering node can only cluster based on numerical values. However, we are currently developing a new node that is able to cluster on almost any type of data, but that is still somewhat experimental. It may be in the next major release.
So the answer is no, not yet.

Regards,

Thorsten

Dear Thorsten,

Thanks for your amazingly fast reply.
I must say that I'm quite impressed by KNIME.

Thanks,

Peter

Peter, I just started testing out KNIME and actually wanted it to fingerprint and then cluster the fingerprints as well.

I'm sure you have probably found a solution by now to create scores from fingerprints, but if you like, I have some Java code that does that.

wiswedel wrote:

So where can you use it? We have successfully used the generated fingerprints in conjunction with the Neighborgram node (additional plugin, see here: http://knime.org/download_extensions.html#neighbor): It uses the fingerprints (along with the tanimoto distance) to construct neighborgrams, which may help to identify groups of similar molecules. Some other implementations of learning algorithms/nodes may support fingerprints as descriptors in the future as well. Feel free to contribute.

How exactly do I implement this with the neighborgram? What should the input be? Fingerprints in StringValue column alongside the tanimoto scores matrix? Sorry for the large number of questions! I'm stopping now!

Omar

Hi Omar,

ohaq595 wrote:

How exactly do I implement this with the neighborgram?

I agree, it's not obvious to get that to work. Here's what you need to do:

1) Create a column containing bit vector cells, i.e. (binary! - hope that's what you have) fingerprints (see below)
2) Connect a "Universe Marker" node to it and define what we call universes
3) Connect it to the Neighborgram node

Regarding 1) We use BitVectorValues to represent (binary) fingerprints. You can use the bit vector generator node to construct those from different formats, for instance from a set of columns containing the 0s and 1s or from a StringValue containing the hex representation or the printed fingerprint ("01101010101"). I hope you have your data in one of these formats. We do not yet have a node that reads the "tanimoto scores matrix" (assuming that this is the dissimilarity matrix?). Supporting that sort of data is on our to-do list (among many other things), and it does also have high priority!
Please note that the representation of the BitVectorCells in the table view is a bit cryptic (it only prints the cardinality of the fingerprint) - but that's something which has been discussed earlier in this thread...
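
As an illustration of the two string formats mentioned for 1), here is a rough sketch that parses a printed fingerprint and a hex string into a plain java.util.BitSet. It is not the bit vector generator node's code: the names are hypothetical, and the hex bit order (bit 0 as the most significant bit of the first digit) is an assumption that may differ from what the node expects.

```java
import java.util.BitSet;

class BitVectorParseSketch {

    /** Parses a printed fingerprint such as "01101010101"; character i becomes bit i. */
    static BitSet fromBitString(String s) {
        BitSet bits = new BitSet(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '1') {
                bits.set(i);
            } else if (c != '0') {
                throw new IllegalArgumentException("Not a bit: '" + c + "' at position " + i);
            }
        }
        return bits;
    }

    /**
     * Parses a hex string, four bits per digit.
     * Assumption: the most significant bit of digit i becomes bit 4 * i.
     */
    static BitSet fromHexString(String hex) {
        BitSet bits = new BitSet(hex.length() * 4);
        for (int i = 0; i < hex.length(); i++) {
            int nibble = Character.digit(hex.charAt(i), 16);
            if (nibble < 0) {
                throw new IllegalArgumentException("Not a hex digit at position " + i);
            }
            for (int j = 0; j < 4; j++) {
                if ((nibble & (8 >> j)) != 0) {
                    bits.set(4 * i + j);
                }
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet printed = fromBitString("01101010101");
        System.out.println(printed + " cardinality=" + printed.cardinality());
        System.out.println(fromHexString("1300"));
    }
}
```

The printed fingerprint "01101010101" ends up with bits 1, 2, 4, 6, 8 and 10 set, so its cardinality (the number the table view would show) is 6.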

The "Universe Marker" node in bullet point 2) is used to annotate the column containing the BitVectorCell as a separate universe (my research project is on mining data in different descriptor spaces simultaneously; each descriptor space (e.g. fingerprint, vector of scalars, ...) is called a universe). In the dialog of the node, enter a name for the universe in the text field at the top and then move the fingerprint column into the include list.

The annotated output table of that node is then interpreted by the Neighborgram node (Point 3)). The dialog should now offer you to use that fingerprint to compute similarity values. (It should have the name of the universe displayed somewhere in the dialog.) Also pick "TANIMOTO" as distance measure and let it run.
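
For readers who have not met the Tanimoto measure before: on binary fingerprints it is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. Below is a minimal sketch with java.util.BitSet. Taking the distance as 1 minus the similarity, and treating two empty fingerprints as identical, are assumptions made here; the Neighborgram node may handle these details differently.

```java
import java.util.BitSet;

class TanimotoSketch {

    /** Tanimoto similarity: |A and B| / |A or B|, a value between 0 and 1. */
    static double similarity(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        int union = either.cardinality();
        if (union == 0) {
            return 1.0; // assumption: two empty fingerprints count as identical
        }
        return (double) common.cardinality() / union;
    }

    /** Assumed distance: 1 - similarity, so identical fingerprints are at distance 0. */
    static double distance(BitSet a, BitSet b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        for (int i : new int[] {0, 2, 5, 7, 8}) {
            a.set(i);
        }
        BitSet b = new BitSet();
        for (int i : new int[] {0, 2, 3, 8}) {
            b.set(i);
        }
        // 3 bits in common (0, 2, 8) out of 6 bits set in either fingerprint
        System.out.println(similarity(a, b));
        System.out.println(distance(a, b));
    }
}
```

In the example, the fingerprints share 3 of the 6 bits set in either one, so both the similarity and the distance come out as 0.5.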

The Neighborgram node as it is available right now is only a first prototype. It's still under development (and since it's closely related to my research interests it's under heavy development). However, it should give you an impression of what's in the data and the cluster identification and the brushing capabilities are implemented.

Please keep asking if you encounter any difficulties or problems.

Best regards
Bernd

Hi Omar,

I just started to do some coding outside of KNIME to create another type of input. Using the Joiner, I linked the structure data with my fingerprints. I actually started with Kier Hall descriptors, but I'm not too happy with the results, so I'm back to the fingerprints again.

However, I think the currently available clustering algorithms are not appropriate for this. I guess what we want is to compare the number of overlapping bits using Tanimoto. The current clustering algorithms seem to be calculating averages for a bit, so that's not what we are looking for.

Anyway, I'm sure your code would be a great help. I just started with Java programming because I decided to get more involved in KNIME, CDK and so on. I would really appreciate it if you could share it.

Thanks,

Peter

wiswedel wrote:

We do not have yet a node that reads the "tanimoto scores matrix" (assuming that this is the dissimilarity matrix?). Supporting that sort of data is on our to-do list (among many other things), and it does also have high priority!

Yeah - similarity or dissimilarity. It's a score from 0 to 1, higher being more similar. Actually, KNIME already kind of supports the matrix, I believe.

I input an N by N score matrix into the hierarchical clustering node, and it gave me the cluster tree. The matrix was symmetric about the diagonal, and it worked fine. It's supposed to, right? I haven't yet tried the hierarchical clustering node with an asymmetric matrix (where the scores for i,j and j,i need not be equal) or with just the upper triangle. Are those being considered?

wiswedel wrote:

Please note that the representation of the BitVectorCells in the table view is a bit cryptic (it only prints the cardinality of the fingerprint) - but that's something which has been discussed earlier in this thread...

Right. I read it earlier. Either way works for me now that you explained what it is.

Thanks for the help. The neighborgram is coming out well. I had to include a StringValue column in the Universe Marker node for the Neighborgram node to work properly. Now I'm just going to read your paper and figure out exactly what I'm looking at and how to interpret it! :)

Hi Peter,

peterem wrote:

However, I think the currently available clustering algorithms are not appropriate for this. I guess what we want is to compare the number of overlapping bits using Tanimoto. The current clustering algorithms seem to be calculating averages for a bit, so that's not what we are looking for.

Makes sense. Though my work just requires me to figure out if molecules are really similar or really dissimilar. Hierarchical clustering following tanimoto score generation works fine for me. Scores less than 0.3 and scores higher than 0.7 are supposed to be interesting according to various people. The clustering of molecules in the middle range is usually garbage and shouldn't make sense.

What averages? Averages of the tanimoto scores?

peterem wrote:

Anyway, I'm sure your code would be a great help. I just started with Java programming because I decided to get more involved in KNIME, CDK and so on. I would really appreciate it if you could share it.

Not a problem. Java is really easy to learn and fudge around with. I was going to set up a webpage for my scripts. That may not happen today though. Faster way is if you email me and I can send it over to you. omar dot haq @ gmail dot com

Hi Omar,

ohaq595 wrote:

Yeah - similarity or dissimilarity. It's a score from 0 to 1, higher being more similar. Actually, KNIME already kind of supports the matrix, I believe.

I input an N by N score matrix into the hierarchical clustering node, and it gave me the cluster tree. The matrix was symmetric about the diagonal, and it worked fine. It's supposed to, right?

The simple answer is no! The cluster tree that you have constructed is based on a distance measure on the vectors of similarity values, i.e. each row of your matrix is treated as a numeric vector describing one molecule. It does not use the individual values in the table as pairwise distances. (It's probably the Euclidean distance between all the similarity values.) So it's in some sense also a distance measure, but you would need to argue why it is a good one :wink: The hierarchical clustering node in the current version does not interpret similarity matrices, sorry.
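
The difference can be seen on a tiny, made-up 3 by 3 similarity matrix. The sketch below compares the distance one would read directly from the matrix (assumed here to be 1 minus the similarity) with the Euclidean distance between two rows, which is presumably what a clustering node computes when it treats each row as a numeric vector. All names and numbers are illustrative only.

```java
class SimilarityMatrixSketch {

    /** Distance read directly from a similarity matrix (assumption: 1 - similarity). */
    static double distanceFromSimilarity(double[][] s, int i, int j) {
        return 1.0 - s[i][j];
    }

    /** Euclidean distance between rows i and j, treating each row as a feature vector. */
    static double euclideanBetweenRows(double[][] m, int i, int j) {
        double sum = 0.0;
        for (int k = 0; k < m[i].length; k++) {
            double d = m[i][k] - m[j][k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up symmetric similarity matrix for three molecules
        double[][] s = {
            {1.0, 0.8, 0.2},
            {0.8, 1.0, 0.2},
            {0.2, 0.2, 1.0}
        };
        System.out.println("0 vs 1: direct=" + distanceFromSimilarity(s, 0, 1)
                + " rows=" + euclideanBetweenRows(s, 0, 1));
        System.out.println("0 vs 2: direct=" + distanceFromSimilarity(s, 0, 2)
                + " rows=" + euclideanBetweenRows(s, 0, 2));
    }
}
```

For molecules 0 and 1 the direct distance is 0.2 while the row-based one is about 0.28; for molecules 0 and 2 it is 0.8 versus about 1.28. The ordering happens to agree in this toy case, but the numbers (and therefore merge heights, and with less tidy data possibly the tree itself) are not the same.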

ohaq595 wrote:
I haven't yet tried the hierarchical clustering node with an asymmetric matrix (where the scores for i,j and j,i need not be equal) or with just the upper triangle. Are those being considered?

Not in the current version. But it's good to know that there is that sort of (strange) data once we start implementing support for similarity matrices.

Regards
Bernd

Hi Bernd,

wiswedel wrote:

The simple answer is no! The cluster tree that you have constructed is based on a distance measure on the vectors of similarity values. It does not use the individual values in the table. (It's probably the Euclidean distance between all the similarity values.) So it's in some sense also a distance measure, but you would need to argue why it is a good one :wink: The hierarchical clustering node in the current version does not interpret similarity matrices, sorry.

Ah! Thanks for the clarification! That's a great point. I wasn't sure what the node was doing. (Somehow my KNIME doesn't show me the node descriptions in the browser window. Will that description explain the ideal input to a node?)

I won't argue! :) Though in essence, my tree would still look the same, but the distance measure is not what I assumed, so I'm not looking at the correct output.

So what would the ideal input to the "hierarchical clustering node" look like? Just a simple list of distances?

ohaq595 wrote:
(Somehow my KNIME doesn't show me the node descriptions in the browser window. Will that description explain the ideal input to a node?)

Do you run Linux? We know that there is a problem with the help window if some Mozilla libraries are not installed or some environment variable is not properly set. Check out the FAQ on this.

Quote:

So what would the ideal input to the "hierarchical clustering node" look like? Just a simple list of distances?

The node calculates the distances itself. Ideally, you feed it a table whose rows contain numeric descriptions of the objects. It will then use these attributes and calculate the Euclidean distance on them (I didn't check it, but most probably you can configure it to use some other typical distance function).