double-looping to create multi-fusion similarity maps

weekiang · April 15, 2012, 7:25am

I am trying to re-create multi-fusion similarity maps (see link below for details).

http://onlinelibrary.wiley.com/doi/10.1111/j.1747-0285.2007.00579.x/abstract

The method requires each column in the query compound library to be compared with every column in a reference compound library before moving on to the next query column. Each library has about 10 thousand columns, but the two column counts are not exactly the same. Each column contains 1000 binary fingerprint keys in separate rows. A Tanimoto score is calculated for each comparison, which means that a loop-within-a-loop construct is necessary.

I've tried using the "Column List Loop Start" node to loop through each column in the query compound library and feed its output into the input port of the "Fingerprint Similarity" node (Erl Wood Cheminformatics). In the "Fingerprint Similarity" node, I set the flow variable "columnName" to "currentColumnName" and ticked the "Multi-query fusion" option. The reference compound library was fed into the "Fingerprint Similarity" node without any loop node attached to it.

As a result, the workflow loops through every column in the query compound library but only compares each one with the first column of the reference compound library. Even then, the loop ended with an error message: Loop End (Column Append) Execute failed: Java heap space. I also tried attaching a second "Column List Loop Start" node to the reference compound library, but it only looped in tandem with the first one. The "Multi-query fusion" option did not seem to have any effect either.

Here are my specific questions:

How do I create the double-looping structure so that every column in the query compound library is compared pairwise with every column in the reference compound library?
How do I collect the Tanimoto scores from all iterations into one output file instead of getting just one line containing a single Tanimoto score?
How do I use the "Multi-query fusion" option correctly?

Thanks!

mfsmap.png

BJFR · April 15, 2012, 7:45pm

weekiang,

Can you post an extract from your two CSV files? It could be of some help.

B

weekiang · April 16, 2012, 5:05am

This is a truncated version of the 2 csv files.

Query compound library.csv

Name,MACCSFP1,MACCSFP10,MACCSFP100

1,0,0,0

18,0,0,0

Reference compound library.csv

Name,MACCSFP1,MACCSFP10,MACCSFP100

781,0,1,0

13835,0,1,0

The actual CSV files each contain about 1000 binary fingerprint columns and about 10000 rows. The two files have the same number of columns but different numbers of rows. They are transposed later in the workflow, giving about 10000 columns and 1000 rows, before being fed into the Fingerprint Similarity node.
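For readers who want to check results outside KNIME, the double loop described in this thread can be sketched in plain Python. This is only a minimal sketch under stated assumptions: it reads the untransposed layout shown above (one compound per row, a Name column, then 0/1 key columns), the helper names `read_fingerprints`, `tanimoto` and `pairwise_scores` are hypothetical, and the toy data is invented. The mean and max fusion at the end is an assumption about what "fusion" means here and is not confirmed to match the "Multi-query fusion" option of the node.

```python
import csv
import io


def read_fingerprints(handle):
    """Read a CSV with a Name column followed by 0/1 key columns.

    Returns {name: int}, where each int packs one compound's keys as bits.
    """
    reader = csv.reader(handle)
    next(reader)  # skip the header row (Name, MACCSFP1, ...)
    fingerprints = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines between records
        name, keys = row[0], row[1:]
        fingerprints[name] = int("".join(k.strip() for k in keys), 2)
    return fingerprints


def tanimoto(a, b):
    """Tanimoto score of two bit-packed fingerprints; 0.0 if both are empty."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def pairwise_scores(query, reference):
    """Outer loop over query compounds, inner loop over reference compounds."""
    for q_name, q_fp in query.items():
        for r_name, r_fp in reference.items():
            yield q_name, r_name, tanimoto(q_fp, r_fp)


# Toy data, invented for illustration (not the real libraries).
query = read_fingerprints(
    io.StringIO("Name,K1,K2,K3,K4\nq1,1,1,0,0\nq2,0,1,1,0\n"))
reference = read_fingerprints(
    io.StringIO("Name,K1,K2,K3,K4\nr1,1,1,0,0\nr2,0,0,1,1\nr3,1,1,1,0\n"))

# One (query, reference, score) row per pair, collected in a single list.
rows = list(pairwise_scores(query, reference))

# Assumed fusion rules: mean and max score of each query compound
# over the whole reference set.
by_query = {}
for q_name, _, score in rows:
    by_query.setdefault(q_name, []).append(score)
fused = {q: (sum(s) / len(s), max(s)) for q, s in by_query.items()}
```

The `rows` list can be written out with `csv.writer` to get all scores in one file. Note that 10000 x 10000 pairs is 100 million scores, so streaming the generator straight to disk is likely safer than building the list in memory.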

BJFR · April 26, 2012, 6:26pm

hi,

Not being an expert in the bit vector nodes, I did not pursue your approach. However, I have designed a short workflow that can do what you need (I hope so anyway!).

Before anything else, I created a query table and a reference table, each containing IDs, SMILES, FP1, FP2 and FP3.

The attached workflow does the following:

create the REF x QUERY matrix (in our case 9x4=36)
compute a PAIRS id by combining the two ids (36 PAIRS)
unpivot each table so that only one FP column remains
compute the FP type (to avoid mixing FP types when the Tanimoto is calculated)
join the tables using both PAIRS and FP type
compute the Tanimoto for each pair within each FP type (108 lines)
unpivot the final table so you get PAIRS, ID que and ID ref as well as the Tanimoto for each FP type.
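The steps above can be approximated in plain Python. This is a rough sketch and not the attached workflow itself: it assumes that each FP column holds one whole fingerprint as a bit string, the table contents are invented, the names `unpivot`, `tanimoto`, `id_que` and `id_ref` are hypothetical, and the last step is read here as a reshape to one row per pair.

```python
from itertools import product


def tanimoto(a, b):
    """Tanimoto score of two equal-length bit strings such as '1100'."""
    x, y = int(a, 2), int(b, 2)
    union = bin(x | y).count("1")
    return bin(x & y).count("1") / union if union else 0.0


def unpivot(table):
    """Wide {id: {fp_type: bits}} -> long {(id, fp_type): bits}."""
    return {(cid, fp_type): bits
            for cid, fps in table.items()
            for fp_type, bits in fps.items()}


# Toy tables, invented for illustration (2 queries, 3 references).
query = {
    "Q1": {"FP1": "1100", "FP2": "1010", "FP3": "0001"},
    "Q2": {"FP1": "0110", "FP2": "1111", "FP3": "0011"},
}
reference = {
    "R1": {"FP1": "1100", "FP2": "0101", "FP3": "0001"},
    "R2": {"FP1": "0011", "FP2": "1010", "FP3": "0111"},
    "R3": {"FP1": "1110", "FP2": "1111", "FP3": "1000"},
}
fp_types = ["FP1", "FP2", "FP3"]

# Steps 1-2: REF x QUERY matrix and a pair id built from both ids.
pairs = [(f"{q}_{r}", q, r) for q, r in product(query, reference)]

# Steps 3-4: unpivot both tables to one fingerprint per (id, FP type) row.
query_long, reference_long = unpivot(query), unpivot(reference)

# Steps 5-6: join on pair and FP type, then score each resulting line.
long_scores = [
    (pair, q, r, t, tanimoto(query_long[q, t], reference_long[r, t]))
    for pair, q, r in pairs
    for t in fp_types
]

# Step 7: reshape to one row per pair with one score column per FP type.
wide = {}
for pair, q, r, t, score in long_scores:
    wide.setdefault(pair, {"id_que": q, "id_ref": r})[t] = score
```

With 2 query and 3 reference compounds this gives 6 pairs and 18 scored lines, the same bookkeeping as the 36 pairs and 108 lines in the workflow. Keying the join on the FP type is what prevents scores from being computed across different fingerprint types.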

Hopefully this gets you one step further. You will probably need to optimize it and adapt it to your needs.

regards.

Bruno