Generating Bit Strings using SMARTs query node

I have a database of chemicals and want to evaluate each chemical against a list of SMARTs patterns also from a database.  The goal is to generate my own BitStrings for each chemical that relate to the SMARTs patterns.  I have successfully done this BUT I can only do it one SMARTs pattern at a time (attached workflow).

Ideally, the workflow would be to run each chemical from one database table through all the SMARTs patterns (ie: 1-10) from another database table and generate a concatenated bit-string like "1010101111".  Then do a database update for each chemical bit-string.  Each position in the bit-string means match(1) or non-match(0) relating to the SMARTs ID (1-10) in the database.

I've tried to use Chunk Loop Start on the SMARTs patterns but this doesn't allow the Bitstring to be looped back so that it can be updated (at a particlur bit position) using the String manipulation node.

I downloaded the example workflows for the Recursive loop nodes but have had limited success.  I must be missing something but this seems like a doable task for KNIME.  In essence, I need to do a nested FOR loop like this:

for x=1 to numChemicals
     for y=1 to numSMARTs
         run SMARTs query and generate BitString using String manipulation node
     end
end

My recommendation is to use the RDKit nodes instead.

The node which should deliver what you need is;

RDKit Substructure Counter (with Add Column with tags for matching queries selected).

 

You can then use the Column Aggregator node to Aggregate all these 0 and 1 columns together using Concatenate option in the Options tab, and make sure to clear out the Value Delimiter box.

Simon.

Thanks for the input Simon.  Your instructions were clear and concise. 
I modified the SQL query that retrieves the SMARTs to have a prefix:
SELECT concat("SubStruc-",FGroupName) AS FGroupName, SMARTS FROM tbl_ChemFuncGroups

This allowed me to select columns in the aggregator by wildcard(SubStruc*) avoiding the need to manually add each column as my list of SMARTs grows.

I didn't foresee the value of having "counts" makeup the bitstring as opposed to true/false.  The only problem arises when a count is greater than 9; the bit will be 2 chars (unless I convert to something like a 36-base  0-9-A-Z).

Do you have any suggestions on how to handle the possibility of a substructure count >9? For most of my substructures, they will not be present more than 9, but it's possible. 

I could also delimit the bitstring with a comma as this would be adaptable in my application.

-David

Hi, if you use a comma delimiter, in the column aggregator node, you could follow up your resultant combined column with a string replacer using RegEx to replace all values greater than 0 with a 1, setting the option as a pattern rather than the whole string. For example;

[2-9]          and then in replace box put a 1

then in a second string replacer node use; (note commas are deliberate here)

[10|11],       and then in replace put a 1,

You should now be left with just 0's and 1's,

 

if if you finally want the commas out you could use a third string replace node with , in the find box and replace box as empty. 

Simon. 

Thanks Simon.
Using string replacer node was a good idea.

I was able to use 1 string replacer node to capture counts 1-99 using the regex expression:

([1-9]{2}|[1-9][0]|[1-9])

1-9 in the 2nd digit OR
1-9 followed by a 0 OR
1-9

The order of this regex pattern matters; putting [1-9] at the beginning will negate the [1-9][0] condition.

Thanks!

David

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.