I have a database of chemicals and want to evaluate each chemical against a list of SMARTS patterns, also from a database. The goal is to generate my own bit-strings for each chemical that relate to the SMARTS patterns. I have successfully done this, BUT I can only do it one SMARTS pattern at a time (attached workflow).
Ideally, the workflow would run each chemical from one database table through all the SMARTS patterns (i.e. 1-10) from another database table and generate a concatenated bit-string like "1010101111", then do a database update for each chemical's bit-string. Each position in the bit-string means match (1) or non-match (0) for the corresponding SMARTS ID (1-10) in the database.
I've tried to use Chunk Loop Start on the SMARTS patterns, but this doesn't allow the bit-string to be looped back so that it can be updated (at a particular bit position) using the String Manipulation node.
I downloaded the example workflows for the Recursive Loop nodes but have had limited success. I must be missing something, because this seems like a doable task for KNIME. In essence, I need to do a nested FOR loop like this:
for x = 1 to numChemicals
    for y = 1 to numSMARTs
        run SMARTS query and generate BitString using String Manipulation node
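For readers who think in code, the nested loop above might look roughly like the Python sketch below. Everything in it is an assumption for illustration: build_bitstring and toy_matches are hypothetical names, the sample chemicals and patterns are made up, and toy_matches is a plain substring test standing in for a real SMARTS matcher (it is NOT substructure matching).

```python
from typing import Callable, Iterable


def build_bitstring(
    chemical: str,
    patterns: Iterable[str],
    matches: Callable[[str, str], bool],
) -> str:
    """Inner loop: one '1' or '0' per pattern, in pattern order."""
    return "".join("1" if matches(chemical, p) else "0" for p in patterns)


def toy_matches(chemical: str, pattern: str) -> bool:
    # Placeholder only: a substring test, NOT real SMARTS matching.
    # Swap in a cheminformatics toolkit call for actual use.
    return pattern in chemical


# Hypothetical sample data standing in for the two database tables.
chemicals = {"ethanol": "CCO", "acetic acid": "CC(=O)O", "benzene": "c1ccccc1"}
patterns = ["O", "C(=O)O", "c1ccccc1"]

# Outer loop: one bit-string per chemical.
bitstrings = {
    name: build_bitstring(structure, patterns, toy_matches)
    for name, structure in chemicals.items()
}

for name, bits in bitstrings.items():
    print(name, bits)
```

Because the position in the string is tied to the pattern order, the pattern list has to be retrieved in a stable order (for example, sorted by ID) every time.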
My recommendation is to use the RDKit nodes instead.
The node that should deliver what you need is:
RDKit Substructure Counter (with "Add Column with tags for matching queries" selected).
You can then use the Column Aggregator node to combine all of these 0 and 1 columns: choose the Concatenate option in the Options tab, and make sure to clear out the Value Delimiter box.
Thanks for the input Simon. Your instructions were clear and concise.
I modified the SQL query that retrieves the SMARTS patterns to add a prefix to each name:
SELECT concat("SubStruc-",FGroupName) AS FGroupName, SMARTS FROM tbl_ChemFuncGroups
This allowed me to select columns in the aggregator by wildcard (SubStruc*), avoiding the need to manually add each column as my list of SMARTS patterns grows.
I didn't foresee the value of having counts make up the bit-string as opposed to true/false. The only problem arises when a count is greater than 9: the entry will then be 2 characters (unless I convert to something like base 36, using 0-9 and A-Z).
Do you have any suggestions on how to handle the possibility of a substructure count greater than 9? Most of my substructures will not be present more than 9 times, but it's possible.
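As a rough illustration of what the wildcard selection plus concatenation does to one row, here is a small Python sketch. The aggregate function, the row contents and the column names other than the SubStruc- prefix are hypothetical; this is not how the Column Aggregator node is implemented, only a model of its effect.

```python
from fnmatch import fnmatchcase


def aggregate(row: dict, wildcard: str = "SubStruc*", delimiter: str = "") -> str:
    """Concatenate the values of every column whose name matches the wildcard."""
    # Column order is kept as-is, so each position maps to one pattern.
    selected = [name for name in row if fnmatchcase(name, wildcard)]
    return delimiter.join(str(row[name]) for name in selected)


# Hypothetical row: one chemical with three substructure count columns.
row = {
    "ChemID": 42,
    "SubStruc-Hydroxyl": 1,
    "SubStruc-Carbonyl": 0,
    "SubStruc-Amine": 2,
}

print(aggregate(row))                 # empty delimiter
print(aggregate(row, delimiter=","))  # comma delimiter
```

New SubStruc- columns are picked up automatically, which is the point of the prefix; columns without the prefix (ChemID here) are ignored.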
I could also delimit the bitstring with a comma as this would be adaptable in my application.
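To compare the two options mentioned here, a minimal Python sketch follows. The function names and the sample counts are made up for illustration. The base-36 variant keeps one character per pattern but only works for counts 0-35; the comma-delimited variant has no upper limit but gives up fixed positions.

```python
import string

# 0-9 followed by A-Z: 36 symbols, one character per count.
DIGITS = string.digits + string.ascii_uppercase


def encode_base36(counts):
    """One character per count; only valid for counts 0..35."""
    chars = []
    for count in counts:
        if not 0 <= count < 36:
            raise ValueError(f"count {count} does not fit in one base-36 digit")
        chars.append(DIGITS[count])
    return "".join(chars)


def decode_base36(encoded):
    """Inverse of encode_base36."""
    return [int(ch, 36) for ch in encoded]


def encode_delimited(counts):
    """Comma-delimited counts; no upper limit, but variable width."""
    return ",".join(str(count) for count in counts)


counts = [0, 3, 12, 1, 35]  # hypothetical substructure counts
print(encode_base36(counts))
print(encode_delimited(counts))
```

Which one is preferable depends on whether the consuming application needs to index into the string by position or can split on a delimiter.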
Hi, if you use a comma delimiter in the Column Aggregator node, you could follow the combined column with a String Replacer node that uses a regex to replace every value greater than 0 with a 1. Set the option so the pattern is matched within the string rather than against the whole string. For example:
[2-9] and then in the replace box put a 1
Then in a second String Replacer node use (note the commas are deliberate here):
(10|11), and then in the replace box put 1,
You should now be left with just 0s and 1s.
If you finally want the commas out, you could use a third String Replacer node with , in the find box and the replace box left empty.
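The three replacement steps can be tried outside KNIME with a short Python sketch. It assumes counts below 100 and writes the second pattern as the group (10|11), so that it matches the two-character values left over after the first step; counts_to_bits is a hypothetical helper name, and Python's re module stands in for the String Replacer node.

```python
import re


def counts_to_bits(delimited: str) -> str:
    """Apply the three replacements in order to a comma-delimited count string."""
    # Step 1: any digit 2-9 becomes 1, so every value is now 0, 1, 10 or 11.
    step1 = re.sub(r"[2-9]", "1", delimited)
    # Step 2: a two-character value followed by a comma becomes "1,".
    step2 = re.sub(r"(10|11),", "1,", step1)
    # Step 3: drop the commas.
    return step2.replace(",", "")


print(counts_to_bits("0,3,12,1,0"))
print(counts_to_bits("0,10,25,0"))
```

One thing to watch: step 2 relies on a trailing comma, so a two-digit count in the very last position is not collapsed (for example, "0,3,12" comes out as four characters instead of three). Appending a delimiter to the end before step 2 is one way around that.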
Using the String Replacer node was a good idea.
I was able to use a single String Replacer node to capture counts 1-99 using the regex:
1-9 in the 2nd digit OR
1-9 followed by a 0 OR
The order of the alternatives in this regex matters; putting the single-digit [1-9] alternative at the beginning means it matches first, which negates the two-digit conditions.
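The effect of alternative order can be seen in a small Python sketch. The exact expression is not reproduced above, so the two patterns here are only a guess that fits the description (two-digit alternatives plus a single-digit [1-9]); the names are hypothetical, and Python's re module is used as a stand-in for the String Replacer node. Like many backtracking regex engines, it tries alternatives from left to right.

```python
import re

# Assumed pattern: two-digit alternatives first, single digit last.
TWO_DIGIT_FIRST = re.compile(r"[0-9][1-9]|[1-9]0|[1-9]")
# Same alternatives with the single-digit case moved to the front.
ONE_DIGIT_FIRST = re.compile(r"[1-9]|[0-9][1-9]|[1-9]0")


def collapse(pattern: re.Pattern, delimited: str) -> str:
    """Replace every match with a single 1."""
    return pattern.sub("1", delimited)


line = "0,7,10,12,99"  # hypothetical comma-delimited counts

good = collapse(TWO_DIGIT_FIRST, line)
bad = collapse(ONE_DIGIT_FIRST, line)
print(good)
print(bad)
```

With the single-digit alternative first, each digit of a two-digit count is consumed on its own, so 10 stays as 10 and 12 becomes 11 instead of both collapsing to 1.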