Exploitation of SDF as text to get distance and dihedral angles

avalery · June 15, 2022, 10:16am

Dear community,

I am trying to exploit the SDF structure of a “flagged library”. This library is composed of many scaffolds, each bearing 2 iodines that “flag” the exit vector positions.

I have 2 tasks on this library:
1) Get the distance between the 2 Iodine atoms.
2) Get the torsion angle between the 2 exit vectors.

I have managed to extract the coordinates of my 2 iodine atoms and compute the distance with a not very elegant succession of string manipulation nodes. So “1)” is done.

Regarding 2), I would really appreciate some help. Each molecule has the following structure: a first block (here in bold) of atomic coordinates and a second block describing the bonds. Each row of the first block has the X, Y and Z coordinates, then the atomic element, followed by properties of that element that I do not need for this application (mostly 0s and some 1s). Each row of the second block is composed of 4 numbers, of which only the first 2 are important: they describe which atoms are bonded together, using the atom numbering, which is determined by the row number in the first block.

I need to:
=> number the rows inside that string cell
=> identify the rows in which the 2 iodines are located (rows 4 and 27 in the example below)
=> go to the bond description block and find, within the first 2 numbers of each row, either 4 or 27
=> extract the other “atom number” of each pair (here 1 and 22, in bold as well)
=> use these 2 new numbers to extract the coordinates of the corresponding atoms in the atom block

The output should be essentially 12 new columns, Iodine 1 X, Iodine 1 Y, Iodine 1 Z, Partner 1 X, Partner 1 Y, Partner 1 Z, Iodine 2 X, …, Partner 2 Z.
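
A minimal sketch of these steps in plain Python, assuming the cell content is whitespace-separated exactly as in the example below (real SDF files are fixed-width, so neighbouring columns can run together for large coordinates or atom numbers above 99) and that each iodine has exactly one bond. The function names `parse_ctab` and `iodine_exit_vectors` and the four-atom `EXAMPLE` string are made up for illustration; the toy fragment reuses four coordinate rows from the example but renumbers them and invents the bonds.

```python
def parse_ctab(text):
    """Split the text into atom rows (x, y, z, element) and bond pairs."""
    atoms, bonds = [], []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[3].isalpha():
            # Atom row: three coordinates followed by the element symbol.
            atoms.append((float(parts[0]), float(parts[1]), float(parts[2]), parts[3]))
        elif len(parts) == 4 and all(p.isdigit() for p in parts):
            # Bond row: the first two numbers are 1-based atom row numbers.
            bonds.append((int(parts[0]), int(parts[1])))
    return atoms, bonds


def iodine_exit_vectors(text):
    """For every iodine, return its row number, its partner and both coordinates."""
    atoms, bonds = parse_ctab(text)
    rows = []
    for number, (x, y, z, element) in enumerate(atoms, start=1):
        if element != "I":
            continue
        # The partner is the other atom number in the bond row that mentions this iodine.
        partner = next(b if a == number else a for a, b in bonds if number in (a, b))
        px, py, pz, _ = atoms[partner - 1]
        rows.append({"iodine": number, "partner": partner,
                     "iodine_xyz": (x, y, z), "partner_xyz": (px, py, pz)})
    return rows


# Hypothetical toy fragment: C-I and N-I pairs only, renumbered 1 to 4.
EXAMPLE = """\
-3.2355 0.3926 -0.4140 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.2674 2.2798 0.5824 I 0 0 0 0 0 0 0 0 0 0 0 0
2.9531 -0.0312 1.1018 N 0 0 0 0 0 0 0 0 0 0 0 0
4.9233 -0.1196 0.8764 I 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
1 3 1 0
3 4 1 0
M END
"""

for row in iodine_exit_vectors(EXAMPLE):
    print(row)
```

The 12 output columns would then be the `iodine_xyz` and `partner_xyz` values of the two returned rows.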

-3.2355 0.3926 -0.4140 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.6356 0.5275 -1.4262 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.9300 -0.2818 0.1017 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.2674 2.2798 0.5824 I 0 0 0 0 0 0 0 0 0 0 0 0
-1.8583 -0.2889 -0.5085 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.8599 0.4614 -1.4096 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.2448 -0.6855 0.8519 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.0636 -1.2428 -1.0191 H 0 0 0 0 0 0 0 0 0 0 0 0
0.5176 -0.1875 -1.3143 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.2118 0.4293 -2.4476 H 0 0 0 0 0 0 0 0 0 0 0 0
-0.7797 1.5182 -1.1341 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0077 0.1174 1.1940 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9692 -1.7480 0.8102 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.9802 -0.5989 1.6596 H 0 0 0 0 0 0 0 0 0 0 0 0
1.0864 -0.0498 0.1058 C 0 0 0 0 0 0 0 0 0 0 0 0
1.1897 0.2611 -2.0546 H 0 0 0 0 0 0 0 0 0 0 0 0
0.4273 -1.2487 -1.5805 H 0 0 0 0 0 0 0 0 0 0 0 0
-0.2544 1.1767 1.2948 H 0 0 0 0 0 0 0 0 0 0 0 0
0.3840 -0.1982 2.1746 H 0 0 0 0 0 0 0 0 0 0 0 0
2.2664 0.9663 0.2513 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1938 -1.1047 0.4241 C 0 0 0 0 0 0 0 0 0 0 0 0
2.9531 -0.0312 1.1018 N 0 0 0 0 0 0 0 0 0 0 0 0
2.0249 1.9064 0.7589 H 0 0 0 0 0 0 0 0 0 0 0 0
2.7648 1.2133 -0.6959 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6647 -1.5355 -0.4704 H 0 0 0 0 0 0 0 0 0 0 0 0
1.8868 -1.9290 1.0766 H 0 0 0 0 0 0 0 0 0 0 0 0
4.9233 -0.1196 0.8764 I 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
1 3 1 0
1 4 1 0
1 5 1 0
5 6 1 0
7 5 1 0
5 8 1 0
6 9 1 0
6 10 1 0
6 11 1 0
12 7 1 0
7 13 1 0
7 14 1 0
9 15 1 0
9 16 1 0
9 17 1 0
15 12 1 0
12 18 1 0
12 19 1 0
15 20 1 0
21 15 1 0
20 22 1 0
20 23 1 0
20 24 1 0
22 21 1 0
21 25 1 0
21 26 1 0
22 27 1 0
M END

$$$$

I would really appreciate any help! Also, if someone knows how to match any letter in the String Manipulation node, that would also allow me to finish the project: I could separate the atom block from the bond block by doing a backward search for the last letter, and then get the “row numbers” with a \n cell splitter.

Cheers,
Alain

badger101 · June 15, 2022, 10:59am

Hi @avalery can you elaborate more on those lines you wrote?

Also, I have 2 additional queries:

The two blocks; are they inside the same table in KNIME? Or do they belong to different tables? Perhaps you can upload the dataset.
You mentioned ‘numbering the row inside that string cell’. Not sure which cell you’re referring to, which brings me back to my first question about how the dataset is structured in a tabular form.

Thanks!

avalery · June 15, 2022, 11:27am

Hi @badger101,

Sure. Essentially you have 4 atoms involved in the 2 exit vectors. In this example it is one C-I bond and one N-I bond.

This means that you have 4 “lines of interest” in the atom block. Here, the lines that interest us are 1 and 4 as the first pair and 22 and 27 as the second pair.
1&4
-3.2355 0.3926 -0.4140 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.2674 2.2798 0.5824 I 0 0 0 0 0 0 0 0 0 0 0 0

22&27
2.9531 -0.0312 1.1018 N 0 0 0 0 0 0 0 0 0 0 0 0
4.9233 -0.1196 0.8764 I 0 0 0 0 0 0 0 0 0 0 0 0

These line numbers are found by:

1) Searching for the lines containing “I” (capital letter i). Here 4 and 27.
2) Figuring out which atoms they are connected to. Here 1 and 22 (these are the 2 “new numbers” I was referring to).

Once you have figured out which atoms are connected to the iodines, you need to create the 2 vectors based on the two pairs of atoms.

So, with Partner 1 being on line 1 of the atom block, and Iodine 1 being on line 4 of the atom block and Partner 2 being on line 22 and Iodine 2 being on line 27:

Vector 1 = sqrt((Iodine 1 X - Partner 1 X)^2+(Iodine 1 Y - Partner 1 Y)^2+(Iodine 1 Z - Partner 1 Z)^2)
Vector 2 = sqrt((Iodine 2 X - Partner 2 X)^2+(Iodine 2 Y - Partner 2 Y)^2+(Iodine 2 Z - Partner 2 Z)^2)
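
Note that the two expressions above evaluate to the length of each exit vector (the bond length); the vector itself is the component-wise difference of the two positions. A small Python sketch with the four coordinate rows quoted in this post; the helper names `exit_vector` and `length` are made up for illustration.

```python
import math


def exit_vector(partner, iodine):
    """Component-wise difference pointing from the partner atom to the iodine."""
    return tuple(i - p for p, i in zip(partner, iodine))


def length(vector):
    """Euclidean length, i.e. the sqrt(...) expression written out above."""
    return math.sqrt(sum(c * c for c in vector))


partner_1 = (-3.2355, 0.3926, -0.4140)  # row 1, C
iodine_1 = (-3.2674, 2.2798, 0.5824)    # row 4, I
partner_2 = (2.9531, -0.0312, 1.1018)   # row 22, N
iodine_2 = (4.9233, -0.1196, 0.8764)    # row 27, I

vector_1 = exit_vector(partner_1, iodine_1)
vector_2 = exit_vector(partner_2, iodine_2)
print(vector_1, length(vector_1))
print(vector_2, length(vector_2))
print(math.dist(iodine_1, iodine_2))  # task 1: iodine-iodine distance
```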

To answer your question: yes, the two blocks are even in one single cell, which makes the “row numbering” part a hassle.
I have attached a screenshot to show my issue ^^

Thank you !

Best,
Alain

badger101 · June 15, 2022, 11:34am

Okay, I understand it now. I’ve never worked with SDF Nodes before, but I’ll have a look at what I can do.

avalery · June 15, 2022, 11:39am

Thanks! By any chance, do you know of any way to “search for any letter” in the string manipulation node?
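
For reference, "any letter" is usually expressed with the regular-expression character class `[A-Za-z]`. The sketch below uses Python's `re` module to separate atom rows from bond rows; the names `ANY_LETTER`, `ATOM_ROW` and `split_blocks` are made up for illustration, and the input is assumed to be whitespace-separated as in the example above. KNIME's regex-based functions use Java-style regular expressions, where the same character class should behave the same way, but which String Manipulation function to call it from is an assumption to check against the node description.

```python
import re

ANY_LETTER = re.compile(r"[A-Za-z]")
# Atom row: three decimal numbers, then an element symbol of one or two letters.
ATOM_ROW = re.compile(
    r"^\s*(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+([A-Z][a-z]?)\b"
)


def split_blocks(text):
    """Return (atom_rows, bond_rows) as lists of text lines."""
    atom_rows, bond_rows = [], []
    for line in text.splitlines():
        if ATOM_ROW.match(line):
            atom_rows.append(line)
        elif line.strip() and not ANY_LETTER.search(line) and len(line.split()) == 4:
            # Four numbers and no letter at all: treated as a bond row.
            bond_rows.append(line)
    return atom_rows, bond_rows


sample = (
    "-3.2355 0.3926 -0.4140 C 0 0 0\n"
    "4.9233 -0.1196 0.8764 I 0 0 0\n"
    "1 2 1 0\n"
    "M END\n"
    "$$$$"
)
atom_rows, bond_rows = split_blocks(sample)
# Row numbers are simply the 1-based positions inside the atom block.
for number, row in enumerate(atom_rows, start=1):
    print(number, row)
```

Anchoring the pattern on the three coordinates rather than on a bare letter keeps lines such as "M END" out of the atom block.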

badger101 · June 15, 2022, 12:30pm

Hi @avalery I’ve tried using an SDF file from one of the workflows in the hub, but it seems like your bond block has a different format. Here’s what I have (after converting it to a workable table):

bond block

Because of the different formats, it wouldn’t matter what I do since my workflow won’t be applicable to your dataset.

Unless you upload your dataset (or even better, the workflow itself), I can do nothing further.

avalery · June 15, 2022, 12:56pm

Hi @badger101,

My SDFs are generated with RDKit; I guess that explains the different format. I am sorry, but I am not allowed to share the workflow, and I cannot attach an SDF as the website does not allow it. I saved it as text if you would like to take a look.

Cheers

To Share.txt (1.4 MB)

badger101 · June 15, 2022, 1:01pm

Unfortunately, a .txt file will appear differently when being read in KNIME. It’s a totally different thing. I’m afraid this is where our paths diverge. Wish you all the best & hope you find a solution!

Cheers!

avalery · June 15, 2022, 1:04pm

Thank you for your time @badger101

Cheers,
Alain

elsamuel · June 15, 2022, 6:34pm

Hi @avalery

My solution to this problem would be to use some RDKit functions in a Python scripting node.

First I extract the CTAB from the SD file. Note that “Extract CTab blocks” is selected.

Then working on each molecule individually, I parse the CTAB block to find the atom indices of the iodine atoms and the connected carbon atoms, and convert the atom indices to flow variables:

Then I:

1) get the coordinates of the I atoms using the RDKit function GetAtomPosition, then use the NumPy function linalg.norm to get the Euclidean distance
2) get the dihedral angles using the RDKit function GetDihedralDeg
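
For readers without RDKit at hand, the torsion step can be sketched in plain Python. This is the standard four-point dihedral formula, not RDKit's code, so the sign convention may differ from what GetDihedralDeg returns; the helper names are made up for illustration, and the atom order iodine 1, partner 1, partner 2, iodine 2 is an assumption about how the torsion between the two exit vectors is defined. The coordinates are the four rows quoted earlier in the thread.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p1, p2, p3, p4):
    """Signed torsion angle in degrees for the chain p1-p2-p3-p4."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n2 = cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))


iodine_1 = (-3.2674, 2.2798, 0.5824)    # row 4
partner_1 = (-3.2355, 0.3926, -0.4140)  # row 1
partner_2 = (2.9531, -0.0312, 1.1018)   # row 22
iodine_2 = (4.9233, -0.1196, 0.8764)    # row 27

torsion = dihedral_deg(iodine_1, partner_1, partner_2, iodine_2)
print(torsion)
```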

This is the completed workflow:

Here are the results from the first 10 molecules:

I confirmed the results for a few of them using Discovery Studio and I think it’s working as expected. This is the first molecule:

I’ve uploaded a copy of the workflow here:

For this to work you need Conda installed and configured correctly on your local machine.

avalery · June 16, 2022, 7:22am

Hello @elsamuel,

Thank you so much! I will try to use it and set up anaconda!

Yesterday I managed to isolate the coordinates of all the involved atoms, but I am using so many nodes … This solution seems so much more elegant!

Best
Alain

avalery · June 16, 2022, 2:04pm

It works perfectly and looking at the workflow is a great learning experience:

The Cell Splitter “as list” followed by Ungroup is also very nice! I was doing “as new column” then Unpivot.

The Table Manipulator node is awesome. I didn’t know it existed; I would always use Column Filter then Column Rename.

Thank you!

elsamuel · June 16, 2022, 2:37pm

@avalery I’m glad it helped!
