Dictionary function

Hi there,

Just thought I would share with the community another bit of Java that I have written, which I think is pretty useful. I have been looking at KNIME as an ETL (Extraction, Transformation, Loading) tool, and I have come across situations where the input data has to be normalised to my naming standard. To do this I created a Java snippet that looks up each input value against a set of dictionary terms and then applies my naming standard. The following code example shows the normalisation of a series of chemical salt forms.

The code I used is below:

// get querystring from column
String qStr = $salt_ID$;

String dStr = "";
String outStr = "";
int sPos = 0;
int index = 0;

// 2D array to hold dictionary terms
// Dictionary entries are held in the following format
// {"NORMALISED_TERM","!search term!search term!","SMILES"}

String[][] dictArr = {
{"NO_SALT","",""},
{"FORMATE","!Formic Acid!methanoic acid!Methanoic acid!","O=CO"},
{"CITRATE","!Citric Acid!","C(C(=O)O)C(CC(=O)O)(C(=O)O)O"},
{"TFA","!Trifluoroacetic acid!Trifluoroethanoic acid!Perfluoroacetic acid!tfa!","FC(F)(F)C(O)=O"},
{"MALEATE","!Maleic acid!","OC(=O)C=CC(=O)O"},
{"HCL","!HCl!H C l!",""},
{"HBR","!HBr!H B r!",""},
{"OXALATE","!O x a l i c a c i d!Oxalic acid!ethanedioic acid!","OC(=O)C(O)=O"},
{"TETRAFLUOROBORATE","!Tetrafluoroborate!",""},
{"FUMARATE","!F u m a r i c a c i d!Fumaric acid!","OC(=O)C=CC(=O)O"},
{"ACETATE","!Acetic acid!Ethanoic acid!","CC(=O)O"},
{"TARTRATE","!L - ( + ) - T a r t a r i c a c i d!L-(+)-Tartaric acid!2,3-dihydroxybutanedioic acid!","C(C(C(=O)O)O)(C(=O)O)O"},
{"MALATE","!Malic acid!hydroxybutanedioic acid!",""}
};

// remove any leading and trailing spaces from query string
qStr = qStr.trim();

// Normalise query string to lowercase
qStr = qStr.toLowerCase();

// remove any white spaces from query string
qStr = qStr.replace(" ", "");

// add delimiter character to each side of query string
qStr = "!" + qStr + "!";

// loop through dictionary array
for (int i = 0; i < dictArr.length; i++)

{
// get dictionary search string
dStr = dictArr [i] [1];

//remove any leading and trailing whitespaces
dStr = dStr.trim();

//normalise dictionary search string to lowercase
dStr = dStr.toLowerCase();

//remove any whitespaces from dictionary search string
dStr = dStr.replace(" ", "");

//find position of query string (qStr) in dictionary string (dStr)
sPos = dStr.indexOf(qStr);

// indexOf returns -1 when there is no match; a match on the first
// term of a dictionary string is at position 0, so test for >= 0
if (sPos >= 0 ){

//found dictionary entry

index = i;

}
};

if(index==0){

// index is still 0 (the NO_SALT row, which has no search terms and so never matches)
// no dictionary entry found
outStr = "No Dictionary entry found";

}
else
{

// dictionary entry found, return normalised DICTIONARY term
// normalised dictionary term is held in the array first element [0]
// change [0] for different dictionary items

outStr = dictArr [index] [0];

}

// hand the result back as the value of the new column
return outStr;


In the above example, I am returning normalised compound salt codes for a number of chemical entities. Just by changing the order of the 2D array elements, the same code can be used to identify salt forms from their respective SMILES strings, e.g.

{"",""}

or return the SMILES string for the named salt, e.g.

{"",""}

etc.
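
As a rough sketch of the idea above, the lookup could also be wrapped in a helper that is told which column to search and which column to return, instead of reordering the array. The class name SaltDictionary, the method lookup, the column constants and the cut-down three-row dictionary are all my own illustration and not part of the snippet above. It is plain Java that runs outside KNIME. Names are matched with the '!' delimiter after normalising; codes and SMILES are compared exactly, because SMILES is case-sensitive.

```java
class SaltDictionary {

    // Column positions in each dictionary row (hypothetical names)
    static final int CODE = 0;
    static final int NAMES = 1;
    static final int SMILES = 2;

    // Cut-down dictionary for illustration: {code, "!name!name!", SMILES}
    static final String[][] DICT = {
        {"FORMATE", "!Formic acid!Methanoic acid!", "O=CO"},
        {"ACETATE", "!Acetic acid!Ethanoic acid!", "CC(=O)O"},
        {"TFA", "!Trifluoroacetic acid!tfa!", "FC(F)(F)C(O)=O"}
    };

    // Lowercase and strip spaces, as the snippet does for names
    static String normalise(String s) {
        return s.trim().toLowerCase().replace(" ", "");
    }

    // Search column searchCol for query and return column returnCol
    // of the first matching row, or null when nothing matches.
    static String lookup(String query, int searchCol, int returnCol) {
        for (String[] row : DICT) {
            boolean hit;
            if (searchCol == NAMES) {
                // delimiter-based match on the normalised name list
                hit = normalise(row[NAMES]).contains("!" + normalise(query) + "!");
            } else {
                // codes and SMILES: exact, case-sensitive comparison
                hit = row[searchCol].equals(query.trim());
            }
            if (hit) {
                return row[returnCol];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(lookup("acetic acid", NAMES, CODE)); // name to code
        System.out.println(lookup("CC(=O)O", SMILES, CODE));    // SMILES to code
        System.out.println(lookup("TFA", CODE, SMILES));        // code to SMILES
    }
}
```

Returning null for "not found" is just one option; inside a KNIME snippet a fixed message string, as in my original code, may be easier to spot in the output table.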

One thing to remember is that each dictionary term needs to be delimited by a unique character, in the above example '!'. The reason is that without the delimiter, the query term 'acetic acid' would match both of the following terms from the above example:

Trifluoroacetic acid
Acetic acid

By adding the delimiter to both sides of the query and of every dictionary term this is avoided, and only 'Acetic acid' matches.
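
The effect of the delimiter can be seen in a small stand-alone sketch. The class DelimiterDemo and its two helper methods are made up for this illustration only; the strings are already lowercased with spaces removed, as the snippet does before searching.

```java
class DelimiterDemo {

    // Plain substring search, no delimiter
    static boolean plainMatch(String term, String query) {
        return term.indexOf(query) >= 0;
    }

    // Wrap both sides in '!' so only a whole term can match
    static boolean delimitedMatch(String term, String query) {
        String t = "!" + term + "!";
        String q = "!" + query + "!";
        // a whole-term hit is found at position 0, so test for >= 0
        return t.indexOf(q) >= 0;
    }

    public static void main(String[] args) {
        String query = "aceticacid";

        // Without the delimiter both terms contain the query
        System.out.println(plainMatch("trifluoroaceticacid", query)); // true
        System.out.println(plainMatch("aceticacid", query));          // true

        // With the delimiter only the exact term matches
        System.out.println(delimitedMatch("trifluoroaceticacid", query)); // false
        System.out.println(delimitedMatch("aceticacid", query));          // true
    }
}
```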

This example was for compound salt forms, but the same approach can be used wherever a normalised naming convention is needed. Anyhow, I hope this is of some use to you all. I apologise for the non-optimised Java, but I am really not a programmer. And of course, if anyone has a better idea, then please share.

Best regards,
Stanage

This code is great, a good example. Thanks.