3-class model with Morgan fp

Dear all,

I'm trying to build a Bayesian model with Morgan fingerprints and the fingerprint Bayesian learner in KNIME 2.4.1.

What is not really clear to me: what is the meaning of the target class? Does it correspond to the number of bins? My data set is a three-class data set:

http://code.google.com/p/rdkit/wiki/TrainAThreeClassSolubilityModel

Cheers & Thanks,

Paul

That's correct: the target column is the column containing the actual binned data, i.e. the column you wish to predict for compounds that lack the data. In this case it will be the binned solubility data.

In the Fingerprint Bayesian model, the target class value is the value you are interested in, e.g. high solubility. The effect is that, besides a predicted bin for each compound in your prediction dataset, you also get a score of how likely "high" solubility is: a high score is good, a negative score means it is unlikely.
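To illustrate why such a score can be positive or negative, here is a minimal Python sketch of a "target class vs. rest" fingerprint score. It assumes the Laplacian-corrected estimator commonly used for fingerprint Bayesian models; this thread does not state the exact formula the KNIME node uses, so treat the function names (`train_weights`, `score`) and the toy data as illustrative assumptions only.

```python
import math
from collections import Counter

def train_weights(fingerprints, labels, target):
    """Log-weight per fingerprint bit for the 'target vs. rest' problem.

    Assumption: Laplacian-corrected estimate log((A + 1) / (N * P + 1)),
    where N = molecules containing the bit, A = those in the target class,
    P = overall fraction of target-class molecules.
    """
    base_rate = sum(1 for lab in labels if lab == target) / len(labels)
    seen = Counter()  # molecules containing each bit
    hits = Counter()  # molecules containing the bit AND in the target class
    for bits, lab in zip(fingerprints, labels):
        for b in bits:
            seen[b] += 1
            if lab == target:
                hits[b] += 1
    return {b: math.log((hits[b] + 1) / (seen[b] * base_rate + 1)) for b in seen}

def score(weights, bits):
    """Sum the weights of the bits that are set; unseen bits contribute 0."""
    return sum(weights.get(b, 0.0) for b in bits)

# Toy data: each fingerprint is the set of its on-bit indices (made up).
fps = [{1, 2}, {1, 3}, {4, 5}, {4, 6}]
labels = ["high", "high", "low", "medium"]

w = train_weights(fps, labels, "high")
high_like = score(w, {1, 2})  # bits seen only in "high" molecules
low_like = score(w, {4, 5})   # bits never seen in "high" molecules
```

Bits that occur mostly in the target class get positive weights and bits that occur mostly elsewhere get negative ones, so `high_like` comes out positive and `low_like` negative.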

Simon.

Dear Simon,

thanks for your answer!

Now let's come to the next questions :- )

- What is the meaning of the "target class value"? In my case, I have the "solubility class" defined as the "class column", which can have three different values. In the fp Bayesian learner, I have to select one particular target class value.

- How do I define a three-class model? It appears to me that only a two-class model is available by default.

Cheers & Thanks,

Paul

Hi Paul,

Yes, the node solves only two-class problems, i.e. the selected class against the rest. To apply it to multi-class problems, you could (theoretically) use it in a loop, whereby you loop over the different class values ("low", "medium", "high") and then collect the scores in the loop end. These will be named "Score (low)", "Score (medium)", "Score (high)" by the corresponding predictor node.

The algorithm is based on the article

Prediction of Biological Targets for Compounds Using Multiple-Category Bayesian Models Trained on Chemogenomics Databases, Nidhi, Meir Glick, John W. Davies, and Jeremy L. Jenkins, J. Chem. Inf. Model., 2006, 46 (3), pp 1124–1133

... they consider multi-class problems, whereby the prediction is done by calculating a two-class model for each of the classes and then predicting the class whose score is largest. The loop above should be equivalent.
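As a minimal sketch of that last step, assuming the three per-class scores have already been collected for each compound (the numbers below are made up, and `predict_class` is just an illustrative helper, not a KNIME function):

```python
def predict_class(scores):
    """Return the class value whose one-vs-rest score is largest.

    scores: dict mapping class value -> score for a single compound.
    """
    return max(scores, key=scores.get)

# Made-up scores for two compounds, one entry per class value.
rows = [
    {"low": -1.2, "medium": 0.3, "high": 2.1},
    {"low": 1.5, "medium": -0.4, "high": -2.0},
]

predicted = [predict_class(r) for r in rows]
```

The same row-wise "take the column with the largest score" step is what would turn the collected "Score (...)" columns into a single predicted class column.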

Regards,
  Bernd

Dear Bernd,

thanks for your reply! I'm struggling with the way the loop works in KNIME. I would have thought that the solubility class could be assigned to a loop variable, but this seems not to be the case.
Please see a screenshot of the workflow:
http://dl.dropbox.com/u/23897494/KNIME_loop_variable.png
 
Cheers & Thanks,
Paul

The Bayesian fingerprint model will tell you which of the 3 solubility bins is predicted if you tick "Crisp Predictions" in the Bayesian Fingerprint node, but you won't get a score for each bin like you do for the one specified value.

I find this good enough: with Crisp Predictions ticked you get a score of how likely the target class value you selected is (i.e. how likely high solubility is), and you also get a "crisp prediction", i.e. the predicted bin value (low, medium, high).

Hope that is useful.

If you really want a score for each bin, then to do the variables bit, take the dataset, attach a GroupBy node and group by "Binned Solubility"; no aggregation columns are needed. Now attach a Table Row To Variable Loop Start. Then attach the rest of your workflow with the Bayesian Fingerprint nodes, and in the Learner node go to the Flow Variables tab and, for the Target Class setting, select the Binned Solubility variable from the dropdown. After your prediction node, complete the loop with a Loop End node. I hope this helps.

Simon.

Dear Simon,

thanks for your hints on the CrispPrediction, they are really helpful!

However, I'm struggling with the second part of your answer: when grouping by the SolubilityClass, I cannot pass the flow variable over to the Bayesian fingerprint learner. Please check this screenshot: http://dl.dropbox.com/u/23897494/KNIME_groupby.png

Cheers & Thanks,

Paul

What you need to do is connect the red blob of the Table Row To Variable Loop Start node to the 1st red blob (left blob) of the Bayesian Learner node. Then connect the main dataset (i.e. the output of the node just before the GroupBy node) to the input arrow of the Bayesian Learner node.

Now in the Bayesian Learner node, go to the Flow Variables tab, and next to TargetClass you should be able to select the variable from the dropdown (I believe it will be called SOL_classification in your case). Then connect up your Bayesian Predictor in the usual way, and at the end of your workflow put a Loop End (Column Append).
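For anyone who finds the loop easier to follow as code, here is a rough Python analogy of what the workflow does. It is only a sketch: `toy_score` is a made-up stand-in for the Learner + Predictor pair (it is not the KNIME algorithm), and the data, the `fp` key and the variable names are assumptions for illustration.

```python
def toy_score(train, target, bits):
    """Stand-in for Learner + Predictor; NOT the KNIME algorithm.

    Adds the bit overlap with training rows of the target class and
    subtracts the overlap with all other rows.
    """
    total = 0
    for row in train:
        overlap = len(bits & row["fp"])
        total += overlap if row["SOL_classification"] == target else -overlap
    return total

# Main dataset (fingerprints as sets of on-bit indices, made up).
train = [
    {"fp": {1, 2}, "SOL_classification": "high"},
    {"fp": {3, 4}, "SOL_classification": "medium"},
    {"fp": {5, 6}, "SOL_classification": "low"},
]
to_predict = [{"fp": {1, 2, 3}}, {"fp": {5, 6}}]

# GroupBy: one row per distinct class value.
class_values = sorted({row["SOL_classification"] for row in train})

# Loop start: each class value becomes the "flow variable" for one iteration.
# Loop end (column append): every iteration adds one score column.
result = [dict(row) for row in to_predict]
for target in class_values:
    for row in result:
        row[f"Score ({target})"] = toy_score(train, target, row["fp"])
```

Note that the learner always sees the full dataset in every iteration; only the target class value changes from one iteration to the next.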

The attached workflow snapshot may help.

Hope this helps, let me know if you are still struggling. Takes a while to get used to controlling variables.

Simon.

Hi Simon,

This is certainly helpful. I'm working on a similar problem, trying to classify mouse liver microsome stability data into 3 classes (low, medium, high) using RDKit fingerprints as descriptors. This seems to work quite well based on the Bayesian scores I get for each of the classes.

However, when ticking the 'Append Crisp Class prediction' option in the Predictor node, I get 3 class prediction columns (one for each iteration). These, however, contain either just one or two classes, but never three (1st iteration 1 class, 2nd iteration 2 classes, 3rd iteration 1 class). The scores are not affected. The predicted classes for each iteration depend on the sorting order of the Grouping node where the classes are defined (the scores do not).

How can I get one overall predicted class column out of this?

Thanks!