Hi, I am wondering if someone knows what the parameters in the Naive Bayes Learner node mean. With the standard parameters, the binary classifier I am running performs very well on the actives and very poorly on the inactives, even though the dataset is heavily imbalanced in favor of inactives. I would like to optimize the parameters to see whether it can perform well on both classes with some combination of the 3 parameters, but I can't seem to find what they mean. The 3 parameters I am referring to are default probability, minimum standard deviation, and threshold standard deviation.
Hi @Haseeb23 -
The parameters are described in the node's description pane, and you can also find them on the Hub page for the Naive Bayes Learner node (click on the Options tab).
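As a rough intuition for what the standard-deviation settings do: a Gaussian Naive Bayes model estimates a per-class mean and standard deviation for each numeric feature, and a near-zero standard deviation makes the Gaussian likelihood blow up, so learners clamp it to a floor. This is a hedged sketch using scikit-learn's `GaussianNB`, whose `var_smoothing` parameter plays a similar (not identical) role to KNIME's minimum standard deviation; the dataset here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic imbalanced data standing in for the actives/inactives set.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# var_smoothing adds a fraction of the largest feature variance to every
# per-class variance, i.e. it enforces a variance floor, much like a
# minimum-standard-deviation setting. Larger values smooth the model more.
for vs in (1e-9, 1e-3, 1e-1):
    clf = GaussianNB(var_smoothing=vs).fit(X_tr, y_tr)
    print(f"var_smoothing={vs}: accuracy={clf.score(X_te, y_te):.3f}")
```

Sweeping this value shows how much the class-conditional Gaussians get flattened; it changes probability estimates, but it rarely fixes a class-imbalance problem on its own.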
Is there a reason why you want to use Naive Bayes in particular, rather than a different algorithm? Naive Bayes requires some fairly strict constraints on your data to perform well.
I was using Naive Bayes because, with the standard settings, it correctly predicts all of the actives in my dataset. The dataset is unbalanced, so without any modifications to the training set this is very difficult to achieve. I am looking to see whether I can get a good model by optimizing the Naive Bayes parameters so that it also predicts the inactives well. I just want to see whether there are any good models I can come up with before resorting to training set modifications with oversampling/undersampling. In addition to the Naive Bayes, I am using an SVM to accomplish this. What would you recommend?
I think it is unlikely that you will be able to get a model that predicts both classes to your satisfaction by only tweaking the Naive Bayes parameters. I would be inclined to try an undersampling approach here, along with other algorithms apart from Naive Bayes - maybe a few different tree-based methods?
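The undersampling idea above can be sketched outside KNIME as well. This is a minimal, hedged example using NumPy and scikit-learn (synthetic data, a plain random forest): the majority class is randomly downsampled to match the minority class before training, and the model is scored with balanced accuracy so both classes count equally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Randomly undersample the majority class (label 0) down to the size
# of the minority class (label 1) in the training set only.
maj = np.flatnonzero(y_tr == 0)
mino = np.flatnonzero(y_tr == 1)
keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])

clf = RandomForestClassifier(random_state=0).fit(X_tr[keep], y_tr[keep])
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))
print(f"balanced accuracy after undersampling: {bal_acc:.3f}")
```

Note that only the training set is resampled; the test set keeps its natural imbalance so the evaluation stays honest.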
I have tried using an RDKit diversity filter to reduce the majority class (I am working with chemical compounds). I have also tried using SMOTE to increase the minority class to equal the majority class, and even to surpass it. Nothing seems to be giving the results I am hoping to obtain. Thus far, I have used this approach with the tree-based methods as you suggested. I am getting an accuracy of 70% for each class, which I am hoping to increase. The AUC isn't too bad; it is above 0.70 after using the Binary Classification Inspector to change the threshold.
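The threshold change mentioned above (what the Binary Classification Inspector does interactively) can also be done programmatically. A hedged sketch with scikit-learn on synthetic data: train a model, take the predicted probabilities for the positive class, and sweep the decision threshold to find the one that maximizes balanced accuracy instead of accepting the default 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # P(class = 1) per sample

# Sweep thresholds from 0.05 to 0.95 and keep the best balanced accuracy.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [balanced_accuracy_score(y_te, proba >= t) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f"best threshold={best_t:.2f}, balanced accuracy={max(scores):.3f}")
```

Ideally the threshold would be chosen on a validation split rather than the test set; it is applied to the test set here only to keep the sketch short.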
Try a parameter optimization loop. Not sure if it's necessary, but did you scale your data?
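Both suggestions (a parameter optimization loop, and scaling, which matters especially for SVMs) can be combined in one sketch. This is a hedged example using scikit-learn's `Pipeline` and `GridSearchCV` rather than KNIME's optimization loop nodes; the grid values are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Putting the scaler inside the pipeline means it is re-fit on each
# training fold, so no information leaks from the validation folds.
pipe = Pipeline([("scale", StandardScaler()),
                 ("svm", SVC(class_weight="balanced"))])
grid = GridSearchCV(pipe,
                    {"svm__C": [0.1, 1, 10],
                     "svm__gamma": ["scale", 0.01, 0.1]},
                    scoring="balanced_accuracy", cv=3)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
print(f"test balanced accuracy: {grid.score(X_te, y_te):.3f}")
```

Scoring the search with `balanced_accuracy` (rather than plain accuracy) keeps the optimization from favoring models that just predict the majority class; `class_weight="balanced"` pushes the SVM in the same direction.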