Applying higher misclassification costs to decision trees for imbalanced datasets

Hi,

I have a highly imbalanced dataset and need to apply higher misclassification costs to improve the minority class in my predictions. I am carrying out categorical binary classification, where I need to improve the false positive rate (the minority class prediction).

I am aware that WEKA has this capability (the cost-sensitive classifier); however, I keep getting error messages and was hoping there might be other nodes out there that I could use. The error message I get is:

Execute failed: Length of probability estimates don't match cost matrix

If someone could post a workflow with a successful use of these Weka nodes with costs applied, that would really help me too!

Additionally, are there nodes out there that allow me to pick the features that go into each level of my trees? I.e., I want to always select feature 1 at the top of my trees, as it might not be picked by the algorithm for the first split of my decision tree. This is a type of interactive decision tree learner, and I was wondering if there were any nodes out there for this?

I was also thinking that perhaps I could use the R integration nodes to carry this out, so any input on this would be helpful too!

Thank you for any help,

Danielle

Hi Danielle, 

There isn't a way to do this directly in KNIME (or the Weka nodes), but there are a couple of options to address class imbalance.  

First, you could use the Equal Size Sampling node to downsample your overabundant class.
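If you'd like to prototype the same idea in an R Snippet node, here is a minimal sketch of equal-size downsampling in base R. The data frame, column names, and class labels are placeholders for your own table:

    # Toy imbalanced data; replace with your own table (e.g. knime.in)
    set.seed(42)
    df <- data.frame(
      x = rnorm(1000),
      class = factor(c(rep("majority", 950), rep("minority", 50)))
    )

    # Equal-size downsampling: keep every minority row and draw an
    # equally sized random sample from the majority rows
    minority <- df[df$class == "minority", ]
    majority <- df[df$class == "majority", ]
    balanced <- rbind(minority,
                      majority[sample(nrow(majority), nrow(minority)), ])
    table(balanced$class)  # both classes now have 50 rows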

Alternatively, it is also possible to oversample using the SMOTE node, which is a bit more sophisticated than just dropping rows, but comes at a small performance cost from adding synthetic data to your training set.
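If you prefer the R route for this step as well, the SMOTE algorithm is available through the DMwR package (assuming it is installed); a minimal sketch, again with placeholder data and column names:

    library(DMwR)  # provides SMOTE(); package assumed to be installed
    set.seed(42)
    df <- data.frame(
      x = rnorm(1000),
      y = rnorm(1000),
      class = factor(c(rep("majority", 950), rep("minority", 50)))
    )

    # perc.over = 200 creates two synthetic minority rows per original;
    # perc.under = 200 keeps two majority rows per synthetic row created
    smoted <- SMOTE(class ~ ., data = df, perc.over = 200, perc.under = 200)
    table(smoted$class)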

Finally, in R, some algorithms support a weights parameter in the model-building call. In these cases you will need to look at the specific implementation for details, but it's often just an additional argument like: weights = knime.in$"weights".
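For example, rpart (a standard R decision tree package) accepts both per-row case weights and an explicit misclassification-cost matrix via its parms argument; here is a minimal sketch where the data, the 10x weight, and the cost values are all illustrative:

    library(rpart)
    set.seed(42)
    df <- data.frame(
      x = rnorm(1000),
      class = factor(c(rep("majority", 950), rep("minority", 50)))
    )

    # Option 1: per-row case weights, upweighting the minority rows;
    # in a KNIME R Snippet this could come in as knime.in$"weights"
    w <- ifelse(df$class == "minority", 10, 1)
    fit_weighted <- rpart(class ~ x, data = df, weights = w)

    # Option 2: an explicit misclassification-cost (loss) matrix.
    # Rows are the true class, columns the predicted class, ordered
    # by the factor levels ("majority", "minority"); the diagonal
    # must be zero. Here a missed "minority" row costs 10x as much
    # as a missed "majority" row.
    costs <- matrix(c(0, 10, 1, 0), nrow = 2)
    fit_cost <- rpart(class ~ x, data = df, parms = list(loss = costs))

The loss matrix option is probably the closest in spirit to what you were attempting with the Weka cost-sensitive classifier, since the costs enter the tree building directly rather than through resampling.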

Best regards,

Aaron