How to treat outliers for classification analysis (decision tree, neural networks etc.)?

akfh17 · June 28, 2017, 7:49pm

Hello,

I am a beginner with KNIME and need a little help. We are doing our first KNIME project, in which we want to build a prediction model. The question is how we should treat outliers. For a prediction, the model needs all the information it can get, so we are not sure whether we really should remove or substitute all the outliers we identified with a box plot. We decided to bin a few variables with the Numeric Binner node so that we wouldn't have the outlier problem for them. But should we do this for all variables? And how should we do this for neural networks, since that method doesn't accept any string variables?

Thank you in advance!

Vincenzo · July 4, 2017, 3:54pm

Hi akfh17,

Could you please provide more information on what you want to achieve? What are you trying to predict? What kind of data do you have?

Below is some information that might be useful, based on what you've asked.

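Most of the more sophisticated machine learning methods cope with outliers to some degree. The decision tree algorithm in particular is quite robust to outliers in the input variables, because a split depends only on which side of the threshold a value falls, not on how extreme the value is.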
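To illustrate that point, here is a minimal Python sketch (not KNIME code) of how a tree picks a split on a single numeric variable. The function names `gini` and `best_threshold` and the toy data are made up for this example, and real decision tree learners differ in the details. It only shows that making an outlier more or less extreme leaves the chosen split unchanged.

```python
def gini(labels):
    """Gini impurity for a list of binary (0/1) class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_threshold(xs, ys):
    """Return the split threshold with the lowest weighted Gini impurity.

    Candidate thresholds are midpoints between consecutive distinct values.
    """
    pairs = sorted(zip(xs, ys))
    best_score, best_t = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score, best_t = score, t
    return best_t


labels = [0, 0, 0, 1, 1, 1, 1]
with_outlier = [1, 2, 3, 10, 11, 12, 1000]   # 1000 is an extreme value
without_outlier = [1, 2, 3, 10, 11, 12, 13]  # same data, outlier tamed

split_a = best_threshold(with_outlier, labels)
split_b = best_threshold(without_outlier, labels)
```

Both calls pick the same threshold between 3 and 10, so the extreme value has no influence on the split in this toy case.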

Have you already run an outlier detection on the variables that you want to consider? One possible technique is to use the JavaScript Box Plot node to identify the outliers and then filter out the affected rows with the Row Filter node. You may also find this whitepaper useful, which shows how to implement seven techniques for dimensionality reduction: https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf
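For reference, the rule a box plot typically uses to flag outliers can be sketched in a few lines of Python (outside KNIME). This assumes the common 1.5 × IQR whisker convention and one particular quartile definition. The helper names `iqr_bounds` and `filter_outliers` and the sample data are hypothetical, and the exact quartile method in a given tool may differ slightly.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_outliers(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies within the fences."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]


# Toy table: one numeric column with a single extreme value.
rows = [{"age": a} for a in (23, 25, 27, 29, 31, 33, 35, 120)]
bounds = iqr_bounds([row["age"] for row in rows])
cleaned = filter_outliers(rows, "age")
```

Here the row with age 120 falls above the upper fence and is dropped, while the other seven rows are kept. Whether dropping it is the right choice still depends on the data and the prediction task.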

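For the MLP algorithm you can only use double (numeric) values as input, so string columns such as the output of the Numeric Binner would have to be converted to numbers first, for example with the One to Many node. For the PNN, the data does not need to be normalized. The RProp MLP, on the other hand, should be given normalized data.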
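As a rough illustration of what that preparation amounts to, here is a small Python sketch (not KNIME code) of min-max normalization to the range [0, 1] and one-hot encoding of a binned string column. The function names, column values, and bin labels are invented for the example. In KNIME itself the Normalizer and One to Many nodes should cover roughly the same two steps.

```python
def min_max_scale(values):
    """Scale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Turn string labels into 0.0/1.0 indicator columns, one per category."""
    categories = sorted(set(labels))
    encoded = [[1.0 if label == c else 0.0 for c in categories] for label in labels]
    return categories, encoded


# Hypothetical numeric column and a binned (string) column.
incomes = [20000, 35000, 50000, 80000]
income_bins = ["low", "mid", "mid", "high"]

scaled = min_max_scale(incomes)
categories, encoded = one_hot(income_bins)
```

After this step every input is a double, which is the form the MLP expects. Note that min-max scaling is itself sensitive to outliers, since a single extreme value compresses all the other values into a narrow range.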

Hope that helps,

Best,

Vincenzo