What preprocessing techniques should I use for Decision Trees and Random Forests?


I read that both were robust to errors and missing values.



If you can, it’s a good idea to handle missing values before fitting these models. Depending on the implementation, missing values are either rejected outright or treated as a category of their own. The latter is better than nothing but, depending on your use case, may not be ideal. So if you have enough data to impute missing values under reasonable assumptions, you probably should.
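
As a rough illustration of the imputation idea, here is a minimal sketch that fills missing numeric values with the column median and adds a 0/1 flag recording which values were missing, so the trees can still split on "was missing". It uses only the standard library. The function name `impute_median` and the toy data are made up for this example, and the median is just one possible assumption. In practice a library imputer (for example scikit-learn's `SimpleImputer`) would usually do this job.

```python
import math
from statistics import median


def impute_median(rows):
    """Fill missing values (None or NaN) with the per-column median.

    Returns (filled_rows, medians). Each filled row gets one extra 0/1 flag
    per column recording whether that value was originally missing.
    Assumes every column has at least one observed value.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not is_missing(r[j])]
        medians.append(median(observed))

    filled = []
    for r in rows:
        values = [medians[j] if is_missing(v) else v for j, v in enumerate(r)]
        flags = [1 if is_missing(v) else 0 for v in r]
        filled.append(values + flags)
    return filled, medians


# Toy data (hypothetical): columns are age and income.
rows = [
    [25.0, 40000.0],
    [None, 52000.0],
    [31.0, float("nan")],
    [40.0, 61000.0],
]
filled, medians = impute_median(rows)
```

To avoid leaking information, the medians should be computed on the training split only and then reused to fill the validation and test data.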

The other big concern is class imbalance (for classification problems). If needed, you can resample the training data with SMOTE, which synthesizes new minority-class examples, or with simple random under- or oversampling.
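
To make the resampling idea concrete, here is a minimal sketch of plain random oversampling using only the standard library. Note that this is not SMOTE: it duplicates existing minority-class rows rather than interpolating new synthetic ones. The function name `random_oversample`, the fixed seed, and the toy data are assumptions for illustration. A maintained implementation (for example from the imbalanced-learn package) is usually the better choice in real work.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group the feature rows by class label.
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = max(len(members) for members in by_class.values())

    X_out, y_out = [], []
    for label, members in by_class.items():
        # Draw (with replacement) the rows needed to reach the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for features in members + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): five rows of class 0, one row of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
counts = Counter(y_res)  # both classes now have 5 rows
```

Resampling should be applied to the training split only. If the test set is resampled too, the reported metrics no longer reflect the real class distribution.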

In addition, you can use standard techniques to remove correlated features, normalize variables (rarely necessary for tree-based models, since splits depend only on the ordering of values), and deal with extreme outliers, but these probably aren’t as urgent as the two issues above.
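
For the correlated-features step, one common heuristic is to drop any feature whose absolute Pearson correlation with an already-kept feature exceeds a threshold. The sketch below assumes that heuristic. The names `pearson` and `drop_correlated`, the 0.95 threshold, and the toy columns are all illustrative choices, not a recommendation for any particular dataset.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(columns, threshold=0.95):
    """Return the names of columns to keep. A column is kept only if its
    absolute correlation with every already-kept column is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical): height in inches duplicates height in cm.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40.0, 37.0, 44.0, 39.0, 42.0],
}
kept = drop_correlated(columns)
```

Which member of a correlated pair survives depends on column order here, so it is worth ordering the columns so that the one you would rather keep comes first.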

Does that help?