I’m building models with linear regression, a simple regression tree, and a random forest to predict housing prices (Ames, Iowa). (I don’t want to change the prediction models, as I want to use this as a first limitation.)
Another limitation is that I am allowed to use a maximum of 3 features to predict SalePrice.
The problem is that I can’t get an RMSE lower than $30,000, which is quite significant for houses (it corresponds to a 13.7% mean absolute percentage error; achieved by the random forest predictor).
What are good ways to optimize this model? The random forest model is also a bit overfitted, but I couldn’t improve on that side either.
What are some ‘quick wins’ to optimize random forest predictions?
What I’ve already tried: normalizing my data (z-score or min-max), cross-validation (somehow not yielding any improvements), and optimization loops over tree depth and the number of trees.
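To illustrate, the tuning loop I ran looks roughly like this (a simplified scikit-learn sketch; the file path and feature names are placeholders, not my actual setup):

```python
# Simplified sketch of the tuning loop (scikit-learn); the path and
# feature names below are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("ames.csv")  # placeholder path
X = df[["OverallQual", "GrLivArea", "TotalBsmtSF"]]  # any 3 features
y = df["SalePrice"]

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 8, 12, None],
    "min_samples_leaf": [1, 5, 10],  # larger leaves tend to curb overfitting
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```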
Thank you in advance for your help as I’m quite new to this platform.
Indeed, 3 is not much. But anyway, see this workflow from the KNIME Hub. Using the Python module itertools, this workflow makes it possible to loop over all possible combinations of columns (you can set the selection size; in this flow it is 3) to train a model (simple regression).
In this workflow a model is trained on the different combinations of features in the Boston Housing dataset.
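In plain Python, the idea is roughly this (a sketch assuming scikit-learn and numeric feature columns; in the workflow itself the looping is done with KNIME nodes):

```python
# Sketch: score every 3-column combination with a simple linear regression.
# Assumes scikit-learn and a numeric DataFrame; path and column names
# are placeholders. With many columns this loop gets slow.
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("housing.csv")  # placeholder path
target = "SalePrice"
features = [c for c in df.columns if c != target]

results = []
for subset in combinations(features, 3):  # every 3-column combination
    rmse = -cross_val_score(
        LinearRegression(),
        df[list(subset)],
        df[target],
        scoring="neg_root_mean_squared_error",
        cv=5,
    ).mean()
    results.append((subset, rmse))

for subset, rmse in sorted(results, key=lambda r: r[1])[:5]:  # 5 best triples
    print(subset, round(rmse))
```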
Hi @StfnS The model is in the Metanode “Model and Scors”, the third from the right. To open it, just double-click. Inside there is a Random Forest Learner (Regression) node.
You can replace the Learner and the Predictor. I think adjusting some parameters within the Learner needs some attention.
gr. Hans
Ah @StfnS, you can drag and drop the workflow directly from the Hub (just follow the link in my previous post). Or just download it from this link: Control variables in a loop.knwf (485.7 KB).
gr. Hans
The question would be which three variables to select …
One thing you could try is to employ a tool like vtreat (or featuretools in Python) and see if some of its transformations help; a rough sketch follows below. Typically, though, such tools are used to reduce dimensions if you have ‘too much’ data.
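To sketch the idea without tying it to a specific vtreat/featuretools version, here is the same pattern with scikit-learn’s PolynomialFeatures as a stand-in: derive extra columns first, then let the 3-feature search run over the enlarged pool (path and column names are placeholders):

```python
# Stand-in for vtreat/featuretools-style feature generation, using
# scikit-learn's PolynomialFeatures; assumes numeric columns,
# placeholder path and names.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("housing.csv")  # placeholder path
X = df.drop(columns=["SalePrice"])  # assumes remaining columns are numeric

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_ext = pd.DataFrame(
    poly.fit_transform(X),
    columns=poly.get_feature_names_out(X.columns),
)
# X_ext now holds the original columns plus all pairwise interaction terms;
# the 3-feature combination search can run over this wider table.
print(X_ext.shape)
```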
As it happens, the vtreat example is also about house prices. You could also try to limit the AutoML set of models to the ones you want to use and see what H2O.ai comes up with.
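If you try the H2O route, restricting AutoML to your model families could look roughly like this (a sketch; GLM stands in for linear regression and DRF for the random forest, since AutoML has no plain single decision tree; path and column names are placeholders):

```python
# Sketch: restrict H2O AutoML to GLM (linear) and DRF (random forest).
# Path and column names are placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
frame = h2o.import_file("housing.csv")  # placeholder path
y = "SalePrice"
x = [c for c in frame.columns if c != y]

aml = H2OAutoML(include_algos=["GLM", "DRF"], max_models=10, seed=1)
aml.train(x=x, y=y, training_frame=frame)
print(aml.leaderboard)
```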