I’m building models with linear regression, a simple regression tree, and a random forest to predict housing prices (Ames, Iowa). (I don’t want to change the prediction models, as I want to use this as a first limitation.)
Another limitation is that I am allowed to use a maximum of 3 features to predict SalePrice.
The problem is that I can’t get an RMSE lower than $30,000, which is quite significant for houses (it corresponds to a 13.7% mean absolute percentage error; achieved by the random forest predictor).
What are good ways to optimize this model? The random forest model is also a bit overfitted, but I couldn’t improve on that side either.
What are some ‘quick wins’ to optimize random forest predictions?
What I’ve already tried: normalizing my data (z-score or min-max), cross-validation (somehow not yielding any improvements), and optimization loops over tree depth and the number of trees.
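To illustrate, the tuning loop I ran looks roughly like this (a simplified scikit-learn sketch; the file path and feature names are placeholders, not my actual setup):

```python
# Simplified sketch of the tuning loop (scikit-learn); the path and
# feature names below are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("ames.csv")  # placeholder path
X = df[["OverallQual", "GrLivArea", "TotalBsmtSF"]]  # any 3 features
y = df["SalePrice"]

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 8, 12, None],
    "min_samples_leaf": [1, 5, 10],  # larger leaves tend to curb overfitting
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```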
Thank you in advance for your help as I’m quite new to this platform.
Indeed, 3 is not much. But anyway, see this workflow from the KNIME Hub. Using the Python module itertools, this workflow makes it possible to loop over all possible combinations of columns (you can set the selection size; in this flow it is 3) to train a model (simple regression).
In this workflow a model is trained on the different combinations of features in the Boston Housing dataset.
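In plain Python, the idea is roughly this (a sketch assuming scikit-learn and numeric feature columns; in the workflow itself the looping is done with KNIME nodes):

```python
# Sketch: score every 3-column combination with a simple linear regression.
# Assumes scikit-learn and a numeric DataFrame; path and column names
# are placeholders. With many columns this loop gets slow.
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("housing.csv")  # placeholder path
target = "SalePrice"
features = [c for c in df.columns if c != target]

results = []
for subset in combinations(features, 3):  # every 3-column combination
    rmse = -cross_val_score(
        LinearRegression(),
        df[list(subset)],
        df[target],
        scoring="neg_root_mean_squared_error",
        cv=5,
    ).mean()
    results.append((subset, rmse))

for subset, rmse in sorted(results, key=lambda r: r[1])[:5]:  # 5 best triples
    print(subset, round(rmse))
```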
Hi @StfnS The model is in the Metanode “Model and Scors”, the third from the right. To open it, just double-click. Inside there is a Random Forest Learner (Regression) node.
You can replace the Learner and the Predictor. I think adjusting some parameters within the Learner needs some attention.
gr. Hans
Ah @StfnS, you can drag and drop the workflow directly from the Hub (just follow the link in my previous post). Or just download it from this link: Control variables in a loop.knwf (485.7 KB).
gr. Hans
The question would be which three variables to select …
One thing you could try is to employ a tool like vtreat (or featuretools in Python) and see if some of its transformations help; a rough sketch follows below. Typically, though, such tools are used to reduce dimensions if you have ‘too much’ data.
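To sketch the idea without tying it to a specific vtreat/featuretools version, here is the same pattern with scikit-learn’s PolynomialFeatures as a stand-in: derive extra columns first, then let the 3-feature search run over the enlarged pool (path and column names are placeholders):

```python
# Stand-in for vtreat/featuretools-style feature generation, using
# scikit-learn's PolynomialFeatures; assumes numeric columns,
# placeholder path and names.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("housing.csv")  # placeholder path
X = df.drop(columns=["SalePrice"])  # assumes remaining columns are numeric

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_ext = pd.DataFrame(
    poly.fit_transform(X),
    columns=poly.get_feature_names_out(X.columns),
)
# X_ext now holds the original columns plus all pairwise interaction terms;
# the 3-feature combination search can run over this wider table.
print(X_ext.shape)
```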
As it happens, the vtreat example is also about house prices. You could also try to limit the AutoML set of models to the ones you want to use and see what H2O.ai comes up with.
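If you try the H2O route, restricting AutoML to your model families could look roughly like this (a sketch; GLM stands in for linear regression and DRF for the random forest, since AutoML has no plain single decision tree; path and column names are placeholders):

```python
# Sketch: restrict H2O AutoML to GLM (linear) and DRF (random forest).
# Path and column names are placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
frame = h2o.import_file("housing.csv")  # placeholder path
y = "SalePrice"
x = [c for c in frame.columns if c != y]

aml = H2OAutoML(include_algos=["GLM", "DRF"], max_models=10, seed=1)
aml.train(x=x, y=y, training_frame=frame)
print(aml.leaderboard)
```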