Numeric Scorer gives a negative R^2 (R-squared)

I’ve tried three different prediction models for my project: Linear Regression, XGBoost, and Random Forest. All three gave me a negative R² value (see screenshot). My data is clean and has no missing values.

What can I do to improve my model?

[screenshot: r-squared scores]

Any ideas regarding improving or fixing this model are greatly appreciated…
The data is clean, so it must be a setting issue I’m missing!
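For context on what a negative score means: R² compares the model against a trivial baseline that always predicts the mean of the target, so a negative value means the model does *worse* than that baseline. A minimal numpy sketch with made-up numbers shows how this happens:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; negative when the model is
    worse than always predicting the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Predictions centred on the right mean but pointing the wrong way:
bad_pred = np.array([40.0, 30.0, 20.0, 10.0])
print(r_squared(y_true, bad_pred))   # -3.0

# Predictions close to the truth:
good_pred = np.array([11.0, 19.0, 31.0, 39.0])
print(r_squared(y_true, good_pred))  # ≈ 0.992
```

So a negative R² across three different algorithms usually points at the evaluation setup (wrong target column, leakage, mismatched rows) rather than at the algorithms themselves.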

If you could tell us a bit about your task and your data, we might get a better idea of what could help.
It would also help a lot if you could provide sample data, though I understand that is often not possible. You mention settings; could you tell us which settings you have in mind, and which ones you have used?

Here are the first points I would check about the ‘surroundings’:

  • you could run a Linear Correlation and see which variables influence your target the most
  • use a model like the “H2O Gradient Boosting Machine Learner (Regression)”, which gives you a list of variable importances that might also give you ideas
  • how did you get rid of your missing values? Did you do some replacement? You might need a missing-value replacement strategy
  • you could reverse-engineer your current score by feeding the prediction score as a target, together with the other variables, into a decision tree
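The correlation check in the first bullet can be sketched in a few lines; the column names here are hypothetical stand-ins for the real data:

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset (hypothetical columns).
df = pd.DataFrame({
    "Item_MRP":        [50.0, 100.0, 150.0, 200.0, 250.0],
    "Item_Visibility": [0.05, 0.02, 0.08, 0.01, 0.04],
    "Sales":           [500.0, 1100.0, 1400.0, 2100.0, 2450.0],
})

# Pearson correlation of every numeric column with the target, sorted by
# absolute strength -- a quick stand-in for KNIME's Linear Correlation node.
corr = df.corr(numeric_only=True)["Sales"].drop("Sales")
print(corr.reindex(corr.abs().sort_values(ascending=False).index))
```

Variables with correlations near zero are candidates for closer inspection or removal; a strong correlation confirms the feature carries signal about the target.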

Measures you can take to further improve your features/variables

  • see if you have extreme values and maybe get rid of them or replace them
  • if you know of any ratios of important variables you might set them beforehand - that sometimes helps a lot
  • you might get some insights from business experts for some tweaking of the data
  • see if all your date variables are computed in a relative way, e.g. not the sale date itself but something like days since sale
  • get rid of any ID columns or similar identifiers
  • get rid of highly correlated variables that basically contain no additional information
  • try normalizing numeric values or use a logarithmic scale (or both)
  • you could employ a R package like vtreat to automatically improve your features
  • you could try and employ some techniques like Principal Component Analysis (PCA) to create more compact features
  • if you have categorical data, you might apply some target (mean) encoding to it (if you can assume that the relationship will be stable in the future)
  • an example: you might replace a customer’s car brand with the average sale amount for that brand, because the amount someone buys in the future might be related, and having that as a number makes it easier for some models to take into consideration
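The brand-to-average-sale idea in the last bullet can be sketched with pandas; the brands and sales figures are made up, and the fallback to the global mean for unseen categories is my own addition:

```python
import pandas as pd

# Hypothetical example: replace a categorical brand column with the average
# sales per brand, computed on the TRAINING data only so that nothing leaks
# in from the test set.
train = pd.DataFrame({
    "brand": ["A", "A", "B", "B", "C"],
    "sales": [100.0, 140.0, 300.0, 260.0, 500.0],
})
test = pd.DataFrame({"brand": ["B", "C", "A"]})

brand_mean = train.groupby("brand")["sales"].mean()   # A: 120, B: 280, C: 500
global_mean = train["sales"].mean()                   # fallback for unseen brands

train["brand_enc"] = train["brand"].map(brand_mean)
test["brand_enc"] = test["brand"].map(brand_mean).fillna(global_mean)
print(test)
```

This only works if, as the bullet says, you can assume the brand-to-sales relationship stays stable in the future.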

You could also try different model strategies in addition to your current ones, although the ones you use do sound good:

  • maybe start with the models from the great H2O.ai nodes; they also do some feature engineering
  • the AutoML function from H2O would quickly give you an idea of what kind of model might be useful and where to take a closer look
  • a neural network might be worth trying to see if it makes a difference; sometimes with numeric targets it does (although the setup might not be easy)
  • some Auto Machine Learning to see where that might take you
  • try hyperparameter tuning if AutoML does not help
  • if you need some final gains on your model, letting an automatic model search or hyperparameter optimization run for some hours or even days might get you a few extra points
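The hyperparameter-tuning bullet can be illustrated with a minimal grid search; this is a generic numpy sketch over the regularisation strength of a closed-form ridge regression on synthetic data, not KNIME's own parameter-optimization loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the real task.
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, y_tr = X[:150], y[:150]      # fit here
X_val, y_val = X[150:], y[150:]    # compare hyperparameters here

def ridge_fit(X, y, alpha):
    """Closed-form ridge: w = (X^T X + alpha*I)^-1 X^T y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ y)

best_alpha, best_rmse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X_tr, y_tr, alpha)
    rmse = np.sqrt(np.mean((X_val @ w - y_val) ** 2))
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse

print(best_alpha, round(best_rmse, 3))
```

The important part is that candidates are compared on a validation set the fit never saw; the same pattern scales up to grids over tree depth, learning rate, and so on.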

@mlauder
Thank you for all your help…
I’m uploading my workflow and the data.

Any advice greatly appreciated…

BigMartSales 2.knwf (2.3 MB)

Train_Dataset.xls (1.6 MB)

Test_Dataset.xls (1.0 MB)

A few quick remarks (more on that later):

  • your RMSE score does not measure your target: it compares Item Visibility with the predicted sales, but you want to predict sales, right?
  • you do not split your training data into 2 or 3 parts, but instead seem to use the data you want to score later as a reference; I think that will not work
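The splitting issue in the second bullet is the key one: model quality has to be measured on rows the model never saw during fitting. A minimal sketch of a hold-out split on synthetic data (the same idea as KNIME's Partitioning node):

```python
import numpy as np

rng = np.random.default_rng(42)

# 1000 labelled rows standing in for the training file.
X = rng.normal(size=(1000, 4))
y = X[:, 0] * 2.0 + rng.normal(size=1000)

# Shuffle once, then carve off a hold-out set that the model
# never sees during fitting.
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train_idx, test_idx = idx[:cut], idx[cut:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))  # 800 200
```

Scoring on rows that were also used for training (or on a file with no true target at all) gives numbers that say nothing about real model quality.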

I will have more on that later. An RMSE of 1.300 seems achievable; not sure if that helps you.


I changed the workflow so it would run and produce a result you can interpret. I also added two other workflows with an H2O model and an XGBoost ensemble, but the numbers are not getting better. There might be some work to do with regard to normalisation and feature engineering. From what I see, the models get the direction right but do not match the exact numbers.


big_mart_sales.knar (1.3 MB)


Hi @mlauder

Thank you for your reply and help.
1- I see that using two separate files, one for training and one for testing, doesn’t work. How come? Isn’t this the standard?

2- If I merge my training data with my testing data and use the Partitioning node, how can I ensure that KNIME knows which data is which? My testing file does not have the column I need to predict (Item_outlet_sales), so how do I make sure KNIME does not try to use it to train the model?

Thanks in advance!

The question is what your target variable is. From the structure of your data, it seemed your second dataset was the one with the unknown target. You would have to define which column you want to predict, then split your original data into training and test sets. If your test data does not contain the target variable, you will not be able to test the quality of your model.

If you use the Partitioning node, you get two streams of data. You tell them apart by connecting the training nodes to the upper output port and the testing/predicting nodes to the lower one.
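The two streams can also be sketched outside KNIME; a numpy sketch with synthetic data, where the scoring file has no target column and is only ever predicted on, never used for evaluation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Labelled training file: features plus a known target.
X_lab = rng.normal(size=(300, 3))
y_lab = X_lab @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=300)

# Scoring file: same features, NO target column.
X_score = rng.normal(size=(50, 3))

# Split the labelled data into the two streams: train and test.
cut = 240
X_train, y_train = X_lab[:cut], y_lab[:cut]
X_test, y_test = X_lab[cut:], y_lab[cut:]

# Fit on the training stream (ordinary least squares here).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Quality is measured on the held-out test stream...
rmse = np.sqrt(np.mean((X_test @ w - y_test) ** 2))
# ...while the scoring file only ever receives predictions.
preds = X_score @ w
print(round(rmse, 3), preds.shape)
```

So there is no need to merge the two files: the labelled file is split into train/test, and the unlabelled file stays a pure scoring input.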

The modification of your workflow I uploaded renames your value to “Target”.

It might be useful to familiarise yourself with the concepts of predicting with a basic example, where you can still read the rules that are generated:
