Bad correlation

abhisek · December 30, 2018, 6:59am

Can anyone suggest why I am getting poor correlation on a new dataset, even though the training, test and validation sets all show good correlation (almost 89%)?

mlauber71 · December 30, 2018, 9:37am

You can check these things:

Check the importance of the variables in your model and see whether the top ones might contain ‘leaks’ that somehow give away the correct answer.
See whether you have variables like year stored as absolute rather than relative values (like year=2016), whose meaning might change over time.
Check whether your train, test and validation datasets are truly separated. E.g. you have households that you split by person, and one person ends up in test and another in training, but they share pretty much the same data and target.
See whether your new data contains all the variables in the same quality as your original set.
Think about what your accuracy means in the context of your business question (cf. links below).
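
To illustrate the point about truly separated datasets, here is a minimal Python sketch of a split that keeps every group (for example a household) entirely on one side. The function name `group_split`, the column `household_id` and the toy rows are assumptions made up for the example; they are not taken from the original workflow.

```python
import random
from collections import defaultdict

def group_split(rows, group_key, test_fraction=0.2, seed=0):
    """Split rows so that each group lands entirely in train or entirely in test."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Shuffle the group IDs, not the individual rows.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [r for g in group_ids if g not in test_ids for r in groups[g]]
    test = [r for g in group_ids if g in test_ids for r in groups[g]]
    return train, test

# Hypothetical data: 10 households with 3 persons each.
rows = [
    {"household_id": h, "person": p, "target": h % 2}
    for h in range(10)
    for p in range(3)
]

train, test = group_split(rows, "household_id")
train_households = {r["household_id"] for r in train}
test_households = {r["household_id"] for r in test}

# No household appears on both sides of the split.
print(train_households & test_households)
```

If the same check on your own split returns a non-empty set, the test score is probably too optimistic.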

Then you might provide us with more details or even a sample workflow; you can use fake data if you cannot share the original.

Models for 0/1 or Yes/No Targets

Understand metrics like AUC and Gini (and use H2O.ai)

Iris · December 30, 2018, 12:08pm

You can also run a cross-validation to see whether the accuracy differs across the runs.
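
As a rough sketch of that idea outside KNIME, the plain-Python example below runs a k-fold cross-validation and reports the correlation between predictions and targets for each fold. The straight-line model, the helper names and the synthetic data are assumptions chosen only to keep the example self-contained; your own model would take their place.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def cross_val_correlations(xs, ys, k=5, seed=0):
    """Return one prediction/target correlation per fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds

    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [slope * xs[i] + intercept for i in fold]
        scores.append(pearson(preds, [ys[i] for i in fold]))
    return scores

# Synthetic data: a noisy linear relationship (illustration only).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

scores = cross_val_correlations(xs, ys, k=5)
print([round(s, 3) for s in scores])
print("spread:", round(max(scores) - min(scores), 3))
```

A large spread between folds would suggest that the score depends heavily on which rows end up in which partition.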

Cheers, Iris

abhisek · December 30, 2018, 12:37pm

Thanks for the suggestions… Actually, I processed the entire dataset and then kept a certain portion of it aside, selected by data ID, to test for better confidence. Then I divided the remaining part into three (training, testing and validation sets) using two Partitioning nodes. When I found that the correlation was good on the testing set, I tried the model on the portion I had initially kept aside, and there it does not give good correlation…
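
With a setup like this, one thing that may be worth checking is whether the portion held out by ID looks like the rest of the data at all, in line with the earlier suggestion about the new data having the same quality. The following is only a sketch: the function `shifted_columns`, the threshold of 0.5 standard deviations and the columns `age` and `income` are hypothetical and not part of the original workflow.

```python
from statistics import mean, stdev

def shifted_columns(reference, holdout, columns, threshold=0.5):
    """Flag columns whose holdout mean lies more than `threshold`
    reference standard deviations away from the reference mean."""
    flagged = {}
    for col in columns:
        ref = [row[col] for row in reference]
        new = [row[col] for row in holdout]
        spread = stdev(ref)
        if spread == 0:
            # Constant column in the reference: any difference counts as a shift.
            shift = 0.0 if mean(new) == mean(ref) else float("inf")
        else:
            shift = abs(mean(new) - mean(ref)) / spread
        if shift > threshold:
            flagged[col] = shift
    return flagged

# Hypothetical rows: "age" is distributed alike in both parts, "income" is not.
reference = [{"age": 30 + i % 10, "income": 1000 + 10 * i} for i in range(50)]
holdout = [{"age": 30 + i % 10, "income": 3000 + 10 * i} for i in range(20)]

flagged = shifted_columns(reference, holdout, ["age", "income"])
print(sorted(flagged))
```

Columns that get flagged are candidates for the kind of difference between the modelling data and the held-out portion that can explain a drop in correlation.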