Differently distributed train and test samples

Hi!
I’m building a model at work for credit scoring and I am getting a relatively low AUC. After some investigation I found that the variables in the train and test sets have very different distributions. While this is expected with real-life data (the economic environment changes), my problem still stands. And the bigger challenge is to do better than a logistic regression.
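For reference, a per-feature check along these lines is one way to see which variables shift the most between the two samples. This is just a sketch with pandas/scipy; `train_df` and `test_df` are placeholder names for the two samples:

```python
# Sketch: per-feature drift check between the train and test samples.
# `train_df` and `test_df` are placeholder names for two pandas DataFrames
# holding the same (numeric) variables.
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Two-sample Kolmogorov-Smirnov test for every shared numeric column."""
    rows = []
    shared = train_df.select_dtypes("number").columns.intersection(test_df.columns)
    for col in shared:
        stat, p_value = ks_2samp(train_df[col].dropna(), test_df[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value})
    # The largest KS statistics point to the variables that shifted the most.
    return pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
```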

Hi there!

You were trying to model it with logistic regression? You can always try another approach and compare the results. Here is an example workflow for credit scoring using different predictive models:
https://hub.knime.com/knime/workflows/Examples/50_Applications/02_Credit_Scoring/01_CreditScoring*CB0u_eLmzlghiZI2
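If you want a quick sanity check outside KNIME as well, a comparison along these lines could work. This is just a sketch with scikit-learn; the synthetic, imbalanced data stands in for your own present/future split:

```python
# Sketch: quick AUC comparison of logistic regression vs. gradient boosting.
# Synthetic, imbalanced data stands in for your own train/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("gradient boosting", GradientBoostingClassifier()),
]:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```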

Regarding different distributions of variables - how did you get your train and test sets?

Br,
Ivan

My question is strictly about the difference between the samples. Per internal guidelines, I have to use different periods for each sample.

Hi,

OK. How do you create the samples?

Br,
Ivan

I have to take different periods. The theoretical case is: I score now and expect confirmation in the future. So I test with the “future” data. And the present and future data differ in terms of distributions.
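In code terms, the split looks roughly like this (a sketch; `df`, the `scoring_date` column and the cutoff date are placeholder names):

```python
# Sketch of an out-of-time split: train on an earlier period, test on a later one.
import pandas as pd

def out_of_time_split(df: pd.DataFrame, date_col: str, cutoff: str):
    """Return (train, test): rows before the cutoff date vs. rows on/after it."""
    dates = pd.to_datetime(df[date_col])
    cutoff_ts = pd.Timestamp(cutoff)
    return df[dates < cutoff_ts], df[dates >= cutoff_ts]

# Usage (hypothetical):
#   train_df, test_df = out_of_time_split(df, "scoring_date", "2019-01-01")
```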

Hi,

I see. And is this the case for all variables? And what is the interval between the present and the future data? If it is too big, shouldn’t you train a new model?

Br,
Ivan

Should be no less than a year. :confused:

Hi there!

Just found this question with a couple of answers and a useful link, so it might help :wink:

Br,
Ivan

The thread linked by Ivan has valuable proposals. The only trick is that, for the proposed procedures to work, you need to have the future data (= the test sample) at hand. This is of course the case in your model building pipeline, since you slice existing past data and name a subset of it “the future”.

However, if you want the resulting model to be used in production, you need to think about whether all the features you use in training will be available for new, unseen data at the moment you make predictions to drive business decisions, and, if they are, whether you have the latency to re-train your model, which you will need to do because the weights will depend on the new data. Otherwise your evaluation metric will look better than the real performance.

So in some scenarios you can improve things by making your training data closer to the test data, but in others there is no way to improve and you just have to live with low model performance due to the non-stationarity of the data.
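One common way to make the training data closer to the test data is to re-weight the training rows with a “domain” classifier (adversarial-validation style). A rough sketch, with placeholder feature matrices `X_train` / `X_test` from the two periods:

```python
# Sketch: give each training row an importance weight ~ p_test(x) / p_train(x),
# estimated with a classifier that tries to tell train from test.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def covariate_shift_weights(X_train, X_test):
    """Estimate per-row density ratios p_test(x) / p_train(x)."""
    X = np.vstack([X_train, X_test])
    domain = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]  # 0 = train, 1 = test
    clf = GradientBoostingClassifier().fit(X, domain)
    p_test = clf.predict_proba(X_train)[:, 1]
    # The odds p/(1-p) estimate the density ratio up to a constant factor.
    weights = p_test / np.clip(1.0 - p_test, 1e-6, None)
    return weights / weights.mean()

# Usage (hypothetical):
#   w = covariate_shift_weights(X_train, X_test)
#   final_model.fit(X_train, y_train, sample_weight=w)
```

Keep in mind this only addresses shift in the feature distribution; if the relationship between the features and the default flag itself changes over time, re-weighting cannot fix that.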

Cheers,
Misha
