Differently distributed train and test samples

Hi!
I’m building a model at work for credit scoring and I am getting a relatively low AUC. After some investigation I found that the variables in the train and test sets have very different distributions. While this is expected with real-life data (the economic environment changes), my problem still stands. And the bigger challenge is to do better than a logistic regression.
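For reference, a per-feature check along these lines is one way to see which variables shift the most between the two samples. This is just a sketch with pandas/scipy; `train_df` and `test_df` are placeholder names for the two samples:

```python
# Sketch: per-feature drift check between the train and test samples.
# `train_df` and `test_df` are placeholder names for two pandas DataFrames
# holding the same (numeric) variables.
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Two-sample Kolmogorov-Smirnov test for every shared numeric column."""
    rows = []
    shared = train_df.select_dtypes("number").columns.intersection(test_df.columns)
    for col in shared:
        stat, p_value = ks_2samp(train_df[col].dropna(), test_df[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value})
    # The largest KS statistics point to the variables that shifted the most.
    return pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
```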

Hi there!

You were trying to model it with logistic regression? You can always try another approach and compare the results. Here is an example workflow for credit scoring using different predictive models:
https://hub.knime.com/knime/workflows/Examples/50_Applications/02_Credit_Scoring/01_CreditScoring*CB0u_eLmzlghiZI2
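If you want a quick sanity check outside KNIME as well, a comparison along these lines could work. This is just a sketch with scikit-learn; the synthetic, imbalanced data stands in for your own present/future split:

```python
# Sketch: quick AUC comparison of logistic regression vs. gradient boosting.
# Synthetic, imbalanced data stands in for your own train/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("gradient boosting", GradientBoostingClassifier()),
]:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```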

Regarding different distributions of variables - how did you get your train and test sets?

Br,
Ivan

My question is strictly about the difference between the samples. Per internal guidelines, I have to use different periods for each sample.

Hi,

OK. How do you create the samples?

Br,
Ivan

I have to take different periods. The theoretical case is: I score now and expect confirmation in the future. So I test with the “future” data. And the present and future data differ in terms of distributions.
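In code terms, the split looks roughly like this (a sketch; `df`, the `scoring_date` column and the cutoff date are placeholder names):

```python
# Sketch of an out-of-time split: train on an earlier period, test on a later one.
import pandas as pd

def out_of_time_split(df: pd.DataFrame, date_col: str, cutoff: str):
    """Return (train, test): rows before the cutoff date vs. rows on/after it."""
    dates = pd.to_datetime(df[date_col])
    cutoff_ts = pd.Timestamp(cutoff)
    return df[dates < cutoff_ts], df[dates >= cutoff_ts]

# Usage (hypothetical):
#   train_df, test_df = out_of_time_split(df, "scoring_date", "2019-01-01")
```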

Hi,

I see. And is this the case for all variables? And what is the interval between the present and the future data? If it is too big, shouldn’t you train a new model?

Br,
Ivan

Should be no less than a year. :confused:

Hi there!

Just found this question with a couple of answers and a useful link, so it might help :wink:

Br,
Ivan

The thread linked by Ivan has valuable proposals. The only trick is that, for the proposed procedures to work, you need to have the future data (= the test sample) at hand. This is of course the case in your model building pipeline, since you slice existing past data and name a subset of it “the future”.

However, if you want the resulting model to be used in production, you need to think about whether all the features you use in training will be available for new, unseen data at the moment you make predictions to drive business decisions, and, if they are, whether you have the latency to re-train your model, which you will need to do because the weights will depend on the new data. Otherwise your evaluation metric will look better than the real performance.

So in some scenarios you can improve things by making your training data closer to the test data, but in others there is no way to improve and you just have to live with low model performance due to the non-stationarity of the data.
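One common way to make the training data closer to the test data is to re-weight the training rows with a “domain” classifier (adversarial-validation style). A rough sketch, with placeholder feature matrices `X_train` / `X_test` from the two periods:

```python
# Sketch: give each training row an importance weight ~ p_test(x) / p_train(x),
# estimated with a classifier that tries to tell train from test.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def covariate_shift_weights(X_train, X_test):
    """Estimate per-row density ratios p_test(x) / p_train(x)."""
    X = np.vstack([X_train, X_test])
    domain = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]  # 0 = train, 1 = test
    clf = GradientBoostingClassifier().fit(X, domain)
    p_test = clf.predict_proba(X_train)[:, 1]
    # The odds p/(1-p) estimate the density ratio up to a constant factor.
    weights = p_test / np.clip(1.0 - p_test, 1e-6, None)
    return weights / weights.mean()

# Usage (hypothetical):
#   w = covariate_shift_weights(X_train, X_test)
#   final_model.fit(X_train, y_train, sample_weight=w)
```

Keep in mind this only addresses shift in the feature distribution; if the relationship between the features and the default flag itself changes over time, re-weighting cannot fix that.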

Cheers,
Misha
