Hello, I’m trying to understand how cross validation works.

Let’s start from one of the workflows already available among the examples: cross validation with SVM.

The original data set is divided into 5 subsets. Each time, 4 subsets are used to train the model, which is then used to predict the fifth subset. At the end of the process, the X-Aggregator reports the error for each of the five runs in the “error rates” table.

1) What does the “prediction table” report? Which of the five runs do the predictions refer to?

2) Is the whole procedure used to identify a model? If so, how can I apply it to a new set of data?

thanks!!!

Hi @Ginevra, I’ll try to explain it briefly:

- The Prediction Table concatenates the predictions for the 5 held-out subsets. So the first fifth of this table comes from the model that was trained on the other 4 subsets, and so on.
- The whole procedure identifies 5 different models. The models are not aggregated into one model. Cross validation is used to judge whether the resulting models are stable. If the five error rates are very different, this tells you that your model depends heavily on the particular data it was trained on. You may get a nice error rate when you build your model, but this is likely to change when you apply your model to future predictions.
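
To make the two points above concrete, here is a minimal sketch in plain Python of what the loop does. It is not the KNIME implementation: a simple nearest-mean classifier stands in for the SVM, and the names `cross_validate`, `prediction_table` and `error_rates` are made up for illustration. Each fold trains its own model, the held-out predictions are appended to one table, and one error rate is recorded per fold.

```python
import random

def nearest_mean_fit(xs, ys):
    """Stand-in learner (not an SVM): store the mean feature value per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def nearest_mean_predict(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def cross_validate(xs, ys, k=5, seed=0):
    """Return (prediction_table, error_rates) for k-fold cross validation."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint held-out subsets

    prediction_table = []  # rows: (row index, fold, true class, predicted class)
    error_rates = []       # one entry per fold
    for fold_id, test_idx in enumerate(folds):
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # A fresh model per fold, trained only on the other k-1 subsets.
        model = nearest_mean_fit([xs[i] for i in train_idx],
                                 [ys[i] for i in train_idx])
        wrong = 0
        for i in test_idx:
            pred = nearest_mean_predict(model, xs[i])
            prediction_table.append((i, fold_id, ys[i], pred))
            wrong += pred != ys[i]
        error_rates.append(wrong / len(test_idx))
    return prediction_table, error_rates

# Toy data (hypothetical): one numeric feature, two well-separated classes.
xs = [0.1, 0.3, 0.2, 0.4, 0.5, 2.1, 2.3, 2.2, 2.4, 2.5]
ys = ["a"] * 5 + ["b"] * 5
table, errors = cross_validate(xs, ys, k=5)
print(len(table), len(errors))
```

Every observation appears exactly once in `table`, predicted by the one model that did not see it during training, while `errors` holds five separate values to compare for stability.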

@agaunt thank you very much!! And what happens in the prediction table if I apply leave-one-out?

@Ginevra

Leave-one-out means: if you have n observations, then you build n models. Every model is trained on (n-1) observations, and the prediction is made for the one remaining observation. So the prediction table contains one row per observation, each predicted by a different model.
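
As a rough illustration, leave-one-out can be sketched in plain Python like this. Again, this is not what KNIME runs internally: the nearest-mean learner is only a stand-in for the SVM, and `leave_one_out`, `fit_means` and `predict` are hypothetical names.

```python
def fit_means(xs, ys):
    """Stand-in learner (not an SVM): mean feature value per class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def predict(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def leave_one_out(xs, ys):
    """Build n models; model i is trained without observation i and predicts only it."""
    rows = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # the other n-1 observations
        train_y = ys[:i] + ys[i + 1:]
        model = fit_means(train_x, train_y)
        rows.append((i, ys[i], predict(model, xs[i])))
    return rows

# Toy data (hypothetical): one numeric feature, two well-separated classes.
xs = [0.1, 0.3, 0.2, 2.1, 2.3, 2.2]
ys = ["a", "a", "a", "b", "b", "b"]
rows = leave_one_out(xs, ys)
error_rate = sum(true != pred for _, true, pred in rows) / len(rows)
print(len(rows), error_rate)
```

With a single held-out observation, each per-model error is either 0 or 1, so it usually makes more sense to look at the overall error rate across all n rows.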