When analyzing data, we may want to establish the nature of the relationship between two or more variables, as well as make predictions.
To do that, we run correlation and regression analyses.
Correlation concerns the strength of the relationship between the values of two variables.
Regression analysis determines the nature of that relationship and enables us, under certain conditions, to make predictions from it.
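As a minimal sketch, here is how one might compute a correlation coefficient and fit a simple linear regression with NumPy; the data values below are made up purely for illustration.

```python
import numpy as np

# Made-up (x, y) pairs, purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Correlation: how strongly the values of x and y move together.
r = np.corrcoef(x, y)[0, 1]

# Regression: the nature of the relationship, here a straight line y = a*x + b.
a, b = np.polyfit(x, y, deg=1)

# Prediction from the fitted line at a new point.
y_at_7 = a * 7.0 + b
print(f"correlation r = {r:.3f}, slope = {a:.3f}, intercept = {b:.3f}, "
      f"prediction at x = 7: {y_at_7:.2f}")
```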
Which regression model?
But there are many possible regression models, from simple to highly complex ones, in increasing order of complexity:
– Linear models
– Higher-order polynomial models
– Multivariate models
– Etc.
Given that we have more than one candidate model, which one is best? Is a bigger model always better? Is the model with the smallest error the best?
As we increase the complexity, the model fits the data better and the error decreases. But there may be an intermediate model that is almost as good while being less complex, as the sketch below suggests.
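A quick sketch of this trade-off, using simulated data (an assumed linear process plus noise, not real measurements): polynomials of increasing degree are fitted to the same points, and the error on those points keeps shrinking as the degree grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + 0.3 * rng.standard_normal(x.size)  # a truly linear process plus noise

# Fit polynomials of increasing degree and compare their errors on the same data.
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")

# The error keeps decreasing with the degree, even though the degree-1 model
# is the one that matches how the data was generated.
```

The most complex model wins on this error measure only because it is allowed to chase the noise; nothing here tells us whether it would predict new points well.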
A cure is validation. Validation simply means running the model on a new set of data that has not been used in the learning phase.
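One way to picture this, assuming we can obtain a second batch of observations from the same process (simulated below, since this is only a sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Simulate (x, y) pairs from the same underlying linear process."""
    x = rng.uniform(0, 1, n)
    y = 2 * x + 0.3 * rng.standard_normal(n)
    return x, y

x_learn, y_learn = make_data(30)  # data used in the learning phase
x_new, y_new = make_data(30)      # new data, never seen during fitting

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_learn, y_learn, deg=degree)
    learn_mse = np.mean((y_learn - np.polyval(coeffs, x_learn)) ** 2)
    valid_mse = np.mean((y_new - np.polyval(coeffs, x_new)) ** 2)
    print(f"degree {degree}: learning MSE = {learn_mse:.4f}, "
          f"validation MSE = {valid_mse:.4f}")

# The learning error keeps dropping with complexity, while the error on the
# unseen data typically exposes the overfitted high-degree model.
```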
Validation is the cure, but most often we don't have the luxury of extra data.
Another solution is cross-validation.
To do that, we hold aside a random subset of the data as a first test set, then a second random subset as a second test set.
We repeat the procedure again and again with other subsets of the data.
Each time, we calculate an error. The cross-validation error is then summarized by the mean and the standard deviation of all these errors.
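A possible sketch of this procedure in Python, again on simulated data; the number of repetitions and the size of each held-out subset are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 40)
y = 2 * x + 0.3 * rng.standard_normal(x.size)

def cross_val_error(x, y, degree, n_repeats=20, holdout=10):
    """Repeatedly hold aside a random subset, fit on the remaining points,
    and collect the error measured on the held-out subset."""
    errors = []
    for _ in range(n_repeats):
        idx = rng.permutation(x.size)
        test, train = idx[:holdout], idx[holdout:]
        coeffs = np.polyfit(x[train], y[train], deg=degree)
        errors.append(np.mean((y[test] - np.polyval(coeffs, x[test])) ** 2))
    # Cross-validation error: mean and standard deviation of all the errors.
    return np.mean(errors), np.std(errors)

for degree in (1, 3, 9):
    mean_err, std_err = cross_val_error(x, y, degree)
    print(f"degree {degree}: cross-validation error = {mean_err:.4f} "
          f"+/- {std_err:.4f}")
```

Comparing these means, together with their spread, across candidate models gives a fairer basis for choosing than the error on the learning data alone.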