10 Many models

In this lesson we will do something naughty: we will fit a huge number of simple models to a data set. You may recall from your basic courses in statistics and econometrics that, as a general rule, we need some sort of plan before we try to build, for instance, a linear regression model to explain variation in Y using a (generally unknown) subset of the explanatory variables X_1, \ldots, X_p. We do not want to just fit models until we get something interesting (tempting as it may be), because the results may then be entirely coincidental. See this web page for several absurd examples of spurious correlations, found by trawling the web for time series that happen to look similar.
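To see how easily such coincidences arise, here is a minimal sketch in base R. It is not part of the lesson script; the simulated series and the names `y`, `candidates` and `cors` are assumptions made purely for illustration. We generate one random walk as the "target" and then trawl through many unrelated random walks, looking for the one that correlates most strongly with it.

```r
# Illustrative simulation only: all series are independent by construction,
# so any strong correlation we find is pure coincidence.
set.seed(1)

n_obs <- 50          # length of each time series
n_candidates <- 1000 # number of unrelated series we "trawl" through

# The series we pretend to explain: a simple random walk.
y <- cumsum(rnorm(n_obs))

# A matrix with one independent random walk per column.
candidates <- replicate(n_candidates, cumsum(rnorm(n_obs)))

# Correlation between y and every candidate series.
cors <- as.vector(cor(y, candidates))

# The strongest correlation found by searching, and a summary of the typical ones.
max(abs(cors))
summary(abs(cors))
```

If you run this, compare the largest absolute correlation with the typical ones: searching over many candidates and keeping the "best" tends to produce an impressive-looking relationship even though none of the series has anything to do with `y`.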

We will of course not in any way suggest that the algorithm

while(model not interesting) {keep looking}

is anything other than cheating. We will, however, explore the opposite extreme from conservative, inferential regression model building: what if we fit all possible models? Everything. This may yield some interesting insights, because we then treat the estimated models as derived data points that can be analyzed statistically in their own right, rather than as the end result of a model-building exercise.
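As a rough sketch of what "fit everything and treat the models as data" can look like, the base R code below fits a linear regression for every non-empty subset of a few predictors and stores one row per model. The built-in `mtcars` data, the choice of predictors, and the helper `fit_one()` are assumptions for illustration only; the lesson script may use different data and tools.

```r
# Hypothetical illustration: mtcars and these predictors are stand-ins,
# not the data set used in the lesson.
predictors <- c("wt", "hp", "qsec", "drat")

# Enumerate every non-empty subset of the predictors (2^4 - 1 = 15 subsets).
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model for a given subset and return a one-row summary.
fit_one <- function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  data.frame(
    formula = paste("mpg ~", paste(vars, collapse = " + ")),
    n_terms = length(vars),
    adj_r2  = summary(fit)$adj.r.squared,
    aic     = AIC(fit)
  )
}

# One row per fitted model: the models themselves are now a data set.
all_models <- do.call(rbind, lapply(subsets, fit_one))

# Sort, plot, or summarise the models like any other data frame.
all_models[order(-all_models$adj_r2), ]
```

The point of the design is that `all_models` is an ordinary data frame, so questions such as "how does fit quality vary with the number of terms?" become standard data analysis tasks rather than a hunt for one winning model.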

This lesson follows Chapter 25 of the first edition of R for Data Science by Hadley Wickham and Garrett Grolemund (the chapter was removed from the second edition). In the video window at the bottom of this page we go through the main script file for this lesson. At the end of the video we also discuss the assignment questions for this week (you will notice that the video was not recorded this year, but the main content is the same). The script for the lesson is available in this repository.

If you want to explore these ideas a bit more conceptually, you can read the paper "I Just Ran Four Million Regressions" by Xavier X. Sala-i-Martin.