I was recently inspired by the following PyData London talk by Vincent Warmerdam. It’s a great talk: he has a lot of neat tricks to make simple, small-brain models really work wonders, and he emphasizes thinking about your problem in a logical way over using cutting-edge (TensorFlow) or hyped-up (deep learning) methods just for the sake of using them - something I’m amazed that people seem to need to be reminded of.

One of my favorite tricks was the first one he discussed: extracting and forecasting the seasonality of sales of some product, just by using linear regression (and some other neat but ultimately simple tricks).

That’s when I started feeling guilty about not really grokking linear regression. It sounds stupid to say, but I’ve never managed to truly understand it in any of my studies. The presentation always seemed very canned, each topic coming out like a sardine: packed so close together, yet slipping from your hands whenever you try to pick one up.

So what I’ve done is take the time to really dig into the math and explain how all of this linear regression stuff hangs together, trying (and only partially succeeding) not to mention any domain-specific names. This post will hopefully be helpful for people who have had some exposure to linear regression before, have some fuzzy recollection of what it might be, but really want to see how everything fits together.

There’s going to be a fair amount of math (enough to properly explain the gist of linear regression), but I’m really not emphasizing proofs here, and I’ll even downplay explanations of the more advanced concepts, in favor of explaining the various flavors of linear regression and how everything hangs together.

So Uh, What is Linear Regression?

The basic idea is this: we have some number that we’re interested in. This number could be the price of a stock, the number of stars a restaurant has on Yelp… Let’s denote this number-that-we-are-interested-in by the letter $y$. Occasionally, we may have multiple observations for $y$ (e.g. we monitored the price of the stock over many days, or we surveyed many restaurants in a neighborhood). In this case, we stack these values of $y$ and consider them as a single vector: $\mathbf{y}$. To be explicit, if we have $n$ observations of $y$, then $\mathbf{y}$ will be an $n$-dimensional vector.

We also have some other numbers that we think are related to $y$. More explicitly, we have some other numbers that we suspect tell us something about $y$. For example (in each of the above scenarios), they could be how the stock market is doing, or the average price of the food at this restaurant. Let us denote these numbers-that-tell-us-something-about-$y$ by the letter $x$. So if we have $p$ such numbers, we’d call them $x_1, x_2, \ldots, x_p$. Again, we occasionally have multiple observations: in which case, we arrange the values into an $n \times p$ matrix which we call $\mathbf{X}$.

If we have this setup, linear regression simply tells us that $y$ is a weighted sum of the $x_i$s, plus some constant term. Easier to show you:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

where the $\beta_0$ and $\beta_i$s are all scalars to be determined, and $\varepsilon$ is an error term (a.k.a. the residual).

Note that we can pull the same stacking trick here: the $y$s will become an $n$-dimensional vector, $\mathbf{y}$, and similarly for the $\varepsilon$s. Note that $\boldsymbol{\beta}$ remains common throughout all observations.

If we consider $n$ different observations, we can write the equation much more succinctly if we simply prepend a column of $1$s to the $\mathbf{X}$ matrix and prepend an extra element (what used to be the $\beta_0$) to the $\boldsymbol{\beta}$ vector.

Then the equation can be written as:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
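
If the stacking and the column-of-$1$s trick feel abstract, here’s a minimal numpy sketch with some made-up numbers (the data is purely illustrative):

```python
import numpy as np

# Made-up example: n = 5 observations, p = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.1, 3.9, 5.2, 7.8, 8.1])

# Prepend a column of 1s so the intercept gets folded into beta.
X_design = np.column_stack([np.ones(len(y)), X])

# Now the model is simply y = X_design @ beta + eps,
# where beta is a (p + 1)-dimensional vector.
print(X_design.shape)  # (5, 3)
```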

That’s it. The hard part (and the whole zoo of different kinds of linear regressions) now comes from two questions:

  1. What can we assume, and more importantly, what can’t we assume about $\mathbf{X}$ and $\boldsymbol{\varepsilon}$?
  2. Given $\mathbf{y}$ and $\mathbf{X}$, how exactly do we find $\boldsymbol{\beta}$ and $\boldsymbol{\varepsilon}$?

The Small-Brain Solution: Ordinary Least Squares

This section is mostly just a re-packaging of what you could find in any introductory statistics book, just in fewer words.

Instead of futzing around with whether or not we have multiple observations, let’s just assume we have $n$ observations: we can always set $n = 1$ if that’s the case. So,

  • Let $\mathbf{y}$ and $\boldsymbol{\varepsilon}$ be $n$-dimensional vectors
  • Let $\mathbf{X}$ be an $n \times p$ matrix

The simplest, small-brain way of getting our parameter $\boldsymbol{\beta}$ is by minimizing the sum of squares of the residuals:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2$$

Our estimate for $\boldsymbol{\beta}$ then has a “miraculous” closed-form solution¹ given by:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

This solution is so (in)famous that it has been blessed with a fairly universal name - though cursed with the unimpressive one of ordinary least squares (a.k.a. OLS).
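
To make the formula concrete, here’s a minimal numpy sketch on made-up data; in practice you’d hand the problem to a solver like np.linalg.lstsq (or sklearn’s LinearRegression) rather than explicitly inverting $\mathbf{X}^\top \mathbf{X}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept column + 2 features
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=50)

# The "miraculous" closed form: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer, same answer: solve the least squares problem directly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)     # should be close to [1.0, 2.0, -0.5]
print(beta_lstsq)
```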

If you have a bit of mathematical statistics under your belt, it’s worth noting that the least squares estimate for $\boldsymbol{\beta}$ has a load of nice statistical properties. It has a simple closed-form solution, where the trickiest thing is a matrix inversion: hardly asking for a computational miracle. If we can assume that $\boldsymbol{\varepsilon}$ is zero-mean Gaussian, the least squares estimate is the maximum likelihood estimate. Even better, if the errors are uncorrelated and homoskedastic, then the least squares estimate is the best linear unbiased estimator. Basically, this is very nice. If most of that flew over your head, don’t worry - in fact, forget I said anything at all.

Why the Small-Brain Solution Sucks

There are a ton of reasons. Here, I’ll just highlight a few.

  1. Susceptibility to outliers
  2. Assumption of homoskedasticity
  3. Collinearity in features
  4. Too many features

Points 1 and 2 are specific to the method of ordinary least squares, while 3 and 4 are just suckish things about linear regression in general.

Outliers

The OLS estimate for $\boldsymbol{\beta}$ is famously susceptible to outliers. As an example, consider the third data set in Anscombe’s quartet. That is, the data is almost a perfect line, but one data point is a clear outlier. That single data point pulls the entire regression line closer to it, which means it fits the rest of the data worse, in order to accommodate that single outlier.
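
If you want to see the effect without digging up Anscombe’s actual numbers, here’s a sketch on made-up data: points that lie exactly on a line, plus one point that’s wildly off.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0                # a perfect line...
y_outlier = y.copy()
y_outlier[-1] += 30.0            # ...except the last point is wildly off

def ols_fit(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols_fit(x, y))          # [1., 2.], as expected
print(ols_fit(x, y_outlier))  # intercept and slope both noticeably distorted
```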

Heteroskedasticity and Correlated Residuals

Baked into the OLS estimate is an implicit assumption that the $\varepsilon_i$s all have the same variance. That is, the amount of noise in our data is independent of what region of our feature space we’re in. However, this is usually not a great assumption. For example, harking back to our stock price and Yelp rating examples, this assumption states that the price of a stock fluctuates just as much in the hour before lunch as it does in the last 5 minutes before market close, or that Michelin-starred restaurants have as much variation in their Yelp ratings as do local coffee shops.

Even worse: not only can the residuals have different variances, but they may also even be correlated! There’s no reason why this can’t be the case. Going back to the stock price example, we know that high-volatility regimes introduce much higher noise in the price of a stock, and volatility regimes tend to stay fairly constant over time (notwithstanding structural breaks), which means that the level of volatility (i.e. noise, or residual) suffers very high autocorrelation.

The long and short of this is that some points in our training data are more likely to be impaired by noise and/or correlation than others, which means that some points in our training set are more reliable/valuable than others. We don’t want to ignore the less reliable points completely, but they should count less in our computation of $\hat{\boldsymbol{\beta}}$ than points that come from regions of space with less noise, or that are not impaired as much by correlation.

Collinearity

Collinearity (or multi-collinearity) is just a fancy way of saying that our features are correlated. In the worst case, suppose that two of the columns in our $\mathbf{X}$ matrix are identical: that is, we have repeated data. Then, bad things happen: the matrix $\mathbf{X}^\top \mathbf{X}$ no longer has full rank (or at least, becomes ill-conditioned), which means inverting it becomes an extremely sensitive operation that is liable to give you nonsensically large or small regression coefficients, which will hurt model performance.
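
Here’s a quick numpy sketch of that failure mode (made-up data): duplicate a column up to a tiny perturbation, and both the condition number of $\mathbf{X}^\top \mathbf{X}$ and the fitted coefficients blow up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-6, size=n)   # nearly identical to x1
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])

# X'X is nearly singular, so its condition number is enormous...
print(np.linalg.cond(X.T @ X))

# ...and the OLS coefficients on x1 and x2 typically come out huge and
# opposite-signed, even though together they should add up to about 3.
print(np.linalg.lstsq(X, y, rcond=None)[0])
```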

Too Many Features

Having more data may be a good thing, but more specifically, having more observations is a good thing. Having more features might not be a great thing. In the extreme case, if you have more features than observations (i.e. $p > n$), then the OLS estimate of $\boldsymbol{\beta}$ generally fails to be unique. In fact, as you add more and more features to your model, you will find that model performance will begin to degrade long before you reach this point where $p > n$.
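
A quick sketch of the $p > n$ situation (again with made-up data): $\mathbf{X}^\top \mathbf{X}$ can’t have full rank, so the normal equations no longer pin down a unique $\boldsymbol{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 20                      # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# rank(X'X) is at most n < p, so it cannot be inverted.
print(np.linalg.matrix_rank(X.T @ X))   # 10, not 20

# np.linalg.lstsq still returns *an* answer (the minimum-norm solution via the
# pseudoinverse), but it is only one of infinitely many that fit perfectly.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(X @ beta, y))         # True: zero residual when p > n
```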

Expanding-Brain Solutions

Here I’ll discuss some add-ons and plugins you can use to upgrade your Ordinary Least Squares Linear Regression™ to cope with the four problems I described above.

Heteroskedasticity and Correlated Residuals

To cope with different levels of noise, we can turn to generalized least squares (a.k.a. GLS), which is basically a better version of ordinary least squares. A little bit of math jargon lets us explain GLS very concisely: instead of minimizing the Euclidean norm of the residuals, we minimize their Mahalanobis norm. In this way, we take into account the second-moment structure of the residuals, which allows us to put more weight on the more valuable data points (i.e. those not impaired by noise or correlation).

Mathematically, the OLS estimate is given by

$$\hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

whereas the GLS estimate is given by

$$\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^\top \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\Omega}^{-1} \mathbf{y}$$

where $\boldsymbol{\Omega}$ is the known covariance matrix of the residuals.
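
Here’s a minimal numpy sketch of the two formulas side by side, pretending (purely for illustration) that we know $\boldsymbol{\Omega}$; in real life you’d reach for something like statsmodels’ GLS rather than writing the matrix algebra by hand.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([1.0, 2.0])

# Heteroskedastic noise: variance grows with the feature (made-up structure).
sigmas = 0.1 + np.abs(X[:, 1])
Omega = np.diag(sigmas**2)                 # known residual covariance (here: diagonal)
y = X @ true_beta + rng.normal(scale=sigmas)

Omega_inv = np.linalg.inv(Omega)

# OLS: (X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

print(beta_ols)  # still unbiased, but noisier
print(beta_gls)  # typically closer to [1.0, 2.0]
```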

Now, the GLS estimator enjoys a lot of statistical properties: it is unbiased, consistent, efficient, and asymptotically normal. Basically, this is very very nice.

In practice though, since $\boldsymbol{\Omega}$ is usually not known, approximate methods (such as weighted least squares, or feasible generalized least squares), which attempt to estimate the optimal weight for each training point, are used instead. One thing that I found interesting while researching this was that these methods, even though they attempt to approximate something better than OLS, may end up performing worse than OLS! More precisely: while these approximate estimators are asymptotically more efficient, for small or medium-sized data sets they can end up being less efficient than OLS. This is why some authors prefer to just use OLS and find some other way to estimate the variance of the estimator (where that other way is, of course, robust to heteroskedasticity or correlation).

Outliers

Recall that OLS minimizes the sum of squares of the residuals:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2$$

A regularized estimation scheme adds a penalty term on the size of the coefficients:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \lambda \, C(\boldsymbol{\beta})$$

where $C(\boldsymbol{\beta})$ is some function of $\boldsymbol{\beta}$ (and $\lambda > 0$ controls how heavily we penalize). Common choices for $C$ are:

  • The $L_2$ norm: $C(\boldsymbol{\beta}) = \lVert \boldsymbol{\beta} \rVert_2^2$ (this gives ridge regression)

  • The $L_1$ norm: $C(\boldsymbol{\beta}) = \lVert \boldsymbol{\beta} \rVert_1$ (this gives the lasso)

  • Interpolating between the first two options: $C(\boldsymbol{\beta}) = \alpha \lVert \boldsymbol{\beta} \rVert_1 + (1 - \alpha) \lVert \boldsymbol{\beta} \rVert_2^2$, where $0 \le \alpha \le 1$ (this gives the elastic net; a quick sklearn sketch of all three follows below)
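
For concreteness, here’s a minimal sklearn sketch of the three penalties (they go by the names Ridge, Lasso and ElasticNet there); the data is made up and the penalty strengths are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

# alpha controls the overall penalty strength; l1_ratio interpolates between L1 and L2.
models = {
    "ridge (L2)": Ridge(alpha=1.0),
    "lasso (L1)": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))
```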

While regularized regression has empirically been found to be more resilient to outliers, it comes at a cost: the regression coefficients lose their nice interpretation of “the effect on the regressand of increasing this regressor by one unit”. Indeed, regularization can be thought of as telling the universe: “I don’t care about interpreting the regression coefficients, so long as I get a reasonable fit that is robust to overfitting”. For this reason, regularization is usually used for prediction problems, and not for inference.

An alternative solution would be to apply some pre-processing to our data: for example, running some anomaly detection on our data points could remove outliers before we fit our linear regression. However, this method comes with its own problems - what if it removes the wrong points? That has the potential to really mess up our model.

The main takeaway, then, is that outliers kinda just suck.

Collinearity

Collinearity is a problem that comes and goes - sometimes it’s there, other times not - and it’s better to always pretend it’s there than it is to risk forgetting about it.

There are many ways to detect multicollinearity, many ways to remedy it and many consequences if you don’t. The Wikipedia page is pretty good at outlining all of those, so I’ll just point to that.

An alternative that Wikipedia doesn’t mention is principal components regression (PCR), which is literally just principal components analysis followed by ordinary least squares. As you can imagine, by throwing away some of the lower-variance components, you can usually remove some of the collinearity. However, this comes at the cost of interpretability: there is no easy way to intuit the meaning of a principal component.
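
Since PCR is literally just PCA followed by OLS, it’s a two-step pipeline in sklearn; the number of components to keep (here, 2) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=200)   # two nearly collinear features
y = X[:, 0] + rng.normal(scale=0.1, size=200)

# Principal components regression: project onto the top components, then run OLS.
pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print(pcr.predict(X[:5]))
```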

A more sophisticated approach would be a close cousin of PCR: partial least squares regression. It’s a bit more mathematically involved, and I definitely don’t have the time to do it full justice here. Google!

Too Many Features

Having too many features to choose from sounds like the first-world problem of data science, but it opens up the whole world of high-dimensional statistics and feature selection. There are a lot of techniques at your disposal to winnow down the number of features here, but the one most closely related to linear regression is least angle regression (a.k.a. LAR or LARS). It’s an iterative process that grows the regression coefficients of whichever features are most correlated with the target, increasing (or decreasing) them until some other feature looks like it has more explanatory power (i.e. is more correlated with the current residual). Like so many other concepts in this post, I can’t properly do LAR justice in such a short space, but hopefully the idea comes through.
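
sklearn ships an implementation of least angle regression; here’s a minimal sketch on made-up data, where only 2 of 20 candidate features actually matter (the cap on non-zero coefficients is an arbitrary choice).

```python
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))                                         # 20 candidate features...
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + rng.normal(scale=0.1, size=100)    # ...only 2 matter

# Stop after at most 5 features have entered the model.
lars = Lars(n_nonzero_coefs=5)
lars.fit(X, y)

print(np.nonzero(lars.coef_)[0])   # indices of the selected features (should include 0 and 5)
print(np.round(lars.coef_[lars.coef_ != 0], 2))
```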

Of course, there are other methods for feature selection too: you can run a regularized regression to force most of the features to have zero or near-zero coefficients, or you could use any of the tools in sklearn.feature_selection.
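
As a sketch of that last point, here’s one way to wire a regularized regression into sklearn’s feature-selection machinery (the penalty strength is arbitrary):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + rng.normal(scale=0.1, size=100)

# Fit a lasso and keep only the features with non-zero coefficients.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(selector.get_support(indices=True))   # hopefully just [0, 5]
X_reduced = selector.transform(X)
print(X_reduced.shape)
```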

Now What?

So that was pretty rushed and a bit hand-wavy, but hopefully it gave you a high-level view of what linear regression is, how all these other flavors of linear regression differ from ordinary least squares, and how each was made to remedy specific shortcomings of OLS.

And it should come as no surprise that there are even more directions to take the concept of linear regression: generalized linear models (a.k.a. GLMs) allow you to model different kinds of variables (e.g. what if $y$ is a binary response, instead of a continuous variable?), and Bayesian linear regression offers an amazing way to quantify the uncertainty in your coefficients. Big world; happy hunting!


  1. Insert obligatory footnote here about the Moore-Penrose inverse a.k.a. the pseudoinverse