A Beautiful Regression Formula

Lost in Translation between R and Python 4

Hello statistics aficionados

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

The last one was a deep dive into historic mortality rates.

No Covid-19, no public data for a change: This post focusses on a real beauty, namely a decomposition of the R-squared in a linear regression model

$$E(y) = \alpha + \sum_{j = 1}^p x_j \beta_j$$

fitted by least-squares. If the response y and all p covariables are standardized to variance 1 beforehand, then the R-squared can be obtained as the cross-product of the fitted coefficients and the usual correlations between each covariable and the response:

$$R^2 = \sum_{j = 1}^p \text{cor}(y, x_j)\hat\beta_j.$$

Two elegant derivations can be found in this answer to the same question, written by the number 1 contributor to Cross Validated: whuber. Look up a couple of his posts, and statistics will suddenly feel super easy and clear.

Direct consequences of the formula are:

  1. If a covariable is uncorrelated with the response, its term in the sum is zero, so it adds nothing to the R-squared by itself, neither positively nor negatively. This is not obvious.
  2. A correlated covariable only improves R-squared if its coefficient is non-zero. Put differently: if the effect of a covariable is already fully covered by the other covariables, it does not improve the R-squared. This is somewhat obvious.

Note that all formulas refer to in-sample calculations.
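The first consequence can be checked numerically. The following is a minimal sketch on simulated data, not part of the iris example; the seed, the sample size and the helper `standardize` are illustrative assumptions. It builds a covariable `x2` whose sample correlation with the response is zero by construction and shows that its term in the sum vanishes, while the sum of all terms still equals the in-sample R-squared.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen for reproducibility
n = 200                         # arbitrary sample size


def standardize(v):
    """Center a vector and scale it to unit sample variance."""
    return (v - v.mean()) / v.std(ddof=1)


# Response depends on x1 plus noise; both are standardized afterwards
x1_raw = rng.normal(size=n)
y = standardize(x1_raw + rng.normal(size=n))
x1 = standardize(x1_raw)

# x2 is related to x1, but its projection onto y is removed,
# so its sample correlation with y is zero up to rounding
z = x1 + rng.normal(size=n)
z = z - z.mean()
x2 = standardize(z - (z @ y) / (y @ y) * y)

# Least-squares fit; no intercept is needed because everything is centered
X = np.column_stack([x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# One term cor(y, x_j) * beta_j per covariable
cors = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
terms = cors * beta

# In-sample R-squared computed the usual way
resid = y - X @ beta
r2 = 1 - (resid @ resid) / (y @ y)

print("terms:", terms)
print("sum of terms:", terms.sum(), "R-squared:", r2)
```

Note that the coefficient of `x2` need not be zero here; only its product with the (zero) correlation is.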

Since we do not want to bore you with math, we simply demonstrate the result with short R and Python code based on the famous iris dataset.

R
Python
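The original code blocks are not reproduced here, so the following Python sketch only illustrates the idea. It assumes scikit-learn's bundled copy of the iris data and, as an assumption of this sketch, uses sepal width as the response and the other three measurements as covariables.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Columns: sepal length, sepal width, petal length, petal width (all in cm)
data = load_iris().data

# Assumption of this sketch: sepal width is the response
y = data[:, 1]
X = np.delete(data, 1, axis=1)


def standardize(a):
    """Center each column and scale it to unit sample variance."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


y_std = standardize(y)
X_std = standardize(X)

# Least-squares fit on the standardized data
fit = LinearRegression().fit(X_std, y_std)
r2 = fit.score(X_std, y_std)  # in-sample R-squared

# Usual correlations between each covariable and the response
cors = np.array(
    [np.corrcoef(X_std[:, j], y_std)[0, 1] for j in range(X_std.shape[1])]
)

# Cross-product of correlations and fitted coefficients
decomposition = float(cors @ fit.coef_)

print(f"R-squared:         {r2:.4f}")
print(f"Sum of cor * beta: {decomposition:.4f}")
```

Both printed numbers should agree up to floating-point error.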

Indeed: the cross-product of coefficients and correlations equals the R-squared of 52%.

The Python notebook and R code can be found at:
