Lost in Translation between R and Python 4
Hello statistics aficionados
This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.
The last one was a deep dive into historic mortality rates.
No Covid-19, no public data for a change: This post focusses on a real beauty, namely a decomposition of the R-squared in a linear regression model
fitted by least-squares. If the response y and all p covariables are standardized to variance 1 beforehand, then the R-squared can be obtained as the cross-product of the fitted coefficients and the usual correlations between each covariable and the response:
Two elegant derivations can be found in this answer to the same question, written by the number 1 contributor to crossvalidated: whuber. Look up a couple of his posts – and statistics will suddenly feel super easy and clear.
Direct consequences of the formula are:
- If a covariable is uncorrelated with the response, it cannot contribute to the R-squared, i.e. neither improve nor worsen. This is not obvious.
- A correlated covariable only improves R-squared if its coefficient is non-zero. Put differently: if the effect of a covariable is already fully covered by the other covariables, it does not improve the R-squared. This is somewhat obvious.
Note that all formulas refer to in-sample calculations.
Since we do not want to bore you with math, we simply demonstrate the result with short R and Python codes based on the famous iris dataset.
y <- "Sepal.Width"
x <- c("Sepal.Length", "Petal.Length", "Petal.Width")
# Scaled version of iris
iris2 <- data.frame(scale(iris[c(y, x)]))
# Fit model
fit <- lm(reformulate(x, y), data = iris2)
summary(fit) # multiple R-squared: 0.524
(betas <- coef(fit)[x])
# Sepal.Length Petal.Length Petal.Width
# 1.1533143 -2.3734841 0.9758767
# Correlations (scaling does not matter here)
(cors <- cor(iris[, y], iris[x]))
# Sepal.Length Petal.Length Petal.Width
# -0.1175698 -0.4284401 -0.3661259
# The R-squared?
sum(betas * cors) # 0.524
# Import packages
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Load data
iris = datasets.load_iris(as_frame=True).data
print("The data:", iris.head(3), sep = "\n")
# Specify response
yvar = "sepal width (cm)"
# Correlations of everyone with response
cors = iris.corrwith(iris[yvar]).drop(yvar)
print("\nCorrelations:", cors, sep = "\n")
# Prepare scaled response and covariables
X = StandardScaler().fit_transform(iris.drop(yvar, axis=1))
y = StandardScaler().fit_transform(iris[[yvar]])
# Fit linear regression
OLS = LinearRegression().fit(X, y)
betas = OLS.coef_[0]
print("\nScaled coefs:", betas, sep = "\n")
# R-squared via scikit-learn: 0.524
print(f"\nUsual R-squared:\t {OLS.score(X, y): .3f}")
# R-squared via decomposition: 0.524
rsquared = betas @ cors.values
print(f"Applying the formula:\t {rsquared: .3f}")
Indeed: the cross-product of coefficients and correlations equals the R-squared of 52%.
The Python notebook and R code can be found at:
Leave a Reply