Publications by Nina Zumel
Cohen’s D for Experimental Planning
In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments. Estimating sample size Let’s imagine you are testing a new weight loss program and comparing it to some existing weight loss regimen. You want to run an experiment to determine if the new program is more effective than the old one. You’ll put a contro...
8068 sym R (3518 sym/13 pcs) 8 img
Link Functions versus Data Transforms
In the linear regression section of our book Practical Data Science with R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for not regressing directly against inco...
6081 sym R (3066 sym/12 pcs) 2 img 15 tbl
Common Ensemble Models can be Biased
In our previous article, we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved. However, when making predictions on individuals, a biased model may be pref...
3254 sym R (3257 sym/9 pcs) 2 img 11 tbl
An Ad-hoc Method for Calibrating Uncalibrated Models
In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more...
6383 sym R (3725 sym/14 pcs) 2 img 15 tbl
WVPlots 1.1.2 on CRAN
I have put a new release of the WVPlots package up on CRAN. This release adds palette and/or color controls to most of the plotting functions in the package. WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to turn into “one-liners.” A consequence of this is that t...
1797 sym R (912 sym/5 pcs) 10 img
Why Do We Plot Predictions on the x-axis?
When studying regression models, one of the first diagnostic plots most students learn is to plot residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example. # build an "ideal" linear process. set.seed(34524) N = 100 x1 = runif(N) x2 = runif(N) noise = 0.25*rnorm(N) y = x1 + x2 + noise df = ...
4949 sym R (2332 sym/5 pcs) 14 img
When Cross-Validation is More Powerful than Regularization
Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller coefficients are less sensit...
7516 sym R (4810 sym/12 pcs) 14 img 2 tbl
Monitoring for Changes in Distribution with Resampling Tests
A client recently came to us with a question: what’s a good way to monitor data or model output for changes? That is, how can you tell if new data is distributed differently from previous data, or if the distribution of scores returned by a model has changed? This client, like many others who have faced the same problem, simply checked whether...
12275 sym R (913 sym/5 pcs) 12 img 2 tbl
Linear and Logistic Regression in Practical Data Science with R 2nd Edition
One of the chapters that we are especially proud of in Practical Data Science with R is Chapter 7, “Linear and Logistic Regression.” We worked really hard to explain the fundamental principles behind both methods in a clear and easy-to-understand form, and to document diagnostics returned by the R implementations of lm and glm. For the second...
1021 sym
Exploring the XI Correlation Coefficient
Nina Zumel Recently, we’ve been reading about a new correlation coefficient, \(\xi\) (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The \(\xi\) coefficient has the following properties: If \(y\) is a function of \(x\), then \(\xi\) goes to 1 asymptotically as \(n\) (the nu...
7847 sym Python (1379 sym/4 pcs) 8 img 1 tbl