Publications by Nina Zumel
Scatterplot matrices (pair plots) with cdata and ggplot2
In my previous post, I showed how to use the cdata package along with ggplot2’s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot? A pairs plot compactly plots every (numeric) variable in a dataset against every other o...
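To make the idea concrete, here is a minimal sketch of a ggplot2 pairs plot built from a "long" frame of all variable pairs. The post builds this frame with cdata; the reshaping below uses only base R, so treat it as an illustration of the technique rather than the post's actual code.

library(ggplot2)

# Build a long frame pairing every numeric iris variable with every other.
numeric_vars <- names(iris)[1:4]
pairs_frame <- do.call(rbind, lapply(numeric_vars, function(xv) {
  do.call(rbind, lapply(setdiff(numeric_vars, xv), function(yv) {
    data.frame(xvar = xv, yvar = yv,
               x = iris[[xv]], y = iris[[yv]],
               Species = iris$Species)
  }))
}))

# One scatterplot per (xvar, yvar) cell: a ggplot2 scatterplot matrix.
ggplot(pairs_frame, aes(x = x, y = y, color = Species)) +
  geom_point() +
  facet_grid(yvar ~ xvar, scales = "free")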
More on sigr
If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form:

model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
(smod_lm <- summary(model_lm))

##
## Call:
## lm(formula = Petal.Length ~ Sepal.L...
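For reference, those statistics can also be read straight off the summary object, and sigr can wrap them for reporting. A minimal sketch (the render() formatting options are per my reading of the sigr documentation; check your installed version):

# The summary object carries the statistics directly.
model_lm <- lm(Petal.Length ~ Sepal.Length, data = iris)
smod_lm <- summary(model_lm)
smod_lm$r.squared    # the R-squared
smod_lm$fstatistic   # the F statistic and its degrees of freedom

# sigr wraps the model's F-test and formats it for inclusion in a report.
library(sigr)
cat(render(wrapFTest(model_lm), format = "markdown"))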
PDSwR2: New Chapters!
We have two new chapters of Practical Data Science with R, Second Edition online and available for review! The newly available chapters cover: Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more....
Cohen’s D for Experimental Planning
In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments.

Estimating sample size

Let’s imagine you are testing a new weight loss program and comparing it to some existing weight loss regimen. You want to run an experiment to determine if the new program is more effective than the old one. You’ll put a contro...
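Standard power calculations take Cohen’s D as their effect-size input. As a point of reference, here is a minimal sketch using the pwr package; the effect size, power, and significance level below are illustrative choices, not numbers from the post.

library(pwr)

# Sample size per group to detect a "small" effect (Cohen's d = 0.2)
# at 80% power and a 5% significance level, via a two-sample t-test.
pwr.t.test(d = 0.2, power = 0.80, sig.level = 0.05,
           type = "two.sample", alternative = "two.sided")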
Link Functions versus Data Transforms
In the linear regression section of our book Practical Data Science with R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for not regressing directly against inco...
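The distinction between transforming the data and using a link function can be made concrete in code. Below is a minimal sketch on synthetic data (the post uses real income data); the variable names and coefficients are invented for illustration.

# Hypothetical positive outcome, loosely income-like.
set.seed(2019)
d <- data.frame(age = runif(200, 20, 60))
d$income <- 20000 * exp(0.05 * d$age + rnorm(200, sd = 0.3))

# (1) Data transform: regress log(income) on the inputs.
#     Errors are penalized on the log scale (roughly relative error).
model_transform <- lm(log(income) ~ age, data = d)

# (2) Link function: model income directly through a log link.
#     Errors are penalized on the original (dollar) scale.
model_link <- glm(income ~ age,
                  family = quasipoisson(link = "log"), data = d)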
Common Ensemble Models can be Biased
In our previous article, we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved. However, when making predictions on individuals, a biased model may be pref...
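The calibration property is easy to check empirically: compare the rollup (mean) of the predictions to the rollup of the training outcomes. A minimal sketch on synthetic skewed data; the post works a more careful example.

library(randomForest)

# A skewed synthetic outcome.
set.seed(25253)
d <- data.frame(x = runif(300))
d$y <- exp(2 * d$x + rnorm(300, sd = 0.5))

model_rf <- randomForest(y ~ x, data = d)
model_lm <- lm(y ~ x, data = d)

mean(d$y)                             # rollup of the training outcome
mean(predict(model_rf, newdata = d))  # random forest: can drift from it
mean(predict(model_lm, newdata = d))  # linear model: matches by construction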
An Ad-hoc Method for Calibrating Uncalibrated Models
In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more...
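One way to sketch the general recipe: fit a simple calibrated model of the outcome as a function of the uncalibrated model's predictions, then use the composite. This is my reading of the idea on synthetic data; the post's exact procedure (and its treatment of holdout data) may differ.

library(randomForest)

# Skewed synthetic data and an uncalibrated ensemble model.
set.seed(34903)
d <- data.frame(x = runif(300))
d$y <- exp(2 * d$x + rnorm(300, sd = 0.5))
model_rf <- randomForest(y ~ x, data = d)
d$pred_uncal <- predict(model_rf, newdata = d)

# Calibration step: a linear model of the outcome on the raw predictions.
calib <- lm(y ~ pred_uncal, data = d)
d$pred_cal <- predict(calib, newdata = d)

mean(d$y)          # the training rollup
mean(d$pred_cal)   # now matches, since lm preserves the mean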
WVPlots 1.1.2 on CRAN
I have put a new release of the WVPlots package up on CRAN. This release adds palette and/or color controls to most of the plotting functions in the package. WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to turn into “one-liners.” A consequence of this is that t...
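As a usage sketch, a WVPlots one-liner with color controls might look like the following; the point_color and hist_color argument names are my assumption about the new controls, so check the package documentation for your installed version.

library(WVPlots)

# Scatterplot with marginal histograms, colors overridden.
# Argument names below are assumptions, not confirmed API.
ScatterHist(iris, "Sepal.Length", "Petal.Length",
            title = "Sepal vs petal length",
            point_color = "darkgray",
            hist_color = "darkblue")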
Why Do We Plot Predictions on the x-axis?
When studying regression models, one of the first diagnostic plots most students learn is to plot residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example.

# build an "ideal" linear process.
set.seed(34524)
N = 100
x1 = runif(N)
x2 = runif(N)
noise = 0.25*rnorm(N)
y = x1 + x2 + noise
df = ...
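Completing the truncated setup above, a sketch of the fit and the diagnostic plot (the data-frame construction and plotting details are my reconstruction, not necessarily the post's exact code):

library(ggplot2)

set.seed(34524)
N = 100
x1 = runif(N)
x2 = runif(N)
noise = 0.25*rnorm(N)
df = data.frame(x1 = x1, x2 = x2, y = x1 + x2 + noise)

# Fit the model and plot residuals against predictions.
model = lm(y ~ x1 + x2, data = df)
df$pred = predict(model)
df$resid = df$y - df$pred

ggplot(df, aes(x = pred, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  ggtitle("Residuals vs. predictions")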
When Cross-Validation is More Powerful than Regularization
Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller coefficients are less sensit...
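For reference, ridge regression with a cross-validated penalty is a one-liner with glmnet; a minimal sketch on deliberately collinear synthetic data (the data and penalty choice are illustrative, not from the post).

library(glmnet)

# Two nearly collinear inputs, which destabilize ordinary least squares.
set.seed(2019)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + 0.01 * rnorm(n)
y <- x1 + x2 + 0.25 * rnorm(n)

X <- cbind(x1, x2)
cvfit <- cv.glmnet(X, y, alpha = 0)   # alpha = 0 selects ridge
coef(cvfit, s = "lambda.min")         # shrunken, stabler coefficients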