Publications by Nina Zumel

Using replyr::let to Parameterize dplyr Expressions

06.12.2016

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the ...

2980 sym R (2095 sym/3 pcs) 2 img

A Simple Example of Using replyr::gapply

19.12.2016

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Suc...

3543 sym R (2588 sym/4 pcs) 2 img
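
The split, transform, and recombine pattern described in the teaser above can be sketched in base R. This is only an illustration: it does not use replyr::gapply itself, and the data values and the `centered` column are made up for the example.

```r
# Hypothetical long-format data: one row per measurement, tagged by process.
d <- data.frame(
  measurement = c(1, 2, 3, 10, 20, 30),
  process_that_produced_measurement = c("a", "a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Split the table into one data frame per process.
pieces <- split(d, d$process_that_produced_measurement)

# Transform each piece independently (here: center within each process).
centered <- lapply(pieces, function(di) {
  di$centered <- di$measurement - mean(di$measurement)
  di
})

# Bring the per-process results back together for comparison.
result <- do.call(rbind, centered)
rownames(result) <- NULL
```
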

Teaching pivot / un-pivot

11.04.2017

Authors: John Mount and Nina Zumel. Introduction: In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot. One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or ...

7079 sym R (2630 sym/9 pcs) 4 img

Custom Level Coding in vtreat

25.09.2017

One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA “one-hot encoding”). Level coding can be computationally and statistically prefe...

7891 sym R (4197 sym/9 pcs) 8 img
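
To make the idea of level coding concrete, here is a naive base R sketch that replaces each level of a categorical variable with the mean outcome for that level, minus the grand mean. It is an assumption-laden toy, not vtreat's implementation, and the data frame and the `x_impact` column name are hypothetical.

```r
# Hypothetical data: categorical input x, numeric outcome y.
d <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(1, 3, 10, 12, 5, 5),
  stringsAsFactors = FALSE
)

# Naive level (impact) code: per-level mean of y, relative to the grand mean.
grand_mean <- mean(d$y)                 # 6
level_means <- tapply(d$y, d$x, mean)   # a = 2, b = 11, c = 5
d$x_impact <- as.numeric(level_means[d$x] - grand_mean)

# For comparison: indicator ("one-hot") coding needs one column per level.
indicators <- model.matrix(~ x - 1, data = d)
```

The single numeric column `x_impact` stands in for the three indicator columns. Computing the code and fitting a model on the same rows, as this sketch would encourage, invites overfitting, so treat it as a picture of the idea only.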

Partial Pooling for Lower Variance Variable Encoding

28.09.2017

(Image: Banaue rice terraces. Photo: Jon Rawlinson.) In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R. We will use the lme4 package to fit the hierarc...

10086 sym R (1150 sym/6 pcs) 14 img 1 tbl

Announcing Practical Data Science with R, 2nd Edition

15.08.2018

We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R! Manning Publications has just announced the launching of the MEAP (Manning Early Access Program) for the second edition. The MEAP allows you to subscribe to drafts of chapters as they become available, and give us feedback before the ...

1705 sym 2 img

Faceted Graphs with cdata and ggplot2

21.10.2018

In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the iris data, like so: I wanted a plot for petal dimensions and sepal dimensions, but I also felt that two p...

2364 sym R (1424 sym/8 pcs) 8 img

Scatterplot matrices (pair plots) with cdata and ggplot2

27.10.2018

In my previous post, I showed how to use the cdata package along with ggplot2’s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot? A pairs plot compactly plots every (numeric) variable in a dataset against every other o...

3062 sym R (3714 sym/12 pcs) 4 img

More on sigr

06.11.2018

If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form: model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris) (smod_lm <- summary(model_lm)) ## ## Call: ## lm(formula = Petal.Length ~ Sepal.L...

1693 sym R (2419 sym/6 pcs)
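
As a small companion to the teaser above, the following base R sketch shows that the same statistics can be read from the summary object as well as from its printed form. It uses only base R and the built-in iris data, not sigr; the variable names `f` and `p_value` are introduced here for illustration.

```r
# Same model as in the teaser.
model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
smod_lm <- summary(model_lm)

# R-squared is stored as a plain number.
smod_lm$r.squared

# The F statistic is a named vector: value, numdf, dendf.
f <- smod_lm$fstatistic

# The p-value of the F test is not stored; it can be derived from the
# F distribution with the stored degrees of freedom.
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
```
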

PDSwR2: New Chapters!

06.02.2019

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! The newly available chapters cover: Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more....

1420 sym 2 img