Publications by Nina Zumel
Using replyr::let to Parameterize dplyr Expressions
Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the ...
2980 sym R (2095 sym/3 pcs) 2 img
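To make the two kinds of summary in the entry above concrete, here is a minimal base R sketch. It does not use replyr::let. The helper name summarize_values, its method argument, and the choice of the first and third quartiles as the IQR-based interval are assumptions made for illustration, not taken from the original post.

```r
# Hypothetical helper (not from the original post): summarize a numeric
# vector either as mean +/- one standard deviation, or as the median
# with the first and third quartiles as the surrounding interval.
summarize_values <- function(x, method = c("mean", "median")) {
  method <- match.arg(method)
  if (method == "mean") {
    center <- mean(x)
    spread <- sd(x)
    c(low = center - spread, center = center, high = center + spread)
  } else {
    # names = FALSE keeps the result names as low/center/high
    q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
    c(low = q[1], center = q[2], high = q[3])
  }
}

# Same call shape, two different summaries of the same column.
summarize_values(iris$Sepal.Length, "mean")
summarize_values(iris$Sepal.Length, "median")
```

Selecting the summary with an argument keeps the calling code identical whichever statistic an application needs.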
A Simple Example of Using replyr::gapply
It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Suc...
3543 sym R (2588 sym/4 pcs) 2 img
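The split, transform, and recombine pattern described in the entry above can be sketched in base R as follows. This is only an illustration of the general idea, not of replyr::gapply itself; the toy data and the per-process centering step are assumptions.

```r
# Toy long-format data; the column names follow the example in the entry above,
# the values are made up for illustration.
d <- data.frame(
  measurement = c(1, 2, 3, 10, 20, 30),
  process_that_produced_measurement = c("a", "a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Split: one data frame per process.
pieces <- split(d, d$process_that_produced_measurement)

# Apply: center each process's measurements on that process's own mean.
transformed <- lapply(pieces, function(di) {
  di$centered <- di$measurement - mean(di$measurement)
  di
})

# Combine: stack the per-process results back into one data frame.
result <- do.call(rbind, transformed)
rownames(result) <- NULL
print(result)
```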
Teaching pivot / un-pivot
Authors: John Mount and Nina Zumel. Introduction: In teaching thinking in terms of coordinatized data, we find the hardest operations to teach are joins and pivot. One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or ...
7079 sym R (2630 sym/9 pcs) 4 img
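As a small illustration of the two directions discussed in the entry above, the sketch below un-pivots a wide table into an entity/attribute/value form and then pivots it back, using only base R. The toy data and the column names id, key, and value are assumptions; the original article may use different tools.

```r
# Toy wide table: one row per entity, one column per measurement.
wide <- data.frame(id = c(1, 2), a = c(1, 2), b = c(3, 4))

# Un-pivot ("stack" / "melt"): one row per (entity, attribute) pair.
long <- data.frame(
  id    = rep(wide$id, times = 2),
  key   = rep(c("a", "b"), each = nrow(wide)),
  value = c(wide$a, wide$b),
  stringsAsFactors = FALSE
)
print(long)

# Pivot: move the values back into one column per attribute.
# stats::reshape names the new columns value.a and value.b.
wide_again <- reshape(long, idvar = "id", timevar = "key", direction = "wide")
print(wide_again)
```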
Custom Level Coding in vtreat
One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA “one-hot encoding”). Level coding can be computationally and statistically prefe...
7891 sym R (4197 sym/9 pcs) 8 img
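One simple form of level coding for a numeric outcome replaces each level with the mean outcome for that level minus the overall mean. The base R sketch below shows that idea on made-up data; it is an assumption-laden illustration and not vtreat's actual implementation, which may add smoothing and cross-validation.

```r
# Toy data: x is the categorical variable, y the numeric outcome.
d <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(1, 3, 10, 14, 2),
  stringsAsFactors = FALSE
)

grand_mean  <- mean(d$y)               # overall mean of the outcome
level_means <- tapply(d$y, d$x, mean)  # mean outcome per level, named by level
impact      <- level_means - grand_mean

# Replace each level with its single numeric code.
d$x_coded <- as.numeric(impact[d$x])
print(d)
```

The result is one numeric column regardless of how many levels x has, in contrast to one indicator column per level.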
Partial Pooling for Lower Variance Variable Encoding
[Image: Banaue rice terraces. Photo: Jon Rawlinson] In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R. We will use the lme4 package to fit the hierarc...
10086 sym R (1150 sym/6 pcs) 14 img 1 tbl
Announcing Practical Data Science with R, 2nd Edition
We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R! Manning Publications has just announced the launching of the MEAP (Manning Early Access Program) for the second edition. The MEAP allows you to subscribe to drafts of chapters as they become available, and give us feedback before the ...
1705 sym 2 img
Faceted Graphs with cdata and ggplot2
In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the iris data, like so: I wanted a plot for petal dimensions and sepal dimensions, but I also felt that two p...
2364 sym R (1424 sym/8 pcs) 8 img
Scatterplot matrices (pair plots) with cdata and ggplot2
In my previous post, I showed how to use the cdata package along with ggplot2's faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot? A pairs plot compactly plots every (numeric) variable in a dataset against every other o...
3062 sym R (3714 sym/12 pcs) 4 img
More on sigr
If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form:
model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
(smod_lm <- summary(model_lm))
##
## Call:
## lm(formula = Petal.Length ~ Sepal.L...
1693 sym R (2419 sym/6 pcs)
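The same statistics can also be read directly from the fields of the summary object. The sketch below reuses the model from the entry above and uses only base R and stats; it does not use sigr, and the variable names r_squared, f_stat, and p_value are chosen here for illustration.

```r
model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
smod_lm <- summary(model_lm)

# R-squared is stored as a plain number.
r_squared <- smod_lm$r.squared

# The F statistic is a named vector: value, numdf, dendf.
f_stat <- smod_lm$fstatistic

# The overall p-value is not stored as a field; compute it from the F distribution.
p_value <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
              lower.tail = FALSE)

print(r_squared)
print(f_stat)
print(p_value)
```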
PDSwR2: New Chapters!
We have two new chapters of Practical Data Science with R, Second Edition online and available for review! The newly available chapters cover: Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more....
1420 sym 2 img