Publications by Nina Zumel

The Extra Step: Graphs for Communication versus Exploration

12.01.2014

Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical. One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of ...

8740 sym R (5436 sym/9 pcs) 18 img

The Statistics behind “Verification by Multiplicity”

01.03.2014

There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope. We normally don’t write about science here at Win-Vector, but we do somet...

2934 sym 2 img

Practical Data Science with R: Release date announced

25.03.2014

It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in a...

1012 sym 4 img

Bandit Formulations for A/B Tests: Some Intuition

24.04.2014

“Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.” – Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web” (2007). A/B tests are one of the simplest ways of running controlled experiments to evaluate the e...

9527 sym R (2679 sym/7 pcs) 14 img

Trimming the Fat from glm() Models in R

30.05.2014

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models...

6435 sym R (7452 sym/11 pcs) 8 img

Vtreat: designing a package for variable treatment

07.08.2014

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again: missing values (NA or blanks); problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1); valid categorical levels that don’t appear in the training data (especially when th...

15822 sym R (10113 sym/17 pcs) 4 img

Estimating Generalization Error with the PRESS statistic

25.09.2014

As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however you choose to define it). In particula...

9386 sym R (2394 sym/9 pcs) 6 img

The Geometry of Classifiers

18.12.2014

As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado et al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly ...

13027 sym 12 img

Does Balancing Classes Improve Classifier Performance?

27.02.2015

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical o...

12441 sym R (924 sym/2 pcs) 12 img

Wanted: A Perfect Scatterplot (with Marginals)

11.06.2015

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki. The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot”; it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variab...

3081 sym 8 img