Publications by Nina Zumel

Bandit Formulations for A/B Tests: Some Intuition

24.04.2014

“Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.” – Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web” (2007)
A/B tests are one of the simplest ways of running controlled experiments to evaluate the e...

9527 sym R (2679 sym/7 pcs) 14 img

Trimming the Fat from glm() Models in R

30.05.2014

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models...

6435 sym R (7452 sym/11 pcs) 8 img

Vtreat: designing a package for variable treatment

07.08.2014

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again: missing values (NA or blanks); problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1); valid categorical levels that don’t appear in the training data (especially when th...

15822 sym R (10113 sym/17 pcs) 4 img

Estimating Generalization Error with the PRESS statistic

25.09.2014

As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however you choose to define it). In particula...

9386 sym R (2394 sym/9 pcs) 6 img

The Geometry of Classifiers

18.12.2014

As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado et al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly ...

13027 sym 12 img

Does Balancing Classes Improve Classifier Performance?

27.02.2015

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical o...

12441 sym R (924 sym/2 pcs) 12 img

Wanted: A Perfect Scatterplot (with Marginals)

11.06.2015

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki. The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variab...

3081 sym 8 img

Working with Sessionized Data 1: Evaluating Hazard Models

08.07.2015

When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a column. Usually transforming data into this ...

821 sym

Working with Sessionized Data 2: Variable Selection

15.07.2015

In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we ses...

858 sym

Bootstrap Evaluation of Clusters

04.09.2015

The goal of cluster analysis is to group the observations in the data into clusters such that every datum in a cluster is more similar to other datums in the same cluster than it is to datums in other clusters. This is an analysis method of choice when annotated training data is not readily available. In this ...

6960 sym R (5859 sym/4 pcs) 4 img