Publications by Nina Zumel
How do you know if your model is going to work?
Authors: John Mount (more articles) and Nina Zumel (more articles). Our four part article series collected into one piece. Part 1: The problem Part 2: In-training set measures Part 3: Out of sample procedures Part 4: Cross-validation techniques “Essentially, all models are wrong, but some are useful.” George Box Here’s a caricature of a ...
26105 sym 32 img
A Simpler Explanation of Differential Privacy
Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork, et. al. (see references at the end of the article) that apply results from differential privacy to machine learning. In this article we’ll work through ...
14303 sym R (1207 sym/1 pcs) 18 img
Our Differential Privacy Mini-series
We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we’ve tried to touch...
1880 sym 2 img
Upcoming Win-Vector Appearances
We have two public appearances coming up in the next few weeks: Workshop at ODSC, San Francisco – November 14 Both of us will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect of data science, what can go wrong, and h...
1709 sym
“Introduction to Data Science” video course contest is closed
Congratulations to all the winners of the Win-Vector “Introduction to Data Science” Video Course giveaway! We’ve emailed all of you your individual subscription coupons. Even though this contest is over, we still encourage those interested to join our mailing list. Our updates to the list will be infrequent, but (we hope) informative. For ...
1319 sym
Using PostgreSQL in R: A quick how-to
The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverles...
3578 sym R (6578 sym/11 pcs) 2 img
Finding the K in K-means by Parametric Bootstrap
One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book Practical Data S...
7641 sym 14 img
Principal Components Regression, Pt.1: The Standard Method
In this note, we discuss principal components regression and some of the issues with it: The need for scaling. The need for pruning. The lack of “y-awareness” of the standard dimensionality reduction step. The purpose of this article is to set the stage for presenting dimensionality reduction techniques appropriate for predicti...
15424 sym R (14718 sym/18 pcs) 21 img
Principal Components Regression, Pt. 2: Y-Aware Methods
In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in...
14346 sym R (6949 sym/12 pcs) 16 img
Principal Components Regression, Pt. 3: Picking the Number of Components
In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of ...
9503 sym R (3914 sym/4 pcs) 8 img