Publications by Nina Zumel

Wanted: A Perfect Scatterplot (with Marginals)

11.06.2015

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki. The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variab...

3081 sym 8 img

Working with Sessionized Data 1: Evaluating Hazard Models

08.07.2015

When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a column. Usually transforming data into this ...

821 sym

Working with Sessionized Data 2: Variable Selection

15.07.2015

In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we ses...

858 sym

Bootstrap Evaluation of Clusters

04.09.2015

Illustration from Project Gutenberg. The goal of cluster analysis is to group the observations in the data into clusters such that every datum in a cluster is more similar to other datums in the same cluster than it is to datums in other clusters. This is an analysis method of choice when annotated training data is not readily available. In this ...

6960 sym R (5859 sym/4 pcs) 4 img

How do you know if your model is going to work?

22.09.2015

Authors: John Mount and Nina Zumel. Our four-part article series collected into one piece. Part 1: The problem; Part 2: In-training set measures; Part 3: Out of sample procedures; Part 4: Cross-validation techniques. “Essentially, all models are wrong, but some are useful.” (George Box) Here’s a caricature of a ...

26105 sym 32 img

A Simpler Explanation of Differential Privacy

02.10.2015

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning. In this article we’ll work through ...

14303 sym R (1207 sym/1 pcs) 18 img

Our Differential Privacy Mini-series

01.11.2015

We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we’ve tried to touch...

1880 sym 2 img

Upcoming Win-Vector Appearances

09.11.2015

We have two public appearances coming up in the next few weeks. Workshop at ODSC, San Francisco – November 14: Both of us will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect of data science, what can go wrong, and h...

1709 sym

“Introduction to Data Science” video course contest is closed

26.01.2016

Congratulations to all the winners of the Win-Vector “Introduction to Data Science” Video Course giveaway! We’ve emailed all of you your individual subscription coupons. Even though this contest is over, we still encourage those interested to join our mailing list. Our updates to the list will be infrequent, but (we hope) informative. For ...

1319 sym

Using PostgreSQL in R: A quick how-to

01.02.2016

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverles...

3578 sym R (6578 sym/11 pcs) 2 img