Publications by John Mount

vtreat up on CRAN!

06.09.2015

Nina Zumel and I are proud to announce our R vtreat variable treatment library has just been accepted by CRAN! It will take some time for the vtreat package to progress to various CRAN mirrors, but as of now you can install vtreat with the command: install.packages('vtreat', repos='http://cran.r-project.org/'). Instead of needing to use devt...

5925 sym 2 img

How do you know if your model is going to work? Part 3: Out of sample procedures

14.09.2015

Authors: John Mount and Nina Zumel. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four-part mini-series “How do you know if your model is going to work?”...

5630 sym 14 img

Using differential privacy to reuse training data

05.10.2015

Win-Vector LLC’s Nina Zumel wrote a great article explaining differential privacy and demonstrating how to use it to enhance forward step-wise logistic regression. This allowed her to reproduce results similar to the recent Science paper “The reusable holdout: Preserving validity in adaptive data analysis”. The technique essentially prot...

12294 sym 24 img

Some key Win-Vector serial data science articles

07.10.2015

As readers have surely noticed, the Win-Vector LLC blog isn’t a stream of short notes, but instead a collection of long technical articles. It is the only way we can properly treat topics of consequence. What not everybody may have noticed is that a number of these articles are organized into series for deeper comprehension. The key series includ...

1634 sym 2 img

Don’t use stats::aggregate()

31.10.2015

When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate(). Rea...

1844 sym R (1845 sym/4 pcs)

Fast food, fast publication

08.11.2015

The following article is getting quite a lot of press right now: David Just and Brian Wansink (2015). Fast Food, Soft Drink, and Candy Intake is Unrelated to Body Mass Index for 95% of American Adults. Obesity Science & Practice, forthcoming (upcoming in a new pay-for-placement journal). Obviously it is a popular contrary position (some coverage...

3933 sym R (899 sym/1 pcs) 2 img

Free gradient boosting lecture

21.11.2015

We have always regretted that we didn’t get to cover gradient boosting in Practical Data Science with R (Manning 2014). To try to make up for that we are sharing (for free) our GBM lecture from our (paid) video course Introduction to Data Science. (link, all support material here). Please help us get the word out by sharing/Tweeting!

739 sym

Wald’s sequential analysis technique

10.12.2015

Microsoft Revolution Analytics has just posted our latest article on A/B testing: Wald’s graphical sequential inspection procedure. It is a fun appreciation of a really cool procedure and I hope you check it out. Figure 14, Section 6.4.2, page 111, Abraham Wald, Sequential Analysis, Dover 2004 (reprinting a 1947 edition).

729 sym 2 img

Sequential Analysis

11.12.2015

We here at Win-Vector LLC have been working through an ad-hoc series about A/B testing, combining elements of both operations research and statistical points of view: “A dynamic programming solution to A/B test design”; “Why does designing a simple A/B test seem so complicated?”; “A clear picture of power and significance in A/B tests”; “Bandit Formulations for...

6995 sym 6 img

Practical Data Science with R examples

11.12.2015

One of the big points of Practical Data Science with R is to supply a large number of fully worked examples. Our intent has always been for readers to read the book and, if they wanted to follow up on a data set or technique, to find the matching worked examples in the project directory of our book support materials git repository. Some readers w...

2173 sym