Publications by John Mount
vtreat up on CRAN!
Nina Zumel and I are proud to announce our R vtreat variable treatment library has just been accepted by CRAN! It will take some time for the vtreat package to progress to various CRAN mirrors, but as of now you can install vtreat with the command: install.packages('vtreat', repos='http://cran.r-project.org/') Instead of needing to use devt...
How do you know if your model is going to work? Part 3: Out of sample procedures
Authors: John Mount and Nina Zumel. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four-part mini-series “How do you know if your model is going to work?”...
Using differential privacy to reuse training data
Win-Vector LLC’s Nina Zumel wrote a great article explaining differential privacy and demonstrating how to use it to enhance forward step-wise logistic regression. This allowed her to reproduce results similar to the recent Science paper “The reusable holdout: Preserving validity in adaptive data analysis”. The technique essentially prot...
Some key Win-Vector serial data science articles
As readers have surely noticed, the Win-Vector LLC blog isn’t a stream of short notes, but instead a collection of long technical articles. It is the only way we can properly treat topics of consequence. What not everybody may have noticed is that a number of these articles are organized into series for deeper comprehension. The key series includ...
Don’t use stats::aggregate()
When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate(). Rea...
Fast food, fast publication
The following article is getting quite a lot of press right now: David Just and Brian Wansink (2015). Fast Food, Soft Drink, and Candy Intake is Unrelated to Body Mass Index for 95% of American Adults. Obesity Science & Practice, forthcoming (upcoming in a new pay-for-placement journal). Obviously it is a popular contrarian position (some coverage...
Free gradient boosting lecture
We have always regretted that we didn’t get to cover gradient boosting in Practical Data Science with R (Manning 2014). To try to make up for that we are sharing (for free) our GBM lecture from our (paid) video course Introduction to Data Science. (link, all support material here). Please help us get the word out by sharing/Tweeting!
Wald’s sequential analysis technique
Microsoft Revolution Analytics has just posted our latest article on A/B testing: Wald’s graphical sequential inspection procedure. It is a fun appreciation of a really cool procedure and I hope you check it out. Figure 14, Section 6.4.2, page 111, Abraham Wald, Sequential Analysis, Dover 2004 (reprinting a 1947 edition).
Sequential Analysis
We here at Win-Vector LLC have been working through an ad-hoc series about A/B testing combining elements of both operations research and statistical points of view: A dynamic programming solution to A/B test design; Why does designing a simple A/B test seem so complicated?; A clear picture of power and significance in A/B tests; Bandit Formulations for...
Practical Data Science with R examples
One of the big points of Practical Data Science with R is to supply a large number of fully worked examples. Our intent has always been for readers to read the book and, if they wanted to follow up on a data set or technique, to find the matching worked examples in the project directory of our book support materials git repository. Some readers w...