Publications by John Mount
What is new in the vtreat library?
The Win-Vector LLC vtreat library is a library we supply (under a GPL license) for automating the simple domain independent part of variable cleaning an preparation. The idea is you supply (in R) an example general data.frame to vtreat’s designTreatmentsC method (for single-class categorical targets) or designTreatmentsN method (for numeric tar...
5120 sym 2 img
My favorite R bug
In this note am going to recount “my favorite R bug.” It isn’t a bug in R. It is a bug in some code I wrote in R. I call it my favorite bug, as it is easy to commit and (thanks to R’s overly helpful nature) takes longer than it should to find. The original problem I was working on was generating a training set as a subset of a simple dat...
4606 sym R (3136 sym/8 pcs) 4 img
R in a 64 bit world
32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limita...
7104 sym R (977 sym/6 pcs) 4 img
A bit about Win-Vector LLC
Win-Vector LLC is a consultancy founded in 2007 that specializes in research, algorithms, data-science, and training. (The name is an attempt at a mathematical pun.) Win-Vector LLC can complete your high value project quickly (some examples), and train your data science team to work much more effectively. Our consultants include the a...
820 sym
What is a good Sharpe ratio?
We have previously written that we like the investment performance summary called the Sharpe ratio (though it does have some limits). What the Sharpe ratio does is: give you a dimensionless score to compare similar investments that may vary both in riskiness and returns without needing to know the investor’s risk tolerance. It does ...
802 sym
A dynamic programming solution to A/B test design
Our last article on A/B testing described the scope of the realistic circumstances of A/B testing in practice and gave links to different standard solutions. In this article we will be take an idealized specific situation allowing us to show a particularly beautiful solution to one very special type of A/B test. For this article … C...
815 sym
vtreat up on CRAN!
Nina Zumel and I are proud to announce our R vtreat variable treatment library has just been accepted by CRAN! It will take some time for the vtreat package to progress to various CRAN mirrors, but as of now you can install vtreat with the command: install.packages('vtreat', repos='http://cran.r-project.org/') Instead of needing to use devt...
5925 sym 2 img
How do you know if your model is going to work? Part 3: Out of sample procedures
Authors: John Mount (more articles) and Nina Zumel (more articles). When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four part mini-series “How do you know if your model is going to work?”...
5630 sym 14 img
Using differential privacy to reuse training data
Win-Vector LLC‘s Nina Zumel wrote a great article explaining differential privacy and demonstrating how to use it to enhance forward step-wise logistic regression. This allowed her to reproduce results similar to the recent Science paper “The reusable holdout: Preserving validity in adaptive data analysis”. The technique essentially prot...
12294 sym 24 img
Some key Win-Vector serial data science articles
As readers have surely noticed the Win-Vector LLC blog isn’t a stream of short notes, but instead a collection of long technical articles. It is the only way we can properly treat topics of consequence. What not everybody may have noticed is a number of these articles are serialized into series for deeper comprehension. The key series includ...
1634 sym 2 img