Publications by John Mount

Practical Data Science with R October 2013 update

26.10.2013

A quick status update on our upcoming book “Practical Data Science with R” by Nina Zumel and John Mount. We are really happy with how the book is coming out. We were able to cover almost everything we hoped to. Part 1 (especially chapter 3) is already being used in courses, and has some very good material on how to review data. Part 2 covers th...

2144 sym 2 img

Practical Data Science with R: Manning Deal of the Day November 19th 2013

19.11.2013

Please share: Manning Deal of the Day November 19: Half off Practical Data Science with R. Use code dotd1119au at www.manning.com/zumel/.

701 sym 2 img

Sample size and power for rare events

03.12.2013

We have written a bit on sample size for common events. We would like to extend this analysis to rare events. In web marketing and a lot of other applications you are trying to estimate the probability of an event (like conversion) where the probability is fairly low (say 5% to 0.5%). In this case our rules of thumb given in 1 and 2 are a bit...

3431 sym R (849 sym/1 pcs) 14 img

Generalized linear models for predicting rates

01.01.2014

I often need to build a predictive model that estimates rates. The example of our age is ad click-through rates (how often a viewer clicks on an ad, estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialize...

8119 sym 6 img

Use standard deviation (not mad about MAD)

19.01.2014

Nassim Nicholas Taleb recently wrote an article advocating abandoning standard deviation in favor of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure, but there is a reason that standard deviation is important even if you do not like it: it prefers models that get total...

5851 sym 4 img

Bad Bayes: an example of why you need hold-out testing

01.02.2014

We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit. The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k...

6898 sym R (6028 sym/9 pcs) 2 img

Unprincipled Component Analysis

10.02.2014

As a data scientist I have seen variations of principal component analysis and factor analysis so often blindly misapplied and abused that I have come to think of the technique as unprincipled component analysis. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) belie...

25142 sym 8 img

The gap between data mining and predictive models

20.02.2014

The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool...

10926 sym 8 img

One day discount on Practical Data Science with R

21.02.2014

Please forward and share this discount offer for our upcoming book. Manning Deal of the Day February 22: Half off Practical Data Science with R. Use code dotd022214au at www.manning.com/zumel/.

757 sym 2 img

Some statistics about the book

04.03.2014

The release date for Zumel, Mount “Practical Data Science with R” is getting close. I thought I would share a few statistics about what goes into this kind of book. “Practical Data Science with R” started formal work in October of 2012. We had always felt the Win-Vector blog represented practice and research for such an effort, but thi...

2771 sym 4 img