Publications by John Mount

Use standard deviation (not mad about MAD)

19.01.2014

Nassim Nicholas Taleb recently wrote an article advocating abandoning standard deviation in favor of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure, but there is a reason that standard deviation is important even if you do not like it: it prefers models that get total...

5851 sym 4 img
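
A minimal Python sketch of one well-known property behind the standard deviation entry above (the excerpt is truncated, so this is not necessarily the post's exact argument): the constant that minimizes total squared error is the mean, which reproduces the data's total, while the constant that minimizes total absolute error is the median, which need not. The data values and the candidate grid are hypothetical.

```python
def total_squared_error(values, guess):
    """Sum of squared deviations from a single constant guess."""
    return sum((v - guess) ** 2 for v in values)


def total_absolute_error(values, guess):
    """Sum of absolute deviations from a single constant guess."""
    return sum(abs(v - guess) for v in values)


def best_constant(values, loss, candidates):
    """Return the candidate constant with the smallest loss."""
    return min(candidates, key=lambda c: loss(values, c))


# Hypothetical skewed data: mean is 3, median is 1.
values = [1, 1, 1, 2, 10]
candidates = [i / 100 for i in range(0, 1001)]  # grid 0.00 .. 10.00

best_sq = best_constant(values, total_squared_error, candidates)
best_abs = best_constant(values, total_absolute_error, candidates)

print("squared-error choice:", best_sq, "implied total:", best_sq * len(values))
print("absolute-error choice:", best_abs, "implied total:", best_abs * len(values))
print("actual total:", sum(values))
```

On this data the squared-error choice is 3.0 (implied total 15, matching the actual total) and the absolute-error choice is 1.0 (implied total 5).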

Bad Bayes: an example of why you need hold-out testing

01.02.2014

We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit. The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k...

6898 sym R (6028 sym/9 pcs) 2 img
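
A rough Python sketch of why hold-out testing matters when features are nearly unique. It assumes a deliberately naive lookup-table "model" rather than the algorithms used in the post, and the token names, row counts, and seed are all hypothetical. Labels are coin flips, so there is nothing real to learn.

```python
import random
from collections import Counter

rng = random.Random(2014)  # fixed seed so the run is repeatable


def make_rows(n, start_id):
    """Each row has a token seen nowhere else and a coin-flip label."""
    return [(f"token_{start_id + i}", rng.random() < 0.5) for i in range(n)]


train = make_rows(1000, 0)
holdout = make_rows(1000, 1000)  # token ids do not overlap with training

# "Model": memorize the label of every training token,
# and fall back to the majority training label for unseen tokens.
memory = dict(train)
fallback = Counter(label for _, label in train).most_common(1)[0][0]


def predict(token):
    return memory.get(token, fallback)


def accuracy(rows):
    return sum(predict(token) == label for token, label in rows) / len(rows)


train_accuracy = accuracy(train)      # perfect, by memorization
holdout_accuracy = accuracy(holdout)  # close to a coin flip

print("training accuracy:", train_accuracy)
print("hold-out accuracy:", holdout_accuracy)
```

Training accuracy is perfect because every token is memorized; only the hold-out score reveals that the model has learned nothing.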

Unprincipled Component Analysis

10.02.2014

As a data scientist I have seen variations of principal component analysis and factor analysis so often blindly misapplied and abused that I have come to think of the technique as unprincipled component analysis. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) belie...

25142 sym 8 img

The gap between data mining and predictive models

20.02.2014

The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool...

10926 sym 8 img

One day discount on Practical Data Science with R

21.02.2014

Please forward and share this discount offer for our upcoming book. Manning Deal of the Day February 22: Half off Practical Data Science with R. Use code dotd022214au at www.manning.com/zumel/.

757 sym 2 img

Some statistics about the book

04.03.2014

The release date for Zumel, Mount “Practical Data Science with R” is getting close. I thought I would share a few statistics about what goes into this kind of book. “Practical Data Science with R” started formal work in October of 2012. We had always felt the Win-Vector blog represented practice and research for such an effort, but thi...

2771 sym 4 img

You don’t need to understand pointers to program using R

01.04.2014

R is a statistical analysis package based on writing short scripts or programs (versus being based on GUIs like spreadsheets or directed workflow editors). I say “writing short scripts” because R’s programming language (itself called S) is a bit of an oddity that you really wouldn’t be using except it gives you access to superior analyti...

6842 sym 2 img

Old tails: a crude power law fit on ebook sales

18.04.2014

We use R to take a very brief look at the distribution of e-book sales on Amazon.com. Recently Hugh Howey shared some eBook sales data spidered from Amazon.com: The 50k Report. The data is largely a single scrape of statistics about various anonymized books. Howey’s analysis tries to break sales down by declared category and source, but ther...

5174 sym 4 img
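
One common way to make a crude power law fit is ordinary least squares on log-log axes. The Python sketch below assumes that approach (the excerpt does not say which method the post uses) and runs it on synthetic data, not on the Amazon scrape. The function name `fit_power_law` and the rank/sales numbers are hypothetical.

```python
from math import exp, log


def fit_power_law(xs, ys):
    """Fit y ~ scale * x**exponent by least squares on (log x, log y)."""
    log_x = [log(x) for x in xs]
    log_y = [log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
        / sum((a - mean_x) ** 2 for a in log_x)
    )
    intercept = mean_y - slope * mean_x
    return exp(intercept), slope  # (scale, exponent)


# Synthetic example: sales fall off as rank ** -1.5 (made-up numbers).
ranks = list(range(1, 101))
sales = [1000.0 * r ** -1.5 for r in ranks]

scale, exponent = fit_power_law(ranks, sales)
print("scale:", scale, "exponent:", exponent)
```

On noise-free synthetic data the fit recovers the generating scale and exponent; on real, noisy sales data a log-log regression is only a rough first look.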

A bit of the agenda of Practical Data Science with R

01.05.2014

The goal of Zumel/Mount: Practical Data Science with R is to teach, through guided practice, the skills of a data scientist. We define a data scientist as the person who organizes client input, data, infrastructure, statistics, mathematics and machine learning to deploy useful predictive models into production. Our plan to teach is to: Order t...

8141 sym 2 img

A clear picture of power and significance in A/B tests

03.05.2014

A/B tests are one of the simplest reliable experimental designs. “Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.” From “Practical guide to controlled experiments on the web: listen to your customers not to the HIPPO”, Ron Kohavi, Randa...

9381 sym 6 img
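
A hedged Python sketch of how significance and power interact in an A/B test. It assumes a one-sided two-proportion z-test with a normal approximation, which may differ from the post's own treatment. The function name `ab_test_power` and the conversion rates are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ab_test_power(p_a, p_b, n_per_arm, alpha=0.05):
    """Approximate chance of declaring B better than A at level alpha.

    Assumes equal arm sizes and a normal approximation to the difference
    in observed conversion rates.
    """
    # Standard error of the difference if there were no true difference.
    p_pool = (p_a + p_b) / 2
    se_null = sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
    # Standard error of the difference under the true rates.
    se_alt = sqrt(p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm)
    # Significance: smallest observed difference that rejects "no difference".
    threshold = NormalDist().inv_cdf(1 - alpha) * se_null
    # Power: probability the observed difference clears that threshold.
    return 1 - NormalDist(mu=p_b - p_a, sigma=se_alt).cdf(threshold)


power_small_n = ab_test_power(0.05, 0.06, n_per_arm=1000)
power_large_n = ab_test_power(0.05, 0.06, n_per_arm=20000)
power_no_effect = ab_test_power(0.05, 0.05, n_per_arm=1000)

print("power with 1,000 per arm:", round(power_small_n, 3))
print("power with 20,000 per arm:", round(power_large_n, 3))
print("false positive rate with no effect:", round(power_no_effect, 3))
```

With no true difference the "power" collapses to the significance level, and a fixed small lift needs a much larger sample before it is detected reliably.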