Publications by John Mount

Proofing statistics in papers

02.10.2016

Recently saw a really fun article making the rounds: “The prevalence of statistical reporting errors in psychology (1985–2013)”, Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M. et al., Behav Res (2015), doi:10.3758/s13428-015-0664-2. The authors built an R package to check psychology papers for statistical errors. Please read on for how th...

Adding polished significance summaries to papers using R

04.10.2016

When we teach “R for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms of a larger and more detailed version of the following list: To test ...

On calculating AUC

07.10.2016

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC). R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is instructive to see how ROC plots are produced and how AUC can be calculated...
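As a rough illustration of how a ROC curve and its AUC can be computed by hand, here is a minimal sketch in Python (not the R code from the article, and not the method of ROCR, pROC, or plotROC). The function names `roc_points`, `auc_trapezoid`, and `auc_pairwise` are hypothetical names chosen for this example. It assumes binary labels and that higher scores indicate the positive class, and it treats tied scores as a single threshold.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points,
    one per distinct score threshold, starting at (0, 0)."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    # Walk the examples from highest score to lowest.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores and labels, for illustration only.
scores = [0.9, 0.8, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
area = auc_trapezoid(roc_points(scores, labels))
rank_auc = auc_pairwise(scores, labels)
```

With ties handled this way, the trapezoid area and the pairwise ranking estimate agree, which is a useful sanity check when comparing results across packages.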

Data science for executives and managers

21.10.2016

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made up of practitio...

A quick look at RStudio’s R notebooks

22.10.2016

A quick demo of RStudio’s R Notebooks shown by John Mount (of Win-Vector LLC, a statistics, data science, and algorithms consulting and training firm). (see http://rmarkdown.rstudio.com/r_notebooks.html and https://www.rstudio.com/products/rstudio/download/preview/ )

Some vtreat design principles

01.11.2016

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design....

Laplace noising versus simulated out of sample methods (cross frames)

09.11.2016

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, that Nina and I respect but don’...

You should re-encode high cardinality categorical variables

11.11.2016

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product code...
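One common way to re-encode such a variable is impact (effects) coding: replace each level with the difference between the mean outcome for that level and the overall mean. The sketch below is a simplified Python illustration under that assumption, not the vtreat implementation. The names `fit_impact_code` and `apply_impact_code` and the `smoothing` parameter are hypothetical, and levels never seen during fitting are mapped to 0 (no evidence either way).

```python
from collections import defaultdict


def fit_impact_code(levels, outcomes, smoothing=0.0):
    """Map each level to (smoothed level mean) - (grand mean).

    `smoothing` acts like that many pseudo-observations at the grand mean,
    which pulls rare levels toward 0. It is an assumption of this sketch.
    """
    grand_mean = sum(outcomes) / len(outcomes)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lev, y in zip(levels, outcomes):
        sums[lev] += y
        counts[lev] += 1
    return {
        lev: (sums[lev] + smoothing * grand_mean) / (counts[lev] + smoothing)
        - grand_mean
        for lev in counts
    }


def apply_impact_code(codes, levels):
    """Replace levels with their numeric codes; novel levels get 0.0."""
    return [codes.get(lev, 0.0) for lev in levels]


# Made-up data, for illustration only.
train_levels = ["a", "a", "b", "c"]
train_outcomes = [1, 0, 1, 0]
codes = fit_impact_code(train_levels, train_outcomes)
encoded = apply_impact_code(codes, ["a", "b", "z"])
```

Fitting the codes and the downstream model on the same rows can overfit, so in practice the codes are usually estimated on separate or cross-validated data.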

Teaching Practical Data Science with R

16.11.2016

Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of. I have written before how I think this book stands out and why you should consider studying from it. Please read on for some additional comments on the intent of different sections of the book. With Practical Data Science with R we wanted t...

MySql in a container

19.11.2016

I have previously written on using containerized PostgreSQL with R. This shows the steps for using containerized MySQL with R. As a consulting data scientist I often have to debug and rehearse work away from the client's actual infrastructure. Because of this it is useful to be able to spin up disposable PostgreSQL or MySQL work environments. I...
