Publications by John Mount

A budget of classifier evaluation measures

21.07.2016

Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?” My concrete advice is: Read Nina Zumel’s excellent series on scoring classifiers. Keep notes. Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you ...

14579 sym R (4079 sym/11 pcs) 10 img

The magrittr monad

06.08.2016

Monads are a formal theory of composition where programmers get to invoke some very abstract mathematics (category theory) to argue the minutia of annotating, scheduling, sequencing operations, and side effects. On the positive side the monad axioms are a guarantee that related ways of writing code are in fact substitutable and equivalent; so you...

6749 sym R (1305 sym/8 pcs) 2 img

Can you nest parallel operations in R?

15.08.2016

When we teach parallel programming in R we start with the basic use of parallel (please see here for example). This is, in our opinion, a necessary step before getting into clever notation and wrapping such as doParallel and foreach. Only then do the students have a sufficiently explicit interface to frame important questions about the semantics ...

4972 sym R (3276 sym/8 pcs)

The Win-Vector parallel computing in R series

16.08.2016

With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details. For your convenience here they are in order: A gentle introduction to parallel computing in R Running R jobs ...

870 sym

My criticism of R numeric summary

18.08.2016

My criticism of R‘s numeric summary() method is: it unfaithful to numeric arguments (due to bad default behavior) and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of tradeoffs. summary() likely represents good wo...

8277 sym 2 img

vtreat 0.5.27 released on CRAN

19.08.2016

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN. vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. (from the package documentation) Very roughly vtreat accepts an arbitrary “from the wild” da...

3371 sym

The R community is awesome (and fast)

30.08.2016

Recently I whined/whinged or generally complained about a few sharp edges in some powerful R systems. In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem. Please read on for my list of n=3 in...

1853 sym

Variables can synergize, even in a linear model

01.09.2016

Introduction Suppose we have the task of predicting an outcome y given a number of variables v1,..,vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explain-ability, even reduce over-fit, and improv...

6469 sym R (6889 sym/5 pcs)

Did she know we were writing a book?

03.09.2016

Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge. Nina Zumel and I definitely troubled over possibilities for some time before deciding to write Practical Dat...

1297 sym 2 img

Relative error distributions, without the heavy tail theatrics

19.09.2016

Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing effects of relative error distributions (so it isn’...

12777 sym 17 img