Publications by John Mount

sample(): “Monkey’s Paw” style programming in R

22.03.2016

The R functions base::sample and base::sample.int are functions that include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested workaround. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matter...

5852 sym 2 img
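
As a hedged illustration of the kind of sample() “convenience” the note above refers to: base::sample treats a single number of at least 1 as shorthand for 1:x, so code that works on longer vectors can silently change meaning on a length-one vector. The helper name safe_sample below is hypothetical and is only a minimal sketch of one possible workaround (indexing by position); it is not necessarily the workaround the full article recommends.

```r
# Hypothetical helper (not from the original note): always samples from the
# elements of x, even when x has length one.
safe_sample <- function(x, size = length(x), replace = FALSE) {
  x[sample.int(length(x), size = size, replace = replace)]
}

set.seed(2016)
x <- c(10)

# base::sample treats a single number >= 1 as shorthand for 1:x,
# so this draws from 1..10 rather than from the one-element vector c(10).
surprising <- sample(x, 5, replace = TRUE)

# Sampling positions and then indexing keeps every draw inside x.
expected <- safe_sample(x, 5, replace = TRUE)

print(surprising)
print(expected)
```

The design choice is to sample positions with sample.int(length(x), ...) and then subset, which behaves the same way for vectors of any length.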

Upcoming Win-Vector LLC appearances

23.03.2016

Win-Vector LLC will be presenting on statistically validating models using R and data science at: Strata+Hadoop World “R Day” Tutorial, 9:00am–5:00pm, Tuesday, March 29, 2016, San Jose, California; and ODSC San Francisco Meetup, 6:30pm–9:00pm, Thursday, March 31, 2016, San Francisco, California. We will share code and examples. Registration requi...

834 sym 4 img 1 tbl

For loops in R can lose class information

24.03.2016

Did you know R’s for() loop control structure drops class annotations from vectors? Consider the following R code demonstrating three uses of a for-loop that one would expect to behave very similarly. dates <- c(as.Date('2015-01-01'),as.Date('2015-01-02')) for(ii in seq_along(dates)) { di <- dates[ii] print(di) } ## [1] "2015-01-01"...

3144 sym
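
The excerpt above is cut off before the contrasting cases, so here is a minimal sketch of the effect it describes, reusing the same dates vector. The variable names direct_classes and indexed_classes are illustrative and not from the original post.

```r
dates <- c(as.Date('2015-01-01'), as.Date('2015-01-02'))

# Iterating over the vector directly yields bare numbers
# (days since 1970-01-01), because for() drops the class attribute.
direct_classes <- character(0)
for (di in dates) {
  direct_classes <- c(direct_classes, class(di))
}

# Iterating over indices and subsetting keeps the Date class.
indexed_classes <- character(0)
for (ii in seq_along(dates)) {
  indexed_classes <- c(indexed_classes, class(dates[ii]))
}

print(direct_classes)   # "numeric" "numeric"
print(indexed_classes)  # "Date" "Date"
```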

WVPlots: example plots in R using ggplot2

01.04.2016

Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. The idea is: we sacrifice some of the flexibility and composability inherent to ggplot2 in R for a menu of prescribed presentation solutions (whic...

3391 sym 10 img

A bit on the F1 score floor

02.04.2016

At Strata+Hadoop World “R Day” Tutorial, Tuesday, March 29 2016, San Jose, California we spent some time on classifier measures derived from the so-called “confusion matrix.” We repeated our usual admonition to not use “accuracy” as a project goal (business people tend to ask for it as it is the word they are most familiar with, but i...

4929 sym 4 img
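
A small sketch of the confusion-matrix measures mentioned above, under stated assumptions: the helper confusion_measures and the counts are made up for illustration, and the excerpt is truncated before the article's own F1 argument, so this only shows the standard definitions. It illustrates why accuracy can mislead on unbalanced data, and that an always-positive classifier already achieves an F1 of 2p/(1+p) at prevalence p.

```r
# Hypothetical helper: classifier measures from confusion-matrix counts.
confusion_measures <- function(tp, fp, fn, tn) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(accuracy = (tp + tn) / (tp + fp + fn + tn),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

# Made-up data: 1000 examples, 100 of them positive (prevalence 0.1).
# A classifier that always says "negative" is 90% accurate yet finds nothing
# (its precision and F1 are 0/0, reported by R as NaN).
always_negative <- confusion_measures(tp = 0, fp = 0, fn = 100, tn = 900)

# A classifier that always says "positive" has recall 1 and precision 0.1.
always_positive <- confusion_measures(tp = 100, fp = 900, fn = 0, tn = 0)

print(always_negative)
print(always_positive)
```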

Half off Win-Vector data science books and video training!

08.04.2016

We are pleased to announce our book Practical Data Science with R (Nina Zumel, John Mount, Manning 2014) is part of Manning’s “Deal of the Day” of April 9th 2016. This one-day-only offer gets you half off the physical book (with free e-copy) or paid e-copy (e-copy simultaneous pdf + ePub + kindle, and DRM free!). Here is the discount count...

1171 sym

Free data science video lecture: debugging in R

09.04.2016

We are pleased to release a new free data science video lecture: Debugging R code using R, RStudio and wrapper functions. In this 8-minute video we demonstrate the incredible power of R using wrapper functions to catch errors for later reproduction and debugging. If you haven’t tried these techniques this will really improve your debugging ga...

810 sym
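
The wrapper-function idea in the lecture description above can be sketched in base R as follows. This is only an illustrative assumption about the general technique, not the code shown in the video: capture_failing_call, store, and checked_log are hypothetical names. The wrapper records the arguments of a failing call so the failure can be replayed later.

```r
# Hypothetical wrapper (illustrative only): returns a version of f that, on
# error, records the arguments of the failing call and then re-raises the error.
capture_failing_call <- function(f, store) {
  force(f)
  function(...) {
    args <- list(...)
    tryCatch(
      do.call(f, args),
      error = function(e) {
        store$last_args <- args
        store$last_error <- conditionMessage(e)
        stop(e)
      }
    )
  }
}

# An environment is used as mutable storage for the captured call.
store <- new.env()
checked_log <- capture_failing_call(function(x) {
  if (x < 0) stop("negative input")
  log(x)
}, store)

# The error still propagates to the caller, but the arguments are kept.
result <- tryCatch(checked_log(-1), error = function(e) "failed")
print(store$last_args)
print(store$last_error)
```

After a failure, the saved arguments in store$last_args can be passed back to the original function with do.call() to reproduce the error in a debugger.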

Improved vtreat documentation

17.04.2016

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here). vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically justifiable manner. Even with modern machine learning techniques (random forests, ...

3847 sym 2 img

On Nested Models

26.04.2016

We have recently been working on and presenting on nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nes...

6811 sym R (2351 sym/7 pcs) 4 img

vtreat cross frames

05.05.2016

John Mount, Nina Zumel, 2016-05-05. As a follow-on to “On Nested Models” we work R examples demonstrating “cross validated training frames” (or “cross frames”) in vtreat. Consider the following data frame. The outcome only depends on the “good” variables, not on the (high degree of freedom) “bad” variables. ...

4896 sym R (9416 sym/27 pcs) 4 img