Publications by John Mount

New R package: replyr (get a grip on remote dplyr data services)

22.11.2016

It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at ...

vtreat data cleaning and preparation article now available on arXiv

30.11.2016

Nina Zumel and I are happy to announce a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP]. vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares var...

Be careful evaluating model predictions

02.12.2016

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound...

Parametric variable names and dplyr

03.12.2016

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatmen...

The case for index-free data manipulation

10.12.2016

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation....

The Case For Using -> In R

12.12.2016

R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>”, which have different semantics). R style guides routinely insist on “<-” as the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. Don Quijote and Sancho Pan...

magrittr’s Doppelgänger

13.12.2016

R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice. If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a magrittr pi...

Organize your data manipulation in terms of “grouped ordered apply”

15.12.2016

Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It also is not likely e...

help(let, package='replyr')

17.12.2016

A bit more on our replyr R package. library("replyr") help(let, package='replyr') let {replyr} R Documentation Prepare expr for execution with name substitutions specified in alias. Description replyr::let implements a mapping from desired names (names used directly in the expr code) to names used in the data. Mnemonic: “expr code symbol...

Comparative examples using replyr::let

22.12.2016

Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier. Archie’s Mechanics #2 (1954) copyright Archie Publications (ed...
