Publications by John Mount
New R package: replyr (get a grip on remote dplyr data services)
It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at ...
vtreat data cleaning and preparation article now available on arXiv
Nina Zumel and I are happy to announce a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP]. vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares var...
Be careful evaluating model predictions
One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound...
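The point about correlation can be seen in a small base R sketch. The data and the `rmse` helper below are hypothetical, made up for illustration only; they are not taken from the article. A prediction that is off by a constant factor still correlates perfectly with the outcome, while an error measure computed on the values as given shows it is far from usable.

```r
# Hypothetical outcomes and two sets of predictions (illustrative only).
y <- c(1, 2, 3, 4, 5)
pred_close  <- y + c(0.1, -0.1, 0.2, -0.2, 0)  # near the truth
pred_scaled <- 10 * y                          # right shape, wrong units

# Root mean squared error: scores the prediction as it stands,
# with no re-scaling allowed (hypothetical helper).
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# Correlation is essentially perfect for the mis-scaled prediction...
cor_scaled <- cor(y, pred_scaled)
cor_close  <- cor(y, pred_close)

# ...but RMSE shows that the mis-scaled prediction is much worse.
rmse_scaled <- rmse(y, pred_scaled)
rmse_close  <- rmse(y, pred_close)

print(c(cor_scaled = cor_scaled, cor_close = cor_close))
print(c(rmse_scaled = rmse_scaled, rmse_close = rmse_close))
```

Here correlation ranks the mis-scaled prediction at least as high as the close one, and RMSE reverses that ordering.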
Parametric variable names and dplyr
When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatmen...
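As a minimal base R sketch of what "parametric treatment of variables" means (this is not the dplyr approach the article goes on to discuss, and `mean_of_column` is a hypothetical helper named here for illustration), the column name can be passed as a string and looked up with `[[`:

```r
# Hypothetical helper: the caller supplies the column name as a string,
# so the function body never hard-codes a variable name.
mean_of_column <- function(d, col_name) {
  stopifnot(col_name %in% names(d))  # fail early on an unknown column
  mean(d[[col_name]])
}

# The column to work over is only known at call time.
target <- "Sepal.Length"
result <- mean_of_column(iris, target)
print(result)
```

The same function works unchanged for any other numeric column, for example `mean_of_column(iris, "Petal.Width")`.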
The case for index-free data manipulation
Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation....
The Case For Using -> In R
R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>” which have different semantics). The R-style guides routinely insist on “<-” as being the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. Don Quijote and Sancho Pan...
magrittr’s Doppelgänger
R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice. If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a magrittr pi...
Organize your data manipulation in terms of “grouped ordered apply”
Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It also is not likely e...
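For orientation, the per-Species rank described above can be sketched in base R with `ave()`. This is one possible approach under stated assumptions, not necessarily the "grouped ordered apply" pattern the article develops; the column name `rank_in_species` and the choice of `ties.method = "min"` are assumptions made for this example.

```r
# Copy the built-in iris data so the original stays untouched.
ranked <- iris

# ave() applies FUN within each Species group and returns the results
# in the original row order, so they can be assigned back as a column.
ranked$rank_in_species <- ave(
  ranked$Sepal.Length,
  ranked$Species,
  FUN = function(x) rank(x, ties.method = "min")
)

# Show the lowest-ranked rows of each group.
ordered <- ranked[order(ranked$Species, ranked$rank_in_species),
                  c("Species", "Sepal.Length", "rank_in_species")]
print(head(ordered))
```

With `ties.method = "min"`, tied Sepal.Length values share the smallest rank of the tie, so the smallest value in every Species always gets rank 1.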
help(let, package='replyr')
A bit more on our replyr R package. library("replyr") help(let, package='replyr') let {replyr} R Documentation Prepare expr for execution with name substitutions specified in alias. Description replyr::let implements a mapping from desired names (names used directly in the expr code) to names used in the data. Mnemonic: “expr code symbol...
Comparative examples using replyr::let
Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier. Archie’s Mechanics #2 (1954) copyright Archie Publications (ed...