Publications by John Mount
New screencast: using R and RStudio to install and experiment with Apache Spark
I have a new short screencast up: using R and RStudio to install and experiment with Apache Spark. More material from my recent Strata workshop, Modeling big data with R, sparklyr, and Apache Spark, can be found here.
616 sym
Another R [Non-]Standard Evaluation Idea
Jonathan Carroll had an interesting R language idea: to use @-notation to request value substitution in a non-standard evaluation environment (inspired by MySQL User-Defined Variables). He even picked the right image: The idea is kind of the reverse of some Lisp ideas (“evaled unless ticked”), but an interesting possibility. We can play alon...
2424 sym R (679 sym/4 pcs) 2 img
Practical Data Science with R: ACM SIGACT News Book Review and Discount!
Our book Practical Data Science with R has just been reviewed in Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory (ACM SIGACT) News by Dr. Allan M. Miller (U.C. Berkeley)! The book is half off at Manning from March 21st 2017 using the following code (please share/Tweet): Deal of the Day March 21: Ha...
3389 sym 2 img
Datashader is a big deal
I recently got back from Strata West 2017 (where I ran a very well received workshop on R and Spark). One thing that really stood out for me at the exhibition hall was Bokeh plus datashader from Continuum Analytics. I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a few of my questions. I am so excited ab...
5429 sym R (574 sym/1 pcs) 16 img
Debugging Pipelines in R with Bizarro Pipe and Eager Assignment
This is a note on debugging magrittr pipelines in R using Bizarro Pipe and eager assignment. Pipes in R The magrittr R package supplies an operator called “pipe” which is written as “%>%”. The pipe operator is partly famous due to its extensive use in dplyr and use by dplyr users. The pipe operator is roughly described as allowing one t...
5432 sym R (2608 sym/8 pcs) 8 img
Coordinatized Data: A Fluid Data Specification
Authors: John Mount and Nina Zumel. Introduction It’s been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting). Boris Artzybasheff illustration Real trus...
14344 sym R (12848 sym/59 pcs) 18 img
Visualizing relational joins
I want to discuss a nice series of figures used to teach relational join semantics in R for Data Science by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join: Please read on for my discussion of this diagram and teaching joins. Teaching joins In the above diagram two tables are...
3079 sym 32 img
Encoding categorical variables: one-hot and beyond
(or: how to correctly use xgboost from R) R has “one-hot” encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere. For example we can see evidence of one-hot encoding in the variable names chosen by a linear regressi...
5854 sym R (9424 sym/37 pcs) 6 img
Programming over R
R is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with. ...
4547 sym R (511 sym/5 pcs) 2 img
Why to use wrapr::let()
I have written about referential transparency before. In this article I would like to discuss “leaky abstractions” and why wrapr::let() supplies a useful (but leaky) abstraction for R programmers. Abstractions A common definition of an abstraction is (from the OSX dictionary): the process of considering something independently of its associ...
6326 sym R (1006 sym/6 pcs) 4 img