Publications by John Mount
wrapr 1.5.0 available on CRAN
The R package wrapr 1.5.0 is now available on CRAN. wrapr includes a lot of tools for writing better R code:
let() (let block)
%.>% (dot arrow pipe)
build_frame() / draw_frame() (data.frame builders and formatters)
qc() (quoting concatenate)
:= (named map builder)
%?% (coalesce) NEW!
%.|% (reduce/expand args) NEW!
uniques() (safe unique() repl...
1381 sym R (143 sym/2 pcs)
R Tip: Be Wary of “…”
R Tip: be wary of "...". The following code example contains an easy error in using the R function unique().

    vec1 <- c("a", "b", "c")
    vec2 <- c("c", "d")
    unique(vec1, vec2)
    # [1] "a" "b" "c"

Notice none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with multiple value argumen...
5517 sym R (578 sym/8 pcs)
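As a small illustration of the unique() tip above, the sketch below contrasts the mistaken call with two base-R ways of getting the distinct values of both vectors. It reuses vec1 and vec2 from the excerpt; the names wrong and right are introduced here only for illustration.

```r
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")

# Mistaken call from the excerpt: vec2 is not treated as additional values,
# so only the distinct values of vec1 come back.
wrong <- unique(vec1, vec2)

# Combine the vectors first, then take the distinct values.
right <- unique(c(vec1, vec2))

# union() is the base-R function that accepts two value arguments.
also_right <- union(vec1, vec2)

print(wrong)
print(right)
```

Concatenating with c() before calling unique() keeps the intent explicit, which is the habit the tip argues for.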
Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable
We here at Win-Vector LLC have some really big news that we would like the R community's help in sharing. vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark. vtreat is a very complete and rigorous tool for prepa...
1384 sym
seplyr 0.5.8 Now Available on CRAN
We are pleased to announce that seplyr version 0.5.8 is now available on CRAN. seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. ...
4055 sym R (2417 sym/6 pcs)
Speed up your R Work
Introduction: In this note we will show how to speed up work in R by partitioning data and using process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages. For each of the above packages we speed up work by using wrapr::ex...
3743 sym R (10959 sym/63 pcs) 2 img
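The general idea of partitioning data and then parallelizing at the process level can be sketched in base R alone. This is a minimal sketch under stated assumptions, not the method of the article: it uses the base `parallel` package rather than the wrapr function named in the truncated excerpt, and the data frame d, the helper summarize_part(), and the cluster size of 2 are all hypothetical.

```r
library(parallel)

# Hypothetical example data: 4 groups of 250 rows each.
set.seed(2018)
d <- data.frame(
  group = rep(c("a", "b", "c", "d"), each = 250),
  x = runif(1000),
  stringsAsFactors = FALSE
)

# Hypothetical per-partition work: summarize the rows of one group.
summarize_part <- function(part) {
  data.frame(
    group = part$group[[1]],
    mean_x = mean(part$x),
    n = nrow(part),
    stringsAsFactors = FALSE
  )
}

# Step 1: partition the data so each piece can be processed independently.
parts <- split(d, d$group)

# Step 2: process the partitions in separate worker processes.
cl <- makeCluster(2)  # assumed worker count; tune to the machine
res_list <- parLapply(cl, parts, summarize_part)
stopCluster(cl)

# Step 3: bind the per-partition results back into one data frame.
result <- do.call(rbind, res_list)
rownames(result) <- NULL
print(result)
```

The approach only pays off when the per-partition work is large compared with the cost of shipping each partition to a worker process.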
John Mount speaking on rquery and rqdatatable
rquery and rqdatatable are new R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. They speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks. Win-Vector LLC's John Mount will be speakin...
1587 sym 4 img
How to use rquery with Apache Spark on Databricks
A big thank you to Databricks for working with us and sharing: "rquery: Practical Big Data Transforms for R-Spark Users: How to use rquery with Apache Spark on Databricks". rquery on Databricks is a great data science tool.
622 sym 2 img
Collecting Expressions in R
Not a full R article, but a quick note demonstrating by example the advantage of being able to collect many expressions and pack them into a single extend_se() node. This example may seem extreme or unnatural. However, we have seen that once you expose a system to enough users, you see many more extreme use cases than you would at first expect. We hav...
2082 sym R (5658 sym/25 pcs) 2 img
Meta-packages, nails in CRAN’s coffin
Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”. This got me thinking about the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact...
1910 sym 6 img
data.table is Really Good at Sorting
The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes. The graph is using a log-log scale (so things are very compressed). But data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below. Notice on the above semi-log plot the run time ratio is gr...
1445 sym 6 img