Publications by John Mount

wrapr 1.5.0 available on CRAN

13.06.2018

The R package wrapr 1.5.0 is now available on CRAN. wrapr includes a lot of tools for writing better R code: let() (let block); %.>% (dot arrow pipe); build_frame() / draw_frame() (data.frame builders and formatters); qc() (quoting concatenate); := (named map builder); %?% (coalesce); NEW! %.|% (reduce/expand args); NEW! uniques() (safe unique() repl...

1381 sym R (143 sym/2 pcs)

R Tip: Be Wary of “…”

15.06.2018

R Tip: be wary of “...”. The following code example contains an easy error in using the R function unique(): vec1 <- c("a", "b", "c"); vec2 <- c("c", "d"); unique(vec1, vec2) # [1] "a" "b" "c". Notice that none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with multiple value argumen...

5517 sym R (578 sym/8 pcs)
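The pitfall in the excerpt above can be reproduced in base R. The following is a minimal sketch, not the post's own code: the names wrong and right are illustrative. It relies on the documented signature unique(x, incomparables = FALSE, ...), under which a second positional argument is matched to incomparables rather than being treated as more values.

```r
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")

# Mistake: vec2 is matched to the `incomparables` parameter of unique(),
# so its values are never added to the result.
wrong <- unique(vec1, vec2)

# One way to get the intended result: combine the vectors first,
# then take the unique values of the single combined vector.
right <- unique(c(vec1, vec2))

print(wrong)
print(right)
```

Combining with c() before calling unique() keeps all data in the single x argument, so nothing is silently absorbed by another parameter.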

Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

20.06.2018

We here at Win-Vector LLC have some really big news that we would like the R community’s help in sharing. vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark. vtreat is a very complete and rigorous tool for prepa...

1384 sym

seplyr 0.5.8 Now Available on CRAN

02.07.2018

We are pleased to announce that seplyr version 0.5.8 is now available on CRAN. seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. ...

4055 sym R (2417 sym/6 pcs)

Speed up your R Work

08.07.2018

Introduction: In this note we will show how to speed up work in R by partitioning data and using process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages. For each of the above packages we speed up work by using wrapr::ex...

3743 sym R (10959 sym/63 pcs) 2 img
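The general idea of partitioning data and processing the pieces in separate R processes can be sketched with base R's parallel package. This is only an illustration under stated assumptions: the data frame d, the helper summarize_piece(), and the choice of two worker processes are all hypothetical, and the sketch does not use the wrapr, rqdatatable, data.table, or dplyr code from the post itself.

```r
library(parallel)

# Hypothetical example data: 1000 rows spread over 10 groups.
d <- data.frame(group = rep(1:10, each = 100), value = seq_len(1000))

# Partition the data so that every group lands in exactly one piece.
pieces <- split(d, d$group)

# Hypothetical per-partition work; a stand-in for the real computation.
summarize_piece <- function(piece) {
  data.frame(group = piece$group[[1]], total = sum(piece$value))
}

# Start two worker processes (an assumed count), process the pieces in
# parallel, and shut the cluster down even if an error occurs.
cl <- makeCluster(2)
results <- tryCatch(
  parLapply(cl, pieces, summarize_piece),
  finally = stopCluster(cl)
)

# Reassemble the per-partition results into one data frame.
combined <- do.call(rbind, results)
print(combined)
```

The partitioning step matters for correctness: because each group is wholly contained in one piece, per-group results computed on separate workers can simply be bound back together.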

John Mount speaking on rquery and rqdatatable

11.07.2018

rquery and rqdatatable are new R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. They speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks. Win-Vector LLC’s John Mount will be speakin...

1587 sym 4 img

How to use rquery with Apache Spark on Databricks

26.07.2018

A big thank you to Databricks for working with us and sharing: “rquery: Practical Big Data Transforms for R-Spark Users”; “How to use rquery with Apache Spark on Databricks”. rquery on Databricks is a great data science tool.

622 sym 2 img

Collecting Expressions in R

05.08.2018

Not a full R article, but a quick note demonstrating by example the advantage of being able to collect many expressions and pack them into a single extend_se() node. This example may seem extreme or unnatural. However, we have seen that once you expose a system to enough users, you see many more extreme use cases than you would at first expect. We hav...

2082 sym R (5658 sym/25 pcs) 2 img

Meta-packages, nails in CRAN’s coffin

07.08.2018

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”. This got me thinking about the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact...

1910 sym 6 img

data.table is Really Good at Sorting

13.08.2018

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes. The graph is using a log-log scale (so things are very compressed). But data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below. Notice on the above semi-log plot the run time ratio is gr...

1445 sym 6 img