Publications by John Mount

Base R can be Fast

15.01.2018

“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are faster than R code.” The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up t...

2260 sym 2 img

Data Reshaping with cdata

17.01.2018

I’ve just shared a short webcast on data reshaping in R using the cdata package. We also have two really nifty articles on the theory and methods: “Fluid data reshaping with cdata” and “Coordinatized Data: A Fluid Data Specification”. Please give it a try! This is the material I recently presented at the January 2017 BARUG Meetup. Related To...

738 sym 2 img

Advisory on Multiple Assignment dplyr::mutate() on Databases

21.01.2018

I currently advise R dplyr users to take care when using multiple assignment dplyr::mutate() commands on databases. (image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License) In this note I exhibit a troublesome example, and a systematic solution. First let’s set up dplyr, our database, and some example data. library(...

2977 sym R (1723 sym/11 pcs) 2 img 2 tbl

Latest vtreat up on CRAN

24.01.2018

There is a new version of the R package vtreat now up on CRAN. vtreat is an essential data preparation system for predictive modeling that helps defend your predictive modeling work against real world data issues including: high cardinality categorical variables; rare levels (including new or novel levels during application) in categorical variab...

1331 sym

Supercharge your R code with wrapr

27.01.2018

I would like to demonstrate some helpful wrapr R notation tools that really neaten up your R code. (Image: Christopher Ziemnowicz.) Named Map Builder: first I will demonstrate wrapr’s “named map builder”: :=. The named map builder adds names to vectors and lists using a nice “names on the left and values on the right” notation. For example to b...

8390 sym R (4316 sym/34 pcs) 2 img 2 tbl

Why No Exact Permutation Tests at Scale?

01.02.2018

Here at Win-Vector LLC we like permutation tests. Our team has written on them (for example: How Do You Know if Your Data Has Signal?) and they are used to estimate significances in our sigr and WVPlots R packages. For example permutation methods are used to estimate the significance reported in the following ROC plot. Permutation tests have t...

6631 sym R (264 sym/1 pcs) 2 img

Is 10,000 Cells Big?

12.02.2018

Trick question: is a 10,000 cell numeric data.frame big or small? In the era of “big data” 10,000 cells is minuscule. Such data could fit on fewer than 1,000 punched cards (or less than half a box). The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later. Example: let’s look at ...

2642 sym R (3119 sym/37 pcs) 4 img

R Tip: Use qc() For Fast Legible Quoting

17.02.2018

Here is an R tip. Need to quote a lot of names at once? Use qc(). This is particularly useful in selecting columns from data.frames:

    library("wrapr") # get qc() definition
    head(mtcars[, qc(mpg, cyl, wt)])
    #                mpg cyl    wt
    # Mazda RX4     21.0   6 2.620
    # Mazda RX4 Wag 21.0   6 2.875
    # Datsun 710    22.8   4 2.320...

579 sym R (450 sym/2 pcs)

R Tip: Use seq_len() to Avoid The Backwards Sequence Bug

19.02.2018

Another R tip. Use seq_len() to avoid the backwards sequence bug. Many R users use the “colon sequence” notation to build sequences. For example:

    for(i in 1:5) { print(paste(i, i*i)) }
    #> [1] "1 1"
    #> [1] "2 4"
    #> [1] "3 9"
    #> [1] "4 16"
    #> [1] "5 25"

However, the colon notation can be unsafe as it does not properly handle the empty sequence case: n <...

857 sym R (178 sym/3 pcs)
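
The empty-sequence problem described in the seq_len() tip above can be sketched in base R as follows. This is a minimal illustration, not the code from the original post: the helper names count_iterations_colon and count_iterations_seq_len are hypothetical and exist only to count how many times each loop body runs.

```r
# Hypothetical helper: count loop iterations when the sequence is built with 1:n.
count_iterations_colon <- function(n) {
  count <- 0
  for (i in 1:n) {
    count <- count + 1
  }
  count
}

# Hypothetical helper: the same loop, with the sequence built by seq_len(n).
count_iterations_seq_len <- function(n) {
  count <- 0
  for (i in seq_len(n)) {
    count <- count + 1
  }
  count
}

n <- 0
print(1:n)        # 1:0 counts backwards and yields the two values 1 and 0
print(seq_len(n)) # seq_len(0) yields an empty integer vector

colon_count <- count_iterations_colon(n)   # loop body runs twice
safe_count <- count_iterations_seq_len(n)  # loop body never runs
print(c(colon = colon_count, seq_len = safe_count))
```

With n equal to zero the colon version still runs the loop body twice, while the seq_len() version skips the loop entirely, which is usually what the author of the loop intended.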

R Tip: Use [[ ]] Wherever You Can

21.02.2018

R tip: use [[ ]] wherever you can. In R the [[ ]] is the operator that (when supplied a scalar argument) pulls a single element out of lists (and the [ ] operator pulls out sub-lists). For vectors [[ ]] and [ ] appear to be synonyms. However, when writing reusable code you may not always be sure if your code is going to be applied to a vector or...

1432 sym R (125 sym/2 pcs)
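
The difference between [[ ]] and [ ] described in the tip above can be sketched in base R. This is an illustrative example under stated assumptions, not code from the original post: the functions first_plus_one and first_plus_one_bracket are hypothetical names introduced here.

```r
v <- c(a = 1, b = 2)     # a named numeric vector
l <- list(a = 1, b = 2)  # a list with the same names and values

# On a vector the two operators look nearly the same
# ([ ] keeps the name, [[ ]] drops it).
print(v["a"])
print(v[["a"]])

# On a list, [ ] returns a sub-list while [[ ]] returns the element itself.
print(class(l["a"]))    # "list"
print(class(l[["a"]]))  # "numeric"

# Hypothetical reusable function written with [[ ]]:
# it works for both vectors and lists.
first_plus_one <- function(x) x[[1]] + 1

# The same function written with [ ] fails on lists,
# because a list cannot be added to a number.
first_plus_one_bracket <- function(x) x[1] + 1

from_vector <- first_plus_one(v)
from_list <- first_plus_one(l)
bracket_on_list <- tryCatch(
  first_plus_one_bracket(l),
  error = function(e) "error"
)
print(list(from_vector, from_list, bracket_on_list))
```

Writing the element access with [[ ]] makes the function behave the same way whether it is handed a vector or a list, which is the point of the tip for reusable code.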