Publications by John Mount
Base R can be Fast
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are faster than R code.” The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up t...
2260 sym 2 img
Data Reshaping with cdata
I’ve just shared a short webcast on data reshaping in R using the cdata package. (link) We also have two really nifty articles on the theory and methods: “Fluid data reshaping with cdata” and “Coordinatized Data: A Fluid Data Specification”. Please give it a try! This is the material I recently presented at the January 2017 BARUG Meetup. Related To...
738 sym 2 img
Advisory on Multiple Assignment dplyr::mutate() on Databases
I currently advise R dplyr users to take care when using multiple assignment dplyr::mutate() commands on databases. (image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License) In this note I exhibit a troublesome example, and a systematic solution. First let’s set up dplyr, our database, and some example data. library(...
2977 sym R (1723 sym/11 pcs) 2 img 2 tbl
Latest vtreat up on CRAN
There is a new version of the R package vtreat now up on CRAN. vtreat is an essential data preparation system for predictive modeling that helps defend your predictive modeling work against real world data issues including: high cardinality categorical variables; rare levels (including new or novel levels during application) in categorical variab...
1331 sym
Supercharge your R code with wrapr
I would like to demonstrate some helpful wrapr R notation tools that really neaten up your R code. Img: Christopher Ziemnowicz. Named Map Builder: First I will demonstrate wrapr's “named map builder”: :=. The named map builder adds names to vectors and lists using a convenient “names on the left and values on the right” notation. For example to b...
8390 sym R (4316 sym/34 pcs) 2 img 2 tbl
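For readers without wrapr installed, the "names on the left, values on the right" idea can be approximated in base R. The sketch below uses stats::setNames() as a stand-in; it is not wrapr's := operator, and the variable names (keys, values) are chosen here only for illustration.

```r
# Names held in variables: the plain c(a = 1) syntax cannot use them,
# because the text left of "=" is taken literally as the name.
keys <- c("a", "b")
values <- c(1, 2)

# Base R analogue (an assumption for illustration, not wrapr's operator):
# attach the names in `keys` to the values in `values`.
named_vector <- stats::setNames(values, keys)

# The same call works for lists.
named_list <- stats::setNames(list(1, "x"), c("n", "s"))

print(named_vector)
print(names(named_list))
```

The design point is the same as in the excerpt: the names can be computed at run time instead of being typed literally into the call.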
Why No Exact Permutation Tests at Scale?
Here at Win-Vector LLC we like permutation tests. Our team has written on them (for example: How Do You Know if Your Data Has Signal?) and they are used to estimate significances in our sigr and WVPlots R packages. For example permutation methods are used to estimate the significance reported in the following ROC plot. Permutation tests have t...
6631 sym R (264 sym/1 pcs) 2 img
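As a rough illustration of the topic, here is a minimal base R sketch of a sampled (Monte Carlo) permutation test for a difference in means, alongside a count of how many relabelings an exact test would have to visit. The function name perm_test, the synthetic data, and the choice of statistic are all assumptions made for this sketch; it is not the code used in sigr or WVPlots.

```r
# Sampled permutation test for a difference in group means.
# perm_test is a hypothetical helper written for this illustration.
perm_test <- function(x, y, n_perm = 2000) {
  observed <- abs(mean(x) - mean(y))
  pooled <- c(x, y)
  idx_x <- seq_len(length(x))
  perm_stats <- replicate(n_perm, {
    shuffled <- sample(pooled)  # random relabeling of the pooled data
    abs(mean(shuffled[idx_x]) - mean(shuffled[-idx_x]))
  })
  # Add-one correction so the estimated p-value is never exactly zero.
  (sum(perm_stats >= observed) + 1) / (n_perm + 1)
}

set.seed(2018)
x <- rnorm(30, mean = 0)
y <- rnorm(30, mean = 1.5)
p_value <- perm_test(x, y)

# An exact test would enumerate every way to choose which 30 of the
# 60 pooled values are labeled "x".
n_exact <- choose(60, 30)

print(p_value)
print(n_exact)
```

Even at 30 observations per group the exact enumeration count is astronomically large, which is one general reason sampled permutations are used in practice.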
Is 10,000 Cells Big?
Trick question: is a 10,000 cell numeric data.frame big or small? In the era of “big data” 10,000 cells is minuscule. Such data could be fit on fewer than 1,000 punched cards (or less than half a box). The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later. Example Let’s look at ...
2642 sym R (3119 sym/37 pcs) 4 img
R Tip: Use qc() For Fast Legible Quoting
Here is an R tip. Need to quote a lot of names at once? Use qc(). This is particularly useful in selecting columns from data.frames: library("wrapr") # get qc() definition head(mtcars[, qc(mpg, cyl, wt)]) # mpg cyl wt # Mazda RX4 21.0 6 2.620 # Mazda RX4 Wag 21.0 6 2.875 # Datsun 710 22.8 4 2.320...
579 sym R (450 sym/2 pcs)
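To see how unquoted names can be turned into strings at all, here is a small base R sketch. The function quote_names is hypothetical and written only for this illustration; it is not wrapr's implementation of qc(), which may handle more cases.

```r
# Hypothetical stand-in for the quoting idea (not wrapr::qc itself):
# capture the unevaluated arguments and convert each one to a string.
quote_names <- function(...) {
  args <- as.list(substitute(list(...)))[-1]  # drop the leading `list`
  vapply(args, deparse, character(1))
}

cols <- quote_names(mpg, cyl, wt)
print(cols)

# Use the resulting character vector to select data.frame columns.
selected <- mtcars[, cols]
print(head(selected))
```

The arguments are never evaluated, so no variables named mpg, cyl, or wt need to exist in the calling environment.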
R Tip: Use seq_len() to Avoid The Backwards Sequence Bug
Another R tip. Use seq_len() to avoid the backwards sequence bug. Many R users use the “colon sequence” notation to build sequences. For example: for(i in 1:5) { print(paste(i, i*i)) } #> [1] "1 1" #> [1] "2 4" #> [1] "3 9" #> [1] "4 16" #> [1] "5 25" However, the colon notation can be unsafe as it does not properly handle the empty sequence case: n <...
857 sym R (178 sym/3 pcs)
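The empty-sequence problem can be checked directly in base R. In the sketch below, the helper count_iterations is a hypothetical name introduced only to count how many times a loop body would run.

```r
# Hypothetical helper: count how many times a for-loop over `index` runs.
count_iterations <- function(index) {
  count <- 0
  for (i in index) {
    count <- count + 1
  }
  count
}

n <- 0

# 1:n with n = 0 counts backwards from 1 to 0, so the loop body runs.
colon_runs <- count_iterations(1:n)

# seq_len(0) is an empty integer vector, so the loop body is skipped.
safe_runs <- count_iterations(seq_len(n))

print(colon_runs)
print(safe_runs)
```

A related base R function, seq_along(x), plays the same role when looping over the positions of an existing vector.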
R Tip: Use [[ ]] Wherever You Can
R tip: use [[ ]] wherever you can. In R, [[ ]] is the operator that (when supplied a scalar argument) pulls a single element out of a list, while the [ ] operator pulls out sub-lists. For vectors [[ ]] and [ ] appear to be synonyms. However, when writing reusable code you may not always be sure if your code is going to be applied to a vector or...
1432 sym R (125 sym/2 pcs)
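The difference between the two operators is easy to see on small examples. The sketch below uses only base R; the objects lst and vec are made up for illustration.

```r
lst <- list(a = 1, b = "two")
vec <- c(a = 1, b = 2)

# On a list, [[ ]] extracts the element itself, [ ] returns a sub-list.
element_from_list <- lst[["a"]]
sub_list <- lst["a"]

# On an atomic vector both return a length-one vector,
# but [[ ]] drops the name and [ ] keeps it.
element_from_vec <- vec[["a"]]
sub_vec <- vec["a"]

# [[ ]] refuses to select several elements from an atomic vector,
# which surfaces mistakes early.
multi_select_fails <- inherits(try(vec[[1:2]], silent = TRUE), "try-error")

print(is.list(sub_list))
print(names(sub_vec))
print(multi_select_fails)
```

Writing [[ ]] whenever a single element is intended means the code behaves the same whether it is handed a list or a vector.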