Publications by John Mount

More Practical Data Science with R Book News

19.08.2018

Some more Practical Data Science with R news. Practical Data Science with R is the book we wish we had when we started in data science. Practical Data Science with R, Second Edition is the revision of that book with the packages we wish had been available at that time (in particular vtreat, cdata, and wrapr). A second edition also lets us ...

1222 sym

R Tip: Consider radix Sort

21.08.2018

R tip: consider using radix sort. The method = "radix" option can greatly speed up sorting and ordering tables in R. For a 1 million row table the speedup is already as much as 35 times (around 9.6 seconds versus 3 tenths of a second). Below is an excerpt from a sorting experiment showing default settings and radix sort (full cod...

361 sym R (604 sym/2 pcs)
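The radix tip above can be tried with a minimal sketch like the following. The table, its column names (key, value), and the row count are made up for illustration and are not the article's experiment, so timings will differ from the figures quoted above. Note that base R documents method = "radix" as using C-locale ordering for character vectors, which can differ from locale-aware ordering on mixed-case or accented text.

```r
# Illustrative data only: 'key' and 'value' are hypothetical column names.
set.seed(2018)
n <- 100000
d <- data.frame(
  key = sample(letters, n, replace = TRUE),
  value = runif(n),
  stringsAsFactors = FALSE
)

# Ordering permutation with the default method selection.
t_default <- system.time(p_default <- order(d$key, d$value))

# Ordering permutation with radix sort requested explicitly.
t_radix <- system.time(p_radix <- order(d$key, d$value, method = "radix"))

# Elapsed seconds for each approach (machine dependent).
print(c(default = t_default[["elapsed"]], radix = t_radix[["elapsed"]]))

# Apply the permutation to sort the whole table.
d_sorted <- d[p_radix, , drop = FALSE]
```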

Timings of a Grouped Rank Filter Task

23.08.2018

Introduction This note shares an experiment comparing the performance of a number of data processing systems available in R. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task. Comparisons First let...

6814 sym R (724 sym/5 pcs) 8 img 3 tbl
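The grouped rank filter task described above can be sketched in base R as follows. This is one possible approach, not necessarily one of the systems timed in the article. The column names (g1, g2, g3, score) and data are hypothetical, and "top ranking" is assumed here to mean the largest value of the numeric column.

```r
# Hypothetical example data: three string grouping columns, one numeric column.
d <- data.frame(
  g1 = c("a", "a", "a", "b", "b"),
  g2 = c("x", "x", "y", "x", "x"),
  g3 = c("p", "p", "p", "q", "q"),
  score = c(1, 3, 2, 5, 4),
  stringsAsFactors = FALSE
)

# Sort by group, then by descending score, so each group's best row comes first.
ord <- order(d$g1, d$g2, d$g3, -d$score, method = "radix")
d_sorted <- d[ord, , drop = FALSE]

# Keep only the first row seen for each (g1, g2, g3) combination.
is_first <- !duplicated(d_sorted[, c("g1", "g2", "g3")])
top <- d_sorted[is_first, , drop = FALSE]
rownames(top) <- NULL

print(top)
```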

R Tip: Put Your Values in Columns

29.08.2018

Today’s R tip is: put your values in columns. Some R users use different seemingly clever tricks to bring data to an analysis. Here is an (artificial) example.

    chamber_sizes <- mtcars$disp/mtcars$cyl
    form <- hp ~ chamber_sizes
    model <- lm(form, data = mtcars)
    print(model)
    # Call:
    # lm(formula = form, data = mtcars)
    #
    # Coefficients:
    # (Inter...

4380 sym R (681 sym/3 pcs)
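A minimal sketch of the "values in columns" alternative, assuming the point of the tip is that derived values should live in the data frame rather than in a free variable. The column name chamber_size and the new-data values are chosen here for illustration.

```r
# Work on a copy so the built-in mtcars data is left unchanged.
d <- mtcars

# Store the derived value as a column instead of a free-standing vector.
d$chamber_size <- d$disp / d$cyl

# The formula now refers only to columns of the data frame.
model <- lm(hp ~ chamber_size, data = d)
print(model)

# Because the variable is a column, prediction on new data is straightforward.
newdata <- data.frame(chamber_size = c(20, 40))
pred <- predict(model, newdata = newdata)
print(pred)
```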

R tip: How to Pass a formula to lm

01.09.2018

R tip: how to pass a formula to lm(). Often when modeling in R one wants to build up a formula outside of the modeling call. This allows the set of columns being used to be passed around as a vector of strings, and treated as data. Being able to treat controls (such as the set of variables to use) as manipulable values allows for very powerful...

4547 sym
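Building a formula from a vector of strings, as described above, might look like the following sketch. It uses base R's as.formula() and reformulate(); the excerpt is truncated, so this may not be the exact method the article recommends. The outcome and variable names are chosen for illustration from mtcars.

```r
# The modeling controls are ordinary values that can be passed around.
outcome <- "hp"
variables <- c("disp", "cyl", "wt")

# Option 1: paste the pieces together and convert the string to a formula.
f <- as.formula(
  paste(outcome, paste(variables, collapse = " + "), sep = " ~ ")
)

# Option 2: let stats::reformulate() assemble the same formula.
f2 <- reformulate(variables, response = outcome)

# Pass the formula object to lm() like any other argument.
model <- lm(f, data = mtcars)
print(f)
print(coef(model))
```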

R Tip: Give data.table a Try

08.09.2018

If your R or dplyr work is taking what you consider to be too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try data.table. For some tasks data.table is routinely faster than alternatives at pretty much all scales (example timings here). If your project is large (...

952 sym

A Quick Appreciation of the R transform Function

10.09.2018

R users who also use the dplyr package will be able to quickly understand the following code that adds an estimated area column to a data.frame.

    suppressPackageStartupMessages(library("dplyr"))
    iris %>%
      mutate(
        .,
        Petal.Area = (pi/4)*Petal.Width*Petal.Length) %>%
      head(.)
    ## Sepal.Length Sepal.Width Petal.Length Petal.Width Speci...

1108 sym R (1680 sym/6 pcs)
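For comparison, the same estimated-area column can be added with base R's transform(), the function the title refers to. This is a sketch of the idea; the variable name iris_area is chosen here and the article's own code may differ.

```r
# transform() evaluates the expression using the data frame's columns
# and returns a new data.frame with the extra column appended.
iris_area <- transform(
  iris,
  Petal.Area = (pi / 4) * Petal.Width * Petal.Length
)

head(iris_area)
```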

Practical Data Science with R2

12.09.2018

The secret is out: Nina Zumel and I are busy working on Practical Data Science with R2, the second edition of our best-selling book on learning data science using the R language. Our publisher, Manning, has a great slide deck describing the book (and a discount code!!!) here: We also just got back our part-1 technical review for the new book. H...

1952 sym 2 img

Announcing wrapr 1.6.2

12.09.2018

wrapr 1.6.2 is now up on CRAN. We have some neat new features for R users to try (in addition to many earlier wrapr goodies). The first is the %in_block% alternate notation for let(). The wrapr let()-block allows easy replacement of names in name-capturing interfaces (such as transform()), as we show below. library("wrapr") column_mapping <- ...

2451 sym R (3602 sym/14 pcs) 2 img

Better R Code with wrapr Dot Arrow

15.09.2018

Our R package wrapr supplies a “piping operator” that we feel is a real improvement in piped-style R coding. The idea is: with wrapr’s “dot arrow” pipe “%.>%” the expression “A %.>% B” is treated very much like “{. <- A; B}”. In particular this lets users think of “A %.>% B(.)” as a left-to-right way to write “B(A)” (i.e....

4783 sym R (2315 sym/30 pcs)
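The reading of "A %.>% B" as "{. <- A; B}" can be illustrated in base R without loading wrapr. The sketch below only emulates the described semantics with local() blocks and does not call the wrapr operator itself; the input values are made up.

```r
# Emulates "c(4, 9, 16) %.>% sqrt(.)": bind the left side to '.',
# then evaluate the right side. local() keeps '.' out of the workspace.
result <- local({
  . <- c(4, 9, 16)
  sqrt(.)
})
print(result)

# A two-stage pipeline, "A %.>% sqrt(.) %.>% sum(.)", nests the same pattern.
result2 <- local({
  . <- local({
    . <- c(4, 9, 16)
    sqrt(.)
  })
  sum(.)
})
print(result2)
```

With wrapr attached, the first stage would presumably be written as c(4, 9, 16) %.>% sqrt(.), following the notation in the excerpt above.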