new programming with data.table

baby steps creating handy functions with the new data.table programming interface
The newest version of data.table has hit CRAN, and there are lots of great new features.
Among them are a %notin% operator and a new let function that can be used instead of := (I wasn’t too fussed about this originally, but I have tried it a few times today and may well adopt it, although I do like that := really stands out in my code when assigning or updating variables).
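As a quick illustration of those two additions, here is a minimal sketch. It assumes data.table 1.15.0 or later, and the `toy` table and its columns are made up for the example rather than taken from this post.

```r
library(data.table)  # assumes data.table >= 1.15.0

# Hypothetical toy table, for illustration only
toy <- data.table(id = 1:5, grp = c("a", "b", "c", "a", "b"))

# %notin% is the negation of %in%: keep rows whose grp is neither "a" nor "c"
kept <- toy[grp %notin% c("a", "c")]

# let() is an alias for the functional form of := and adds a column by reference
toy[, let(id_doubled = id * 2L)]

# The equivalent assignment written with :=
toy[, `:=`(id_tripled = id * 3L)]
```

Both assignments modify `toy` in place, so which one to use is mostly a question of which reads better in your code.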
The big feature is the new programming interface. I have blogged about programming on data.table before, but things have moved on.
In my packages currently, I use get to look up columns by name (that, and a rather tortuous method of grabbing the original names, setting them to something else, then switching them back at the end). I no longer need to do this, which is particularly handy, as my {spccharter} package has been sitting dormant while awaiting this new programming approach.
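To make the contrast concrete, here is a small sketch of the two patterns side by side. The function names and the `toy` table are hypothetical, and it assumes data.table 1.15.0 or later, where a character scalar passed through env is turned into a column symbol.

```r
library(data.table)  # assumes data.table >= 1.15.0

# Hypothetical toy table, for illustration only
toy <- data.table(x = c(1, 2, 3), y = c(10, 20, 30))

# Older pattern: look the column up by name with get()
total_get <- function(dt, col) {
  dt[, .(total = sum(get(col)))]
}

# env pattern: `col` in j is replaced by the column named in the string
total_env <- function(dt, col) {
  dt[, .(total = sum(col)), env = list(col = col)]
}

total_get(toy, "y")
total_env(toy, "y")
```

Both calls should return the same one-row table; the env version simply substitutes the column name into the expression before it is evaluated.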
A few examples of making it work - first of all, a handy descending sort function, as I find myself doing this a lot.
library(data.table)
library(dplyr) # we'll mimic some of the dplyr examples later on
library(palmerpenguins) # about time I used this, I suppose
I’ll be honest, I was a bit lost with how to approach this, but Jan Gorecki saw my post and gave me a nudge in the right direction (not for the first time - thanks Jan!)
For posterity, here is my first attempt, which worked when supplying a quoted variable, but not an unquoted one.
sorted2 <- function(.DT = DT, x) {
  res <- .DT[, .N, .(V1 = x),
             env = list(x = x)][order(-N)]
  setnames(res, "V1", x)
  res
}
All I needed was to wrap x in substitute inside env, giving env = list(x = substitute(x)), so that an unquoted column name is captured as a symbol.
This is how it should have been:
descending_sort <- function(.DT, x) {
  .DT[, .N, x,
      env = list(x = substitute(x))
  ][order(-N)]
}
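To see what substitute(x) contributes, it helps to look at it outside data.table. This is a minimal base R sketch; `show_capture` is a hypothetical helper written for this illustration only.

```r
# show_capture() is a hypothetical helper, not part of data.table
show_capture <- function(x) substitute(x)

unquoted <- show_capture(species)    # the argument is captured, never evaluated
quoted   <- show_capture("species")  # a string literal comes back unchanged

class(unquoted)  # "name"
class(quoted)    # "character"
```

The unquoted call yields a symbol, which is the kind of value env can drop straight into the query in place of x.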
Let’s test it, and, for the first time on this blog, I’ll use palmerpenguins for an example:
pingu <- setDT(copy(palmerpenguins::penguins)) # I ain't typing "penguins" over and over
names(pingu) # have avoided this dataset for years so need a reminder of what's actually in it
descending_sort(pingu, species)
# species N
# <fctr> <int>
#1: Adelie 152
#2: Gentoo 124
#3: Chinstrap 68
That works for one variable. Here is a function that counts and sorts by any number of variables:
descending_group_sort <- function(.DT, ...) {
  vars <- eval(substitute(alist(...)),
               envir = parent.frame())
  .DT[,
      .N,
      by = vars,
      env = list(vars = substitute(vars))
  ][order(-N)]
}
descending_group_sort(pingu, flipper_length_mm, body_mass_g)
(You’ll have to trust me, I don’t want to paste 306 rows here)
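For anyone wondering what ends up in vars, this base R sketch isolates the capture step. `capture_dots` is a hypothetical helper that repeats the first line of the function above; nothing here touches data.table.

```r
# capture_dots() is a hypothetical helper that isolates the capture step
capture_dots <- function(...) {
  eval(substitute(alist(...)), envir = parent.frame())
}

vars <- capture_dots(flipper_length_mm, body_mass_g)

# vars is a plain list holding one symbol per argument
length(vars)
vapply(vars, class, character(1))
vapply(vars, as.character, character(1))
```

Because the arguments are captured as symbols rather than evaluated, the columns do not need to exist at the point of capture.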
Now let’s nick some examples from dplyr, and mimic them with our new data.table functionality:
## dplyr examples
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(),
              min = min({{ var }}),
              max = max({{ var }}))
}
mtcars %>%
group_by(cyl) %>%
var_summary(mpg)
# A tibble: 3 × 4
# cyl n min max
# <dbl> <int> <dbl> <dbl>
#1 4 11 21.4 33.9
#2 6 7 17.8 21.4
#3 8 14 10.4 19.2
And here’s the same with the new release of data.table:
mtc <- setDT(copy(mtcars)) # copy and turn it into a data.table
setorder(mtc, cyl) # ensure the order matches the dplyr results
var_summary_dt <- function(data, var, grp) {
  data[, .(n = .N,
           min = min(var),
           max = max(var)),
       .(grp),
       env = list(var = substitute(var),
                  grp = substitute(grp))]
}
var_summary_dt(mtc, mpg, cyl)
# cyl n min max
# <num> <int> <num> <num>
#1: 4 11 21.4 33.9
#2: 6 7 17.8 21.4
#3: 8 14 10.4 19.2
Looks good to me!
Again, all we had to do was add the calls to substitute in the env argument.
Here are some further dplyr examples - a summary function for one or more variables from the starwars dataset:
my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE),
              height = mean(height, na.rm = TRUE))
}
starwars %>% my_summarise(homeworld) # too many rows to print here
starwars %>% my_summarise(sex, gender)
# A tibble: 6 × 4
# Groups: sex [5]
# sex gender mass height
# <chr> <chr> <dbl> <dbl>
#1 female feminine 54.7 172.
#2 hermaphroditic masculine 1358 175
#3 male masculine 80.2 179.
#4 none feminine NaN 96
#5 none masculine 69.8 140
#6 NA NA 81 175
And the data.table equivalent:
starwars_dt <- setDT(copy(starwars))
my_summarise_dt <- function(.dt, ...) {
  vars <- eval(substitute(alist(...)),
               envir = parent.frame())
  .dt[, lapply(.SD, mean, na.rm = TRUE),
      .SDcols = c("mass", "height"),
      by = vars,
      env = list(vars = substitute(vars))][]
}
Let’s try it:
my_summarise_dt(.dt = starwars_dt, homeworld) # too many rows to print
my_summarise_dt(.dt = starwars_dt, sex, gender) # same as dplyr
# sex gender mass height
# <char> <char> <num> <num>
#1: male masculine 80.21905 179.1228
#2: none masculine 69.75000 140.0000
#3: female feminine 54.68889 171.5714
#4: hermaphroditic masculine 1358.00000 175.0000
#5: <NA> <NA> 81.00000 175.0000
#6: none feminine NaN 96.0000
Edit
Jan saw my post and suggested I mention the option to turn on verbose = TRUE. This can help when debugging issues with the env argument.
Here’s how that last function looks:
my_summarise_dt <- function(.dt, ...) {
  vars <- eval(substitute(alist(...)),
               envir = parent.frame())
  .dt[, lapply(.SD, mean, na.rm = TRUE),
      .SDcols = c("mass", "height"),
      by = vars,
      env = list(vars = substitute(vars)),
      verbose = TRUE][]
}
So you just add the verbose = TRUE argument before the closing square bracket of the query, after your env definition.
These are the results from calling my_summarise_dt(.dt = starwars_dt, sex, gender):
Argument 'by' after substitute: list(sex, gender)
Argument 'j' after substitute: lapply(.SD, mean, na.rm = TRUE)
Finding groups using forderv ... forder.c received 87 rows and 2 columns
0.000s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 6
0.000s elapsed (0.000s cpu)
lapply optimization changed j from 'lapply(.SD, mean, na.rm = TRUE)' to 'list(mean(mass, na.rm = TRUE), mean(height, na.rm = TRUE))'
GForce optimized j to 'list(gmean(mass, na.rm = TRUE), gmean(height, na.rm = TRUE))' (see ?GForce)
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.000
This gmean took (narm=TRUE) ... gather took ... 0.003s
0.004s
This gmean took (narm=TRUE) ... gather took ... 0.000s
0.001s
gforce eval took 0.005
0.010s elapsed (0.000s cpu)
sex gender mass height
<char> <char> <num> <num>
1: male masculine 80.21905 179.1228
2: none masculine 69.75000 140.0000
3: female feminine 54.68889 171.5714
4: hermaphroditic masculine 1358.00000 175.0000
5: <NA> <NA> 81.00000 175.0000
6: none feminine NaN 96.0000
The first line of the output shows by resolved to list(sex, gender), so we can see that our grouping columns are interpreted correctly.
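If you only want to check the substitution without running a query, data.table also exports substitute2(), which, as far as I understand it, applies the same rules as the env argument. The sketch below previews the j expression from var_summary_dt and a by expression built from column names; treat it as a debugging aid under that assumption rather than an official recipe.

```r
library(data.table)  # assumes data.table >= 1.15.0

# Preview the j expression with var mapped to the mpg column.
# A character scalar in env is converted to a symbol.
j_expr <- substitute2(
  .(n = .N, min = min(var), max = max(var)),
  env = list(var = "mpg")
)
print(j_expr)

# Preview a by expression built from a list of column names.
# A list in env is converted to a list() call.
by_expr <- substitute2(vars, env = list(vars = list("sex", "gender")))
print(by_expr)
```

Nothing is evaluated against a table here, so this is a cheap way to confirm that env produces the expression you expected.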
Disclaimer
I’m not sure about the use of eval(substitute(alist(...)), envir = parent.frame()). It works, but it may not be what the data.table devs intended (feel free to put me right), and there may be unintended consequences that have not yet smacked me in the face. That said, data.table’s error handling is so freakishly accurate (like, in the room watching over your shoulder) that any issues should be easy enough to solve.
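For comparison, here is one way to capture the dots without eval(). This is a sketch of an alternative, not a statement of what the data.table developers recommend; `count_by` and the `toy` table are hypothetical, and it assumes data.table 1.15.0 or later.

```r
library(data.table)  # assumes data.table >= 1.15.0

# count_by() is a hypothetical variant of descending_group_sort()
count_by <- function(.DT, ...) {
  # substitute(list(...)) gives an unevaluated list(...) call;
  # dropping its first element leaves a list of column symbols
  vars <- as.list(substitute(list(...)))[-1L]
  .DT[, .N, by = vars, env = list(vars = vars)][order(-N)]
}

# Hypothetical toy table, for illustration only
toy <- data.table(a = c("x", "x", "y"), b = c(1, 1, 2))
res <- count_by(toy, a, b)
res
```

The captured list is passed to env directly, which then turns it into a list() call for by, the same shape the verbose output reports.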
There’s a lot more to this new approach, but I usually only need to work with column names, rather than creating super flexible functions, so this little bit of knowledge is more than enough to keep me going for now.
I will update as I figure more things out. In the meantime, you should check out the new version of data.table, and find your favourite new feature.