new programming with data.table

baby steps creating handy functions with the new data.table programming interface

The newest version of data.table has hit CRAN, and there are lots of great new features.

Among them are a %notin% function and a new let function that can be used instead of :=. (I wasn’t too fussed about let originally, but I have tried it a few times today and may well adopt it, although I do like that := really stands out in my code when assigning or updating variables.)
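For anyone who has not tried these two yet, here is a minimal sketch on a made-up table. The `toy` data and its column names are invented for illustration, and it assumes a data.table version recent enough to include `%notin%` and `let()`.

```r
library(data.table)

# toy data, invented for illustration
toy <- data.table(id = 1:5, grp = c("a", "b", "c", "a", "b"))

# %notin% is the negation of %in%: keep rows whose grp is neither "a" nor "b"
kept <- toy[grp %notin% c("a", "b")]

# let() is an alias for `:=`, so this adds a column by reference
toy[, let(double_id = id * 2L)]

print(kept)
print(toy)
```

Whether let() or := reads better is a matter of taste; both update the table in place.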

The big feature is the new programming interface. I have blogged about programming on data.table before, but things have moved on.

In my packages currently, I use get to retrieve variable names (that, and a rather tortuous method of grabbing the original names, setting them to something else, then switching them back at the end). I no longer need to do this, which is particularly handy, as my {spccharter} package has been sitting dormant while awaiting this new programming approach.
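To make the contrast concrete, here is a rough sketch of the two styles side by side. The `toy` table and both function names are hypothetical, made up for this comparison rather than taken from any of my packages.

```r
library(data.table)

# toy data, invented for illustration
toy <- data.table(species = c("a", "b", "a"), mass = c(1, 2, 3))

# older style: look the column up with get(), then repair the column name
count_by_get <- function(DT, col) {
  res <- DT[, .N, by = get(col)]
  setnames(res, 1L, col)  # the grouping column is not named after `col`, so rename it
  res[]
}

# newer style: env swaps the placeholder `col` for the column named in the string
count_by_env <- function(DT, col) {
  DT[, .N, by = col, env = list(col = col)]
}

print(count_by_get(toy, "species"))
print(count_by_env(toy, "species"))
```

Both should return the same counts; the env version simply needs no renaming step afterwards.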

A few examples of making it work - first of all, a handy descending sort function, as I find myself doing this a lot.

library(data.table) 
library(dplyr) # we'll mimic some of the dplyr examples later on
library(palmerpenguins) # about time I used this, I suppose

I’ll be honest, I was a bit lost with how to approach this, but Jan Gorecki saw my post and gave me a nudge in the right direction (not for the first time - thanks Jan!)

For posterity, here is my first attempt, which worked when supplying a quoted variable, but not an unquoted one.

sorted2 <- function(.DT = DT, x) {
  res <- .DT[,.N, .(V1 = x),
             env = list(x = x)][order(-N)]
  setnames(res, "V1", x)
  res
}

All I needed was to add in the env = list(x = substitute(x)) to the end of the first line of my function.

This is how it should have been:

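descending_sort <- function(.DT, x) {
  # substitute(x) captures the unquoted column name the caller typed
  .DT[, .N, x,
      env = list(x = substitute(x))
      ][order(-N)]
}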

Let’s test it and, for the first time on this blog, use palmerpenguins for an example:

pingu <- setDT(copy(palmerpenguins::penguins))  # I ain't typing "penguins" over and over
names(pingu) # have avoided this dataset for years so need a reminder of what's actually in it
descending_sort(pingu, species)
#     species     N
#      <fctr> <int>
#1:    Adelie   152
#2:    Gentoo   124
#3: Chinstrap    68
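As far as I can tell, env converts character strings to column names, so the substitute() version should cope with a quoted name as well as an unquoted one. A small sketch on invented data (`toy` and `descending_sort_sketch` are illustration names only):

```r
library(data.table)

# toy data, invented for illustration
toy <- data.table(species = c("x", "y", "y", "z", "y", "z"))

descending_sort_sketch <- function(.DT, x) {
  # substitute(x) yields a symbol for species and a string for "species";
  # env turns the string into a column name as well
  .DT[, .N, x, env = list(x = substitute(x))][order(-N)]
}

unquoted <- descending_sort_sketch(toy, species)
quoted <- descending_sort_sketch(toy, "species")

print(unquoted)
print(quoted)
```

What this sketch does not handle is a variable that holds the column name: substitute() would capture the variable's own name rather than its contents.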

That works for one variable. Here is a function that groups by any number of variables and sorts the counts in descending order:

descending_group_sort <- function(.DT, ...) {
  vars <-  eval(substitute(alist(...)),
                envir = parent.frame())
  .DT[,
      .N,
      by = vars,
      env = list(vars = substitute(vars))
      ][order(-N)]
}
descending_group_sort(pingu, flipper_length_mm, body_mass_g)

(You’ll have to trust me, I don’t want to paste 306 rows here)
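If the column names arrive as strings instead, a list in env appears to be turned into a list() call, which suggests a variant along these lines. This is a sketch on invented data, and `descending_group_sort_chr` is a hypothetical name.

```r
library(data.table)

# toy data, invented for illustration
toy <- data.table(
  a = c(1, 1, 2, 2, 2),
  b = c("u", "u", "v", "v", "w")
)

descending_group_sort_chr <- function(.DT, cols) {
  # as.list(cols) gives list("a", "b"), which env rewrites to list(a, b)
  .DT[, .N, by = vars, env = list(vars = as.list(cols))][order(-N)]
}

res <- descending_group_sort_chr(toy, c("a", "b"))
print(res)
```

Plain by = cols also accepts a character vector, so the only point here is to show how env treats a list.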

Now let’s nick some examples from dplyr, and mimic them with our new data.table functionality:

## dplyr examples
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(),
              min = min({{ var }}),  # {{ }} passes the unquoted column through
              max = max({{ var }}))
}
mtcars %>%
  group_by(cyl) %>%
  var_summary(mpg)
# A tibble: 3 × 4
#    cyl     n   min   max
#  <dbl> <int> <dbl> <dbl>
#1     4    11  21.4  33.9
#2     6     7  17.8  21.4
#3     8    14  10.4  19.2

And here’s the same with the new release of data.table:

mtc <- setDT(copy(mtcars)) # copy and turn it into a data.table
setorder(mtc, cyl) # ensure the order matches the dplyr results
var_summary_dt <- function(data, var, grp) {
  data[, .(n = .N,
           min = min(var),
           max = max(var)),
       .(grp),
       env = list(var = substitute(var),
                  grp = substitute(grp))]
}
var_summary_dt(mtc, mpg, cyl)
#    cyl     n   min   max
#   <num> <int> <num> <num>
#1:     4    11  21.4  33.9
#2:     6     7  17.8  21.4
#3:     8    14  10.4  19.2

Looks good to me!
Again, all we had to do was add the calls to substitute in the env.

Here are some further dplyr examples - a summary function for one or more variables from the starwars dataset:

my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE),
              height = mean(height, na.rm = TRUE))
}
starwars %>% my_summarise(homeworld) # too many rows to print here
starwars %>% my_summarise(sex, gender)
# A tibble: 6 × 4
# Groups:   sex [5]
# sex            gender      mass height
# <chr>          <chr>      <dbl>  <dbl>
#1 female         feminine    54.7   172.
#2 hermaphroditic masculine 1358     175 
#3 male           masculine   80.2   179.
#4 none           feminine   NaN      96 
#5 none           masculine   69.8   140 
#6 NA             NA          81     175 

And the data.table equivalent:

starwars_dt <- setDT(copy(starwars))
my_summarise_dt <- function(.dt, ...) {
  vars <-  eval(substitute(alist(...)),
                envir = parent.frame())
  .dt[, lapply(.SD, mean, na.rm = TRUE),
      .SDcols = c("mass", "height"),
      by = vars,
      env = list(vars = substitute(vars))][]
}

Let’s try it:

my_summarise_dt(.dt = starwars_dt, homeworld) # too many rows to print
my_summarise_dt(.dt = starwars_dt, sex, gender) # same as dplyr
#              sex    gender       mass   height
#           <char>    <char>      <num>    <num>
#1:           male masculine   80.21905 179.1228
#2:           none masculine   69.75000 140.0000
#3:         female  feminine   54.68889 171.5714
#4: hermaphroditic masculine 1358.00000 175.0000
#5:           <NA>      <NA>   81.00000 175.0000
#6:           none  feminine        NaN  96.0000

Edit

Jan saw my post and suggested I mention the option to turn on verbose = TRUE. This can help when debugging issues with the env.

Here’s how that last function looks:

my_summarise_dt <- function(.dt, ...) {
  vars <-  eval(substitute(alist(...)),
                envir = parent.frame())
  .dt[, lapply(.SD, mean, na.rm = TRUE),
      .SDcols = c("mass", "height"),
      by = vars,
      env = list(vars = substitute(vars)), 
      verbose = TRUE][]
}

So you just add the verbose = TRUE part before the final closing square bracket, after your env definition.
These are the results from calling my_summarise_dt(.dt = starwars_dt, sex, gender).

Argument 'by' after substitute: list(sex, gender)
Argument 'j' after substitute: lapply(.SD, mean, na.rm = TRUE)
Finding groups using forderv ... forder.c received 87 rows and 2 columns
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
Getting back original order ... forder.c received a vector type 'integer' length 6
0.000s elapsed (0.000s cpu) 
lapply optimization changed j from 'lapply(.SD, mean, na.rm = TRUE)' to 'list(mean(mass, na.rm = TRUE), mean(height, na.rm = TRUE))'
GForce optimized j to 'list(gmean(mass, na.rm = TRUE), gmean(height, na.rm = TRUE))' (see ?GForce)
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.000
This gmean took (narm=TRUE) ... gather took ... 0.003s
0.004s
This gmean took (narm=TRUE) ... gather took ... 0.000s
0.001s
gforce eval took 0.005
0.010s elapsed (0.000s cpu) 
              sex    gender       mass   height
           <char>    <char>      <num>    <num>
1:           male masculine   80.21905 179.1228
2:           none masculine   69.75000 140.0000
3:         female  feminine   54.68889 171.5714
4: hermaphroditic masculine 1358.00000 175.0000
5:           <NA>      <NA>   81.00000 175.0000
6:           none  feminine        NaN  96.0000

We can see that our grouping columns are interpreted correctly.
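Another way to check a substitution, without running a query at all, is data.table's substitute2() function, which I understand applies the same substitution that env triggers inside the square brackets. A minimal sketch, with the expression and the "mpg" column chosen purely as an example:

```r
library(data.table)

# build the substituted expression without evaluating it against any data
expr <- substitute2(
  .(n = .N, min = min(var), max = max(var)),
  env = list(var = "mpg")
)

print(expr)
```

The printed expression should show the placeholder var replaced by mpg, much like the "after substitute" lines in the verbose output.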

Disclaimer

I’m not sure about the use of eval(substitute(alist(...)), envir = parent.frame()).

It works, but may not be what the data.table devs intended (feel free to put me right).

There may be some unintended consequences that have not yet smacked me in the face. That said, data.table’s error handling is so freakishly accurate (like, in the room watching over your shoulder) that any issues should be easy enough to solve.
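One possible alternative, offered as a sketch and not as the approach the devs recommend, is to capture the dots directly as a list() call with substitute(list(...)) and hand that to env. The `toy` data and the function name `descending_group_sort_alt` are invented for illustration.

```r
library(data.table)

# toy data, invented for illustration
toy <- data.table(
  a = c(1, 1, 2, 2, 2),
  b = c("u", "u", "v", "v", "w")
)

descending_group_sort_alt <- function(.DT, ...) {
  # substitute(list(...)) returns the unevaluated call list(a, b)
  by_call <- substitute(list(...))
  .DT[, .N, by = vars, env = list(vars = by_call)][order(-N)]
}

res <- descending_group_sort_alt(toy, a, b)
print(res)
```

This avoids the eval() and parent.frame() step, though I have only tried it on simple column names.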

There’s a lot more to this new approach, but I usually only need to work with column names, rather than creating super flexible functions, so this little bit of knowledge is more than enough to keep me going for now.
I will update as I figure more things out. In the meantime, you should check out the new version of data.table, and find your favourite new feature.
