Pivoting data in R with tidyr and data.table

How to pivot data in R with data.table and tidyr. Examples of melt, dcast, pivot_longer and pivot_wider

We all need to pivot data at some point, so these are really just notes for my own benefit, because gather and spread are no longer in favour within tidyr.

NB - this post has been updated with collapsible sections to show/hide the data and outputs.

I tended to only ever need gather, and nearly always relied on the same key and value names, so it was an easy function for me to use.

pivot_longer and pivot_wider are much more flexible, they just take a little bit more thinking about.

For example, my old approach has now changed along these lines:

# old way with `gather`
df %>%
  mutate(row = row_number()) %>%
  gather('column', 'source', -row, -N)  # key = column, value = source, retain row and N
  # further transforms

# new way with `pivot_longer`
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(!c(row, N),
               names_to = 'column',
               values_to = 'source')
  # further transforms

However, what I really want to do is show how to replicate much of the tidyr pivot functionality with data.table.

Once again, this is not intended to be in-depth.
I have simply used the tidyr help file code, and tried to replicate it with data.table.
I’d be interested in improvements to my data.table code.

Let’s pivot!

Note - in all examples, I’ll create a copy of the data set as a data.table using setDT(copy(source_data))
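As a minimal sketch, using the first dataset from below (relig_income, which ships with tidyr), that setup step looks like this:

```r
library(tidyr)       # for the relig_income dataset
library(data.table)

# copy() protects the original from modification by reference;
# setDT() then converts the copy to a data.table in place
DT <- setDT(copy(relig_income))
```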

I’m using the base pipe for simplicity, so aside from that, all you need is

library(tidyr)
library(data.table)

tidyr::pivot_longer() ~ data.table::melt()

Using the built-in relig_income dataset:

Show data
## # A tibble: 18 × 11
##    religion      `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
##    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Agnostic           27      34      60      81      76     137     122     109
##  2 Atheist            12      27      37      52      35      70      73      59
##  3 Buddhist           27      21      30      34      33      58      62      39
##  4 Catholic          418     617     732     670     638    1116     949     792
##  5 Don’t know/r…      15      14      15      11      10      35      21      17
##  6 Evangelical …     575     869    1064     982     881    1486     949     723
##  7 Hindu               1       9       7       9      11      34      47      48
##  8 Historically…     228     244     236     238     197     223     131      81
##  9 Jehovah's Wi…      20      27      24      24      21      30      15      11
## 10 Jewish             19      19      25      25      30      95      69      87
## 11 Mainline Prot     289     495     619     655     651    1107     939     753
## 12 Mormon             29      40      48      51      56     112      85      49
## 13 Muslim              6       7       9      10       9      23      16       8
## 14 Orthodox           13      17      23      32      32      47      38      42
## 15 Other Christ…       9       7      11      13      13      14      18      14
## 16 Other Faiths       20      33      40      46      49      63      46      40
## 17 Other World …       5       2       3       4       2       7       3       4
## 18 Unaffiliated      217     299     374     365     341     528     407     321
## # … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
## #   abbreviated variable names ¹​`$10-20k`, ²​`$20-30k`, ³​`$30-40k`, ⁴​`$40-50k`,
## #   ⁵​`$50-75k`, ⁶​`$75-100k`, ⁷​`$100-150k`


Code comparison

relig_income |>
  pivot_longer(!religion, # keep religion as a column
               names_to = "income", # desired name for new column
               values_to = "count") # what data goes into the new column?
melt(DT, id.vars = "religion",
     variable.name = "income",
     value.name = "count",
     variable.factor = FALSE) # added to keep output consistent with tidyr

With data.table, you can often get away with supplying only measure.vars or only id.vars, and nothing else, and it does a pretty great job of guessing what to do.
Obviously it’s better to be specific, but it’s worth bearing in mind.
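For instance, with the relig_income data, a sketch that relies on that guessing (melt treats every non-id column as a measure column, and falls back to the default column names):

```r
library(tidyr)       # for the relig_income dataset
library(data.table)

DT <- setDT(copy(relig_income))

# only id.vars supplied: melt treats every other column as a measure column,
# and names the new columns "variable" and "value" by default
melt(DT, id.vars = "religion")
```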

You can compare outputs here:

pivot_longer output
## # A tibble: 180 × 3
##    religion income             count
##    <chr>    <chr>              <dbl>
##  1 Agnostic <$10k                 27
##  2 Agnostic $10-20k               34
##  3 Agnostic $20-30k               60
##  4 Agnostic $30-40k               81
##  5 Agnostic $40-50k               76
##  6 Agnostic $50-75k              137
##  7 Agnostic $75-100k             122
##  8 Agnostic $100-150k            109
##  9 Agnostic >150k                 84
## 10 Agnostic Don't know/refused    96
## # … with 170 more rows


melt output
##                   religion             income count
##   1:              Agnostic              <$10k    27
##   2:               Atheist              <$10k    12
##   3:              Buddhist              <$10k    27
##   4:              Catholic              <$10k   418
##   5:    Don’t know/refused              <$10k    15
##  ---                                               
## 176:              Orthodox Don't know/refused    73
## 177:       Other Christian Don't know/refused    18
## 178:          Other Faiths Don't know/refused    71
## 179: Other World Religions Don't know/refused     8
## 180:          Unaffiliated Don't know/refused   597


Drop missing values

Now, a slightly more complex case where the columns share a common prefix, and the missing values are structural so should be dropped, using the billboard dataset:

show billboard data
## A tibble: 317 × 79
## artist      track date.ent…¹   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8   wk9  wk10  wk11  wk12  wk13  wk14  
##  <chr>       <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 
## 1 2 Pac       Baby… 2000-02-26    87    82    72    77    87    94    99    NA    NA    NA    NA    NA    NA    NA   
## 2 2Ge+her     The … 2000-09-02    91    87    92    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
## 3 3 Doors Do… Kryp… 2000-04-08    81    70    68    67    66    57    54    53    51    51    51    51    47    44   
## 4 3 Doors Do… Loser 2000-10-21    76    76    72    69    67    65    55    59    62    61    61    59    61    66    
## 5 504 Boyz    Wobb… 2000-04-15    57    34    25    17    17    31    36    49    53    57    64    70    75    76    
## 6 98^0        Give… 2000-08-19    51    39    34    26    26    19     2     2     3     6     7    22    29    36   
## 7 A*Teens     Danc… 2000-07-08    97    97    96    95   100    NA    NA    NA    NA    NA    NA    NA    NA    NA   
## 8 Aaliyah     I Do… 2000-01-29    84    62    51    41    38    35    35    38    38    36    37    37    38    49   
## 9 Aaliyah     Try … 2000-03-18    59    53    38    28    21    18    16    14    12    10     9     8     6     1   
## 10 Adams, Yol… Open… 2000-08-26    76    76    74    69    68    67    61    58    57    59    66    68    61    67  
## … with 307 more rows, 60 more variables: wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 
##   wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 
##   wk34 <dbl>, wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, wk43 
##   wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>, wk53 
##   wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>, wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 
##   wk64 <dbl>, wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>, wk71 <lgl>, wk72 <lgl>, wk73 
##   wk74 <lgl>, wk75 <lgl>, wk76 <lgl>, and abbreviated variable name ¹​date.entered
## ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names


Code comparison:

billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )
melt(DT,
     measure.vars = patterns("^wk"),
     variable.name = "week",
     value.name = "rank",
     na.rm = TRUE)
pivot_longer output
## # A tibble: 5,307 × 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   1        87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   2        82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   3        72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   4        77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   5        87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   6        94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   7        99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   1        91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   2        87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   3        92
## # … with 5,297 more rows


melt output
##             artist                   track date.entered week rank
##    1:        2 Pac Baby Don't Cry (Keep...   2000-02-26  wk1   87
##    2:      2Ge+her The Hardest Part Of ...   2000-09-02  wk1   91
##    3: 3 Doors Down              Kryptonite   2000-04-08  wk1   81
##    4: 3 Doors Down                   Loser   2000-10-21  wk1   76
##    5:     504 Boyz           Wobble Wobble   2000-04-15  wk1   57
##   ---                                                            
## 5303:        Creed                  Higher   1999-09-11 wk63   50
## 5304:     Lonestar                  Amazed   1999-06-05 wk63   45
## 5305:        Creed                  Higher   1999-09-11 wk64   50
## 5306:     Lonestar                  Amazed   1999-06-05 wk64   50
## 5307:        Creed                  Higher   1999-09-11 wk65   49


Multiple variables stored in column names

Using the who dataset

show who data
## A tibble: 7,240 × 60
##   country iso2  iso3   year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶
##  <chr>   <chr> <chr> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghan… AF    AFG    1980      NA      NA      NA      NA      NA      NA
## 2 Afghan… AF    AFG    1981      NA      NA      NA      NA      NA      NA
## 3 Afghan… AF    AFG    1982      NA      NA      NA      NA      NA      NA
## 4 Afghan… AF    AFG    1983      NA      NA      NA      NA      NA      NA
## 5 Afghan… AF    AFG    1984      NA      NA      NA      NA      NA      NA
## 6 Afghan… AF    AFG    1985      NA      NA      NA      NA      NA      NA
## 7 Afghan… AF    AFG    1986      NA      NA      NA      NA      NA      NA
## 8 Afghan… AF    AFG    1987      NA      NA      NA      NA      NA      NA
## 9 Afghan… AF    AFG    1988      NA      NA      NA      NA      NA      NA
## 10 Afghan… AF    AFG    1989      NA      NA      NA      NA      NA      NA
## … with 7,230 more rows, 50 more variables: new_sp_m65 <dbl>,
##   new_sp_f014 <dbl>, new_sp_f1524 <dbl>, new_sp_f2534 <dbl>,
##   new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,
##   new_sp_f65 <dbl>, new_sn_m014 <dbl>, new_sn_m1524 <dbl>,
##   new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, new_sn_m4554 <dbl>,
##   new_sn_m5564 <dbl>, new_sn_m65 <dbl>, new_sn_f014 <dbl>,
##   new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, new_sn_f3544 <dbl>, …
## ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names


who |> pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)", # Whoa
  values_to = "count"
)
 # NB -  the development version of data.table 
 # makes this much easier
melt(DT, value.name = "count",
     measure.vars = measure(
       diagnosis, gender, age,
       pattern = "new_?(.*)_(.)(.*)"))
# OR 
DT[, melt(.SD, value.name = "count",
          measure.vars = measure(
            diagnosis, gender, age,
            pattern = "new_?(.*)_(.)(.*)"))]

This is how I did it in the current version, without blowing up my laptop:

# current version:  
DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
                ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE)
                  ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2)
                    ][, !c("variable","temp")
                      ][,c(1:4,6:8,5)][]

OK - we’re going to need to break this one down.

Firstly, note that this time round, I am using melt inside of DT, rather than melt(DT)

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][]

This returns a new variable column, which needs splitting up into more columns.

First, we need to get rid of that new_ prefix, using gsub to remove it.
Then I want the diagnosis part, so I create a diagnosis column, and a “temp” column for the remaining bits, which I’ll split over the next few steps.

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
                ][]

Now to split the gender column out of the new temp column. This is going to be either an “m” or an “f”.

Then retrieve the age, again from the temp column. This generates two vectors, and I only want the second one, hence the keep = 2.

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
                ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE)
                  ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2)
                    ][]

The last two lines discard the variable and temp columns, and sort the columns into the same order as that returned by pivot_longer

Final output
##             country iso2 iso3 year diagnosis gender age count
##      1: Afghanistan   AF  AFG 1980        sp      m 014    NA
##      2: Afghanistan   AF  AFG 1981        sp      m 014    NA
##      3: Afghanistan   AF  AFG 1982        sp      m 014    NA
##      4: Afghanistan   AF  AFG 1983        sp      m 014    NA
##      5: Afghanistan   AF  AFG 1984        sp      m 014    NA
##     ---                                                      
## 405436:    Zimbabwe   ZW  ZWE 2009       rel      f  65    NA
## 405437:    Zimbabwe   ZW  ZWE 2010       rel      f  65    NA
## 405438:    Zimbabwe   ZW  ZWE 2011       rel      f  65    NA
## 405439:    Zimbabwe   ZW  ZWE 2012       rel      f  65    NA
## 405440:    Zimbabwe   ZW  ZWE 2013       rel      f  65   725


Convert matrix to long

Using the anscombe dataset

show anscombe data
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89


Here’s what we’re aiming for:

target output
## # A tibble: 44 × 3
##    set       x     y
##    <chr> <dbl> <dbl>
##  1 1        10  8.04
##  2 1         8  6.95
##  3 1        13  7.58
##  4 1         9  8.81
##  5 1        11  8.33
##  6 1        14  9.96
##  7 1         6  7.24
##  8 1         4  4.26
##  9 1        12 10.8 
## 10 1         7  4.82
## # … with 34 more rows


The two approaches look like this:

anscombe |>
  pivot_longer(
    everything(),
    cols_vary = "slowest",
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)" 
  )
DT[,melt(.SD,
            variable.name = "set",
            value.name = c("x","y"),
            variable.factor = FALSE,
            measure.vars = patterns("^x","^y"))]

This is the output from melt - we’ve seen the pivot_longer outputs above

show output
##     set  x     y
##  1:   1 10  8.04
##  2:   1  8  6.95
##  3:   1 13  7.58
##  4:   1  9  8.81
##  5:   1 11  8.33
##  6:   1 14  9.96
##  7:   1  6  7.24
##  8:   1  4  4.26
##  9:   1 12 10.84
## 10:   1  7  4.82
## 11:   1  5  5.68
## 12:   2 10  9.14
## 13:   2  8  8.14
## 14:   2 13  8.74
## 15:   2  9  8.77
## 16:   2 11  9.26
## 17:   2 14  8.10
## 18:   2  6  6.13
## 19:   2  4  3.10
## 20:   2 12  9.13
## 21:   2  7  7.26
## 22:   2  5  4.74
## 23:   3 10  7.46
## 24:   3  8  6.77
## 25:   3 13 12.74
## 26:   3  9  7.11
## 27:   3 11  7.81
## 28:   3 14  8.84
## 29:   3  6  6.08
## 30:   3  4  5.39
## 31:   3 12  8.15
## 32:   3  7  6.42
## 33:   3  5  5.73
## 34:   4  8  6.58
## 35:   4  8  5.76
## 36:   4  8  7.71
## 37:   4  8  8.84
## 38:   4  8  8.47
## 39:   4  8  7.04
## 40:   4  8  5.25
## 41:   4 19 12.50
## 42:   4  8  5.56
## 43:   4  8  7.91
## 44:   4  8  6.89
##     set  x     y


We’ve done wide to long, now for the opposite:

tidyr::pivot_wider() ~ data.table::dcast()

I have to be honest, dcast is not a term that makes me think long to wide, but you can’t have everything.

We’re using this fish_encounters dataset for this bit.
It’s about time fish made an appearance on this blog.
Probably.

show fish_encounters
## # A tibble: 114 × 3
##    fish  station  seen
##    <fct> <fct>   <int>
##  1 4842  Release     1
##  2 4842  I80_1       1
##  3 4842  Lisbon      1
##  4 4842  Rstr        1
##  5 4842  Base_TD     1
##  6 4842  BCE         1
##  7 4842  BCW         1
##  8 4842  BCE2        1
##  9 4842  BCW2        1
## 10 4842  MAE         1
## # … with 104 more rows


Here’s what we want:

show target output
## # A tibble: 19 × 12
##    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
##    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
##  1 4842        1     1      1     1       1     1     1     1     1     1     1
##  2 4843        1     1      1     1       1     1     1     1     1     1     1
##  3 4844        1     1      1     1       1     1     1     1     1     1     1
##  4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
##  5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
##  6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
##  7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
##  8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
##  9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
## 12 4857        1     1      1     1       1     1     1     1     1    NA    NA
## 13 4858        1     1      1     1       1     1     1     1     1     1     1
## 14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
## 15 4861        1     1      1     1       1     1     1     1     1     1     1
## 16 4862        1     1      1     1       1     1     1     1     1    NA    NA
## 17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA


Code comparison:

fish_encounters |>
  pivot_wider(names_from = station, values_from = seen)
dcast(DT, fish ~ station, value.var = "seen") 

Technically, we can avoid stating that value.var = "seen", but we do then get a large informational message in the console, which is always disconcerting. (If data.table tells you something, you’re well advised to act on it, because the help is scarily accurate)
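As a sketch, the shortened call would look like this (dcast guesses which column holds the values and prints an informational message naming its choice):

```r
library(tidyr)       # for the fish_encounters dataset
library(data.table)

DT <- setDT(copy(fish_encounters))

# no value.var supplied: dcast guesses "seen"
# and prints a message telling you what it picked
dcast(DT, fish ~ station)
```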

Here are the data.table outputs:

show dcast output
##     fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
##  1: 4842       1     1      1    1       1   1   1    1    1   1   1
##  2: 4843       1     1      1    1       1   1   1    1    1   1   1
##  3: 4844       1     1      1    1       1   1   1    1    1   1   1
##  4: 4845       1     1      1    1       1  NA  NA   NA   NA  NA  NA
##  5: 4847       1     1      1   NA      NA  NA  NA   NA   NA  NA  NA
##  6: 4848       1     1      1    1      NA  NA  NA   NA   NA  NA  NA
##  7: 4849       1     1     NA   NA      NA  NA  NA   NA   NA  NA  NA
##  8: 4850       1     1     NA    1       1   1   1   NA   NA  NA  NA
##  9: 4851       1     1     NA   NA      NA  NA  NA   NA   NA  NA  NA
## 10: 4854       1     1     NA   NA      NA  NA  NA   NA   NA  NA  NA
## 11: 4855       1     1      1    1       1  NA  NA   NA   NA  NA  NA
## 12: 4857       1     1      1    1       1   1   1    1    1  NA  NA
## 13: 4858       1     1      1    1       1   1   1    1    1   1   1
## 14: 4859       1     1      1    1       1  NA  NA   NA   NA  NA  NA
## 15: 4861       1     1      1    1       1   1   1    1    1   1   1
## 16: 4862       1     1      1    1       1   1   1    1    1  NA  NA
## 17: 4863       1     1     NA   NA      NA  NA  NA   NA   NA  NA  NA
## 18: 4864       1     1     NA   NA      NA  NA  NA   NA   NA  NA  NA
## 19: 4865       1     1      1   NA      NA  NA  NA   NA   NA  NA  NA


Fill in missing values

Let’s get rid of those NAs.

fish_encounters |>
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)
dcast(DT, fish ~ station, value.var = "seen", fill = 0)

Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. The same applies to melt, as you’ll have seen above.

DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0)]

Generate column names from multiple variables

Using the us_rent_income data

show us_rent_income data
## # A tibble: 104 × 5
##    GEOID NAME       variable estimate   moe
##    <chr> <chr>      <chr>       <dbl> <dbl>
##  1 01    Alabama    income      24476   136
##  2 01    Alabama    rent          747     3
##  3 02    Alaska     income      32940   508
##  4 02    Alaska     rent         1200    13
##  5 04    Arizona    income      27517   148
##  6 04    Arizona    rent          972     4
##  7 05    Arkansas   income      23789   165
##  8 05    Arkansas   rent          709     5
##  9 06    California income      29454   109
## 10 06    California rent         1358     3
## # … with 94 more rows


Comparison:

us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )
dcast(DT, GEOID + NAME ~ variable, 
          value.var = c("estimate","moe"))

Alternatively we can pass ... to indicate all other unspecified columns:

dcast(DT, ... ~ variable, 
      value.var = c("estimate","moe"))

Specify a different names separator

us_rent_income |>
  pivot_wider(
    names_from = variable,
    names_sep = ".",
    values_from = c(estimate, moe)
  )
dcast(DT, GEOID + NAME ~ variable,
      value.var = c("estimate","moe"), 
      sep = ".")
## alternatively
DT[, dcast(.SD, GEOID + NAME ~ variable,
     value.var = c("estimate","moe"), 
          sep = ".")]

Names vary

From the tidyr help: “You can control whether names_from values vary fastest or slowest relative to the values_from column names using names_vary”.

us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_vary = "slowest"
  ) |> names()
## [1] "GEOID"           "NAME"            "estimate_income" "moe_income"     
## [5] "estimate_rent"   "moe_rent"

To mimic this in data.table, we specify the final column order

DT[, dcast(.SD, GEOID + NAME ~ variable,
           value.var = c("estimate","moe"))
   ][,c(1:3,5,4,6)] |> names()
## [1] "GEOID"           "NAME"            "estimate_income" "moe_income"     
## [5] "estimate_rent"   "moe_rent"

The fastest option is exactly the same as the default column order:

us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_vary = "fastest"
  ) |> names()
## [1] "GEOID"           "NAME"            "estimate_income" "estimate_rent"  
## [5] "moe_income"      "moe_rent"
DT[, dcast(.SD, GEOID + NAME ~ variable,
           value.var = c("estimate","moe"))] |> names()
## [1] "GEOID"           "NAME"            "estimate_income" "estimate_rent"  
## [5] "moe_income"      "moe_rent"

Performing aggregation with values_fn

Using the warpbreaks dataset:

show warpbreaks
##    breaks wool tension
## 1      26    A       L
## 2      30    A       L
## 3      54    A       L
## 4      25    A       L
## 5      70    A       L
## 6      52    A       L
## 7      51    A       L
## 8      26    A       L
## 9      67    A       L
## 10     18    A       M
## 11     21    A       M
## 12     29    A       M
## 13     17    A       M
## 14     12    A       M
## 15     18    A       M
## 16     35    A       M
## 17     30    A       M
## 18     36    A       M
## 19     36    A       H
## 20     21    A       H
## 21     24    A       H
## 22     18    A       H
## 23     10    A       H
## 24     43    A       H
## 25     28    A       H
## 26     15    A       H
## 27     26    A       H
## 28     27    B       L
## 29     14    B       L
## 30     29    B       L
## 31     19    B       L
## 32     29    B       L
## 33     31    B       L
## 34     41    B       L
## 35     20    B       L
## 36     44    B       L
## 37     42    B       M
## 38     26    B       M
## 39     19    B       M
## 40     16    B       M
## 41     39    B       M
## 42     28    B       M
## 43     21    B       M
## 44     39    B       M
## 45     29    B       M
## 46     20    B       H
## 47     21    B       H
## 48     24    B       H
## 49     17    B       H
## 50     13    B       H
## 51     15    B       H
## 52     15    B       H
## 53     16    B       H
## 54     28    B       H


warpbreaks |>
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = mean
  )
dcast(DT, tension ~ wool, 
          value.var = "breaks", fun = mean)
pivot_wider output
## # A tibble: 3 × 3
##   tension     A     B
##   <fct>   <dbl> <dbl>
## 1 L        44.6  28.2
## 2 M        24    28.8
## 3 H        24.6  18.8
dcast output
##    tension        A        B
## 1:       L 44.55556 28.22222
## 2:       M 24.00000 28.77778
## 3:       H 24.55556 18.77778

Or, do it inside DT:

DT[, dcast(.SD, 
tension ~ wool, 
value.var = "breaks",
fun = mean)]

I hope this is useful to others.
If nothing else, please at least give me credit for not using the Ross Geller GIF.


© 2016 - 2023. All rights reserved.