Pivoting data in R with tidyr and data.table

How to pivot data in R with data.table and tidyr. Examples of melt, dcast, pivot_longer and pivot_wider
We all need to pivot data at some point, so these are just some notes for my own benefit really, because gather
and spread
are no longer in favour within tidyr
.
NB - this post has been updated with collapsible sections to show/hide the data and outputs.
I tended to only ever need gather
, and nearly always relied on the same key and value names, so it was an easy function for me to use.
pivot_longer
and pivot_wider
are much more flexible, they just take a little bit more thinking about.
For example, my old approach has now changed along these lines
# old way with `gather`
df %>%
mutate(row = row_number()) %>%
gather('column', 'source', -row, -N) # key = column, value = source, retain row and N
# further transforms
# new way with pivot_longer
df %>%
mutate(row = row_number()) %>%
pivot_longer(!c(row , N),
names_to = 'column',
values_to = 'source')
# further transforms
However, what I really want to do is show how to replicate much of the tidyr
pivot functionality with data.table
.
Once again, this is not intended to be in-depth.
I have simply used the tidyr
help file code, and tried to replicate it with data.table
.
I’d be interested in improvements to my data.table code.
Let’s pivot!
Note - in all examples, I’ll create a copy of the data set as a data.table using setDT(copy(source_data))
I’m using the base pipe for simplicity, so aside from that, all you need is
library(tidyr)
library(data.table)
tidyr::pivot_longer() ~ data.table::melt()
Using the built-in relig_income
dataset:
Show data
## # A tibble: 18 × 11
## religion `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122 109
## 2 Atheist 12 27 37 52 35 70 73 59
## 3 Buddhist 27 21 30 34 33 58 62 39
## 4 Catholic 418 617 732 670 638 1116 949 792
## 5 Don’t know/r… 15 14 15 11 10 35 21 17
## 6 Evangelical … 575 869 1064 982 881 1486 949 723
## 7 Hindu 1 9 7 9 11 34 47 48
## 8 Historically… 228 244 236 238 197 223 131 81
## 9 Jehovah's Wi… 20 27 24 24 21 30 15 11
## 10 Jewish 19 19 25 25 30 95 69 87
## 11 Mainline Prot 289 495 619 655 651 1107 939 753
## 12 Mormon 29 40 48 51 56 112 85 49
## 13 Muslim 6 7 9 10 9 23 16 8
## 14 Orthodox 13 17 23 32 32 47 38 42
## 15 Other Christ… 9 7 11 13 13 14 18 14
## 16 Other Faiths 20 33 40 46 49 63 46 40
## 17 Other World … 5 2 3 4 2 7 3 4
## 18 Unaffiliated 217 299 374 365 341 528 407 321
## # … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
## # abbreviated variable names ¹`$10-20k`, ²`$20-30k`, ³`$30-40k`, ⁴`$40-50k`,
## # ⁵`$50-75k`, ⁶`$75-100k`, ⁷`$100-150k`
Code comparison
relig_income |>
pivot_longer(!religion, # keep religion as a column
names_to = "income", # desired name for new column
values_to = "count") # what data goes into the new column?
melt(DT, id.vars = "religion",
variable.name = "income",
value.name = "count",
variable.factor = FALSE) # added to keep output consistent with tidyr
With data.table, you can often get away with only supplying either measure.vars
or id.vars
, and nothing else, and it does a pretty great job of guessing what to do.
Obviously it’s better to be specific, but worth bearing in mind.
You can compare outputs here:
pivot_longer
output
## # A tibble: 180 × 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don't know/refused 96
## # … with 170 more rows
melt
output
## religion income count
## 1: Agnostic <$10k 27
## 2: Atheist <$10k 12
## 3: Buddhist <$10k 27
## 4: Catholic <$10k 418
## 5: Don’t know/refused <$10k 15
## ---
## 176: Orthodox Don't know/refused 73
## 177: Other Christian Don't know/refused 18
## 178: Other Faiths Don't know/refused 71
## 179: Other World Religions Don't know/refused 8
## 180: Unaffiliated Don't know/refused 597
Drop missing values
Now, a slightly more complex case where columns have common prefix, and missings are structural so should be dropped, using the billboard
dataset:
show billboard data
## A tibble: 317 × 79
## artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11 wk12 wk13 wk14
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA NA NA NA NA NA NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA NA NA NA NA NA NA
## 3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 51 51 51 51 47 44
## 4 3 Doors Do… Loser 2000-10-21 76 76 72 69 67 65 55 59 62 61 61 59 61 66
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49 53 57 64 70 75 76
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 3 6 7 22 29 36
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA NA NA NA NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38 38 36 37 37 38 49
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14 12 10 9 8 6 1
## 10 Adams, Yol… Open… 2000-08-26 76 76 74 69 68 67 61 58 57 59 66 68 61 67
## … with 307 more rows, 60 more variables: wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23
## wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33
## wk34 <dbl>, wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, wk43
## wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>, wk53
## wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>, wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63
## wk64 <dbl>, wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>, wk71 <lgl>, wk72 <lgl>, wk73
## wk74 <lgl>, wk75 <lgl>, wk76 <lgl>, and abbreviated variable name ¹date.entered
## ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Code comparison:
billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE
)
melt(DT,
measure.vars = patterns("^wk"),
variable.name = "week",
value.name = "rank",
na.rm = TRUE)
pivot_longer
output
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # … with 5,297 more rows
melt
output
## artist track date.entered week rank
## 1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 3: 3 Doors Down Kryptonite 2000-04-08 wk1 81
## 4: 3 Doors Down Loser 2000-10-21 wk1 76
## 5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57
## ---
## 5303: Creed Higher 1999-09-11 wk63 50
## 5304: Lonestar Amazed 1999-06-05 wk63 45
## 5305: Creed Higher 1999-09-11 wk64 50
## 5306: Lonestar Amazed 1999-06-05 wk64 50
## 5307: Creed Higher 1999-09-11 wk65 49
Multiple variables stored in column names
Using the who
dataset
show who
data
## A tibble: 7,240 × 60
## country iso2 iso3 year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… AF AFG 1980 NA NA NA NA NA NA
## 2 Afghan… AF AFG 1981 NA NA NA NA NA NA
## 3 Afghan… AF AFG 1982 NA NA NA NA NA NA
## 4 Afghan… AF AFG 1983 NA NA NA NA NA NA
## 5 Afghan… AF AFG 1984 NA NA NA NA NA NA
## 6 Afghan… AF AFG 1985 NA NA NA NA NA NA
## 7 Afghan… AF AFG 1986 NA NA NA NA NA NA
## 8 Afghan… AF AFG 1987 NA NA NA NA NA NA
## 9 Afghan… AF AFG 1988 NA NA NA NA NA NA
## 10 Afghan… AF AFG 1989 NA NA NA NA NA NA
## … with 7,230 more rows, 50 more variables: new_sp_m65 <dbl>,
## new_sp_f014 <dbl>, new_sp_f1524 <dbl>, new_sp_f2534 <dbl>,
## new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,
## new_sp_f65 <dbl>, new_sn_m014 <dbl>, new_sn_m1524 <dbl>,
## new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, new_sn_m4554 <dbl>,
## new_sn_m5564 <dbl>, new_sn_m65 <dbl>, new_sn_f014 <dbl>,
## new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, new_sn_f3544 <dbl>, …
## ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
who |> pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)", # Whoa
values_to = "count"
)
# NB - the development version of data.table
# makes this much easier
melt(DT, value.name = "count",
measure.vars = measure(
diagnosis, gender, ages,
pattern = "new_?(.*)_(.)(.*)"))
# OR
DT[,melt(.SD, value.name = "count",
measure.vars = measure(
diagnosis, gender, ages,
pattern = "new_?(.*)_(.)(.*)"))]
This is how I did it in the current version, without blowing up my laptop:
# current version:
DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
][,variable := gsub("new_?","", variable)
][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE)
][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2)
][, !c("variable","temp")
][,c(1:4,6:8,5)][]
OK - we’re going to need to break this one down.
Firstly, note that this time round, I am using melt
inside of DT
, rather than melt(DT)
DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
][]
This returns a new variable
column, which needs splitting up into more columns.
First, we need to get rid of that new_
, using gsub
to remove it.
Then, I want the diagnosis part, so I create a diagnosis column, and a “temp” column for the remaining bits, which I’ll split in the next passes
DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
][,variable := gsub("new_?","", variable)
][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
][]
Now to split out the gender
column, from the new temp
column. This is going to be either an “m” or an “f”.
Then retrieve the age
, again from the temp
column. This generates two vectors, I only want the second one, hence the keep = 2
DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
][,variable := gsub("new_?","", variable)
][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE)
][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2)
][]
The last two lines discard the variable
and temp
columns, and sort the columns into the same order as that returned by pivot_longer
Final output
## country iso2 iso3 year diagnosis gender age count
## 1: Afghanistan AF AFG 1980 sp m 014 NA
## 2: Afghanistan AF AFG 1981 sp m 014 NA
## 3: Afghanistan AF AFG 1982 sp m 014 NA
## 4: Afghanistan AF AFG 1983 sp m 014 NA
## 5: Afghanistan AF AFG 1984 sp m 014 NA
## ---
## 405436: Zimbabwe ZW ZWE 2009 rel f 65 NA
## 405437: Zimbabwe ZW ZWE 2010 rel f 65 NA
## 405438: Zimbabwe ZW ZWE 2011 rel f 65 NA
## 405439: Zimbabwe ZW ZWE 2012 rel f 65 NA
## 405440: Zimbabwe ZW ZWE 2013 rel f 65 725
Convert matrix to long
Using the anscombe
dataset
show anscombe data
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
Here’s what we’re aiming for:
target output
## # A tibble: 44 × 3
## set x y
## <chr> <dbl> <dbl>
## 1 1 10 8.04
## 2 1 8 6.95
## 3 1 13 7.58
## 4 1 9 8.81
## 5 1 11 8.33
## 6 1 14 9.96
## 7 1 6 7.24
## 8 1 4 4.26
## 9 1 12 10.8
## 10 1 7 4.82
## # … with 34 more rows
The two approaches look like this:
anscombe |>
pivot_longer(
everything(),
cols_vary = "slowest",
names_to = c(".value", "set"),
names_pattern = "(.)(.)"
)
DT[,melt(.SD,
variable.name = "set",
value.name = c("x","y"),
variable.factor = FALSE,
measure.vars = patterns("^x","^y"))]
This is the output from melt
- we’ve seen the pivot_longer
outputs above
show output
## set x y
## 1: 1 10 8.04
## 2: 1 8 6.95
## 3: 1 13 7.58
## 4: 1 9 8.81
## 5: 1 11 8.33
## 6: 1 14 9.96
## 7: 1 6 7.24
## 8: 1 4 4.26
## 9: 1 12 10.84
## 10: 1 7 4.82
## 11: 1 5 5.68
## 12: 2 10 9.14
## 13: 2 8 8.14
## 14: 2 13 8.74
## 15: 2 9 8.77
## 16: 2 11 9.26
## 17: 2 14 8.10
## 18: 2 6 6.13
## 19: 2 4 3.10
## 20: 2 12 9.13
## 21: 2 7 7.26
## 22: 2 5 4.74
## 23: 3 10 7.46
## 24: 3 8 6.77
## 25: 3 13 12.74
## 26: 3 9 7.11
## 27: 3 11 7.81
## 28: 3 14 8.84
## 29: 3 6 6.08
## 30: 3 4 5.39
## 31: 3 12 8.15
## 32: 3 7 6.42
## 33: 3 5 5.73
## 34: 4 8 6.58
## 35: 4 8 5.76
## 36: 4 8 7.71
## 37: 4 8 8.84
## 38: 4 8 8.47
## 39: 4 8 7.04
## 40: 4 8 5.25
## 41: 4 19 12.50
## 42: 4 8 5.56
## 43: 4 8 7.91
## 44: 4 8 6.89
## set x y
We’ve done wide to long
, now for the opposite:
tidyr::pivot_wider() ~ data.table::dcast()
I have to be honest, dcast
is not a term that makes me think long to wide
, but you can’t have everything.
We’re using this fish_encounters
dataset for this bit.
It’s about time fish made an appearance on this blog.
Probably.
show fish_encounters
## # A tibble: 114 × 3
## fish station seen
## <fct> <fct> <int>
## 1 4842 Release 1
## 2 4842 I80_1 1
## 3 4842 Lisbon 1
## 4 4842 Rstr 1
## 5 4842 Base_TD 1
## 6 4842 BCE 1
## 7 4842 BCW 1
## 8 4842 BCE2 1
## 9 4842 BCW2 1
## 10 4842 MAE 1
## # … with 104 more rows
Here’s what we want:
show target output
## # A tibble: 19 × 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 NA NA NA NA NA NA
## 5 4847 1 1 1 NA NA NA NA NA NA NA NA
## 6 4848 1 1 1 1 NA NA NA NA NA NA NA
## 7 4849 1 1 NA NA NA NA NA NA NA NA NA
## 8 4850 1 1 NA 1 1 1 1 NA NA NA NA
## 9 4851 1 1 NA NA NA NA NA NA NA NA NA
## 10 4854 1 1 NA NA NA NA NA NA NA NA NA
## 11 4855 1 1 1 1 1 NA NA NA NA NA NA
## 12 4857 1 1 1 1 1 1 1 1 1 NA NA
## 13 4858 1 1 1 1 1 1 1 1 1 1 1
## 14 4859 1 1 1 1 1 NA NA NA NA NA NA
## 15 4861 1 1 1 1 1 1 1 1 1 1 1
## 16 4862 1 1 1 1 1 1 1 1 1 NA NA
## 17 4863 1 1 NA NA NA NA NA NA NA NA NA
## 18 4864 1 1 NA NA NA NA NA NA NA NA NA
## 19 4865 1 1 1 NA NA NA NA NA NA NA NA
Code comparison:
fish_encounters |>
pivot_wider(names_from = station, values_from = seen)
dcast(DT, fish ~ station, value.var = "seen")
Technically, we can avoid stating that value.var = "seen"
, but we do then get a large informational message in the console, which is always disconcerting. (If data.table tells you something, you’re well advised to act on it, because the help is scarily accurate)
Here are the data.table outputs:
show dcast
output
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## 1: 4842 1 1 1 1 1 1 1 1 1 1 1
## 2: 4843 1 1 1 1 1 1 1 1 1 1 1
## 3: 4844 1 1 1 1 1 1 1 1 1 1 1
## 4: 4845 1 1 1 1 1 NA NA NA NA NA NA
## 5: 4847 1 1 1 NA NA NA NA NA NA NA NA
## 6: 4848 1 1 1 1 NA NA NA NA NA NA NA
## 7: 4849 1 1 NA NA NA NA NA NA NA NA NA
## 8: 4850 1 1 NA 1 1 1 1 NA NA NA NA
## 9: 4851 1 1 NA NA NA NA NA NA NA NA NA
## 10: 4854 1 1 NA NA NA NA NA NA NA NA NA
## 11: 4855 1 1 1 1 1 NA NA NA NA NA NA
## 12: 4857 1 1 1 1 1 1 1 1 1 NA NA
## 13: 4858 1 1 1 1 1 1 1 1 1 1 1
## 14: 4859 1 1 1 1 1 NA NA NA NA NA NA
## 15: 4861 1 1 1 1 1 1 1 1 1 1 1
## 16: 4862 1 1 1 1 1 1 1 1 1 NA NA
## 17: 4863 1 1 NA NA NA NA NA NA NA NA NA
## 18: 4864 1 1 NA NA NA NA NA NA NA NA NA
## 19: 4865 1 1 1 NA NA NA NA NA NA NA NA
Fill in missing values
Let’s get rid of those NA’s.
fish_encounters |>
pivot_wider(names_from = station, values_from = seen, values_fill = 0)
dcast(DT, fish ~ station, value.var = "seen", fill = 0)
Rather than have the DT
inside dcast
, we can use .SD
and have dcast
inside DT
, which is helpful for further chaining. The same applies to melt
, as you’ll have seen above.
DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0)]
Generate column names from multiple variables
Using the us_rent_income
data
show us_rent_income data
## # A tibble: 104 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01 Alabama income 24476 136
## 2 01 Alabama rent 747 3
## 3 02 Alaska income 32940 508
## 4 02 Alaska rent 1200 13
## 5 04 Arizona income 27517 148
## 6 04 Arizona rent 972 4
## 7 05 Arkansas income 23789 165
## 8 05 Arkansas rent 709 5
## 9 06 California income 29454 109
## 10 06 California rent 1358 3
## # … with 94 more rows
Comparison:
us_rent_income |>
pivot_wider(
names_from = variable,
values_from = c(estimate, moe)
)
dcast(DT, GEOID + NAME ~ variable,
value.var = c("estimate","moe"))
Alternatively we can pass ...
to indicate all other unspecified columns:
dcast(DT, ... ~ variable,
value.var = c("estimate","moe"))
Specify a different names separator
us_rent_income |>
pivot_wider(
names_from = variable,
names_sep = ".",
values_from = c(estimate, moe)
)
dcast(DT, GEOID + NAME ~ variable,
value.var = c("estimate","moe"),
sep = ".")
## alternatively
DT[, dcast(.SD, GEOID + NAME ~ variable,
value.var = c("estimate","moe"),
sep = ".")]
Names vary
From the tidyr help: “You can control whether names_from
values vary fastest or slowest relative to the values_from
column names using names_vary
”.
us_rent_income |>
pivot_wider(
names_from = variable,
values_from = c(estimate, moe),
names_vary = "slowest"
) |> names()
## [1] "GEOID" "NAME" "estimate_income" "moe_income"
## [5] "estimate_rent" "moe_rent"
To mimic this in data.table, we specify the final column order
DT[, dcast(.SD, GEOID + NAME ~ variable,
value.var = c("estimate","moe"))
][,c(1:3,5,4,6)] |> names()
## [1] "GEOID" "NAME" "estimate_income" "moe_income"
## [5] "estimate_rent" "moe_rent"
The fastest
option is exactly the same as the original
us_rent_income |>
pivot_wider(
names_from = variable,
values_from = c(estimate, moe),
names_vary = "fastest"
) |> names()
## [1] "GEOID" "NAME" "estimate_income" "estimate_rent"
## [5] "moe_income" "moe_rent"
DT[, dcast(.SD, GEOID + NAME ~ variable,
value.var = c("estimate","moe"))] |> names()
## [1] "GEOID" "NAME" "estimate_income" "estimate_rent"
## [5] "moe_income" "moe_rent"
Performing aggregation with values_fn
Using the warpbreaks
dataset:
show warpbreaks
## breaks wool tension
## 1 26 A L
## 2 30 A L
## 3 54 A L
## 4 25 A L
## 5 70 A L
## 6 52 A L
## 7 51 A L
## 8 26 A L
## 9 67 A L
## 10 18 A M
## 11 21 A M
## 12 29 A M
## 13 17 A M
## 14 12 A M
## 15 18 A M
## 16 35 A M
## 17 30 A M
## 18 36 A M
## 19 36 A H
## 20 21 A H
## 21 24 A H
## 22 18 A H
## 23 10 A H
## 24 43 A H
## 25 28 A H
## 26 15 A H
## 27 26 A H
## 28 27 B L
## 29 14 B L
## 30 29 B L
## 31 19 B L
## 32 29 B L
## 33 31 B L
## 34 41 B L
## 35 20 B L
## 36 44 B L
## 37 42 B M
## 38 26 B M
## 39 19 B M
## 40 16 B M
## 41 39 B M
## 42 28 B M
## 43 21 B M
## 44 39 B M
## 45 29 B M
## 46 20 B H
## 47 21 B H
## 48 24 B H
## 49 17 B H
## 50 13 B H
## 51 15 B H
## 52 15 B H
## 53 16 B H
## 54 28 B H
warpbreaks %>%
pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = mean
)
dcast(DT, tension ~ wool,
value.var = "breaks", fun = mean)
warpbreaks %>%
pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = mean
)
## # A tibble: 3 × 3
## tension A B
## <fct> <dbl> <dbl>
## 1 L 44.6 28.2
## 2 M 24 28.8
## 3 H 24.6 18.8
dcast(DT, tension ~ wool,
value.var = "breaks", fun = mean)
## tension A B
## 1: L 44.55556 28.22222
## 2: M 24.00000 28.77778
## 3: H 24.55556 18.77778
Or, do it inside DT:
DT[, dcast(.SD,
tension ~ wool,
value.var = "breaks",
fun = mean)]
I hope this is useful to others.
If nothing else, please at least give me credit for not using the Ross Geller GIF.