Publications by David Robinson
Understanding empirical Bayes estimation (using baseball statistics)
Which of these two proportions is higher: 4 out of 10, or 300 out of 1000? This sounds like a silly question. Obviously 4/10 = 0.4, which is greater than 300/1000 = 0.3. But suppose you were a baseball recruiter, trying to decide which of two potential players is a better batter based on how many hits they get. One has achieved 4 hits in 10 chances, the other 300 hits i...
11276 sym R (1906 sym/6 pcs) 8 img 5 tbl
Understanding credible intervals (using baseball statistics)
Previously in this series: Understanding the beta distribution (using baseball statistics); Understanding empirical Bayes estimation (using baseball statistics). In my last post, I explained the method of empirical Bayes estimation, a way to calculate useful proportions out of many pairs of success/total counts (e.g. 0/1, 3/10, 235/1000). I used t...
9387 sym R (1545 sym/5 pcs) 10 img 2 tbl
Understanding the Bayesian approach to false discovery rates (using baseball statistics)
Previously in this series: Understanding the beta distribution (using baseball statistics); Understanding empirical Bayes estimation (using baseball statistics); Understanding credible intervals (using baseball statistics). In my last few posts, I’ve been exploring how to perform estimation of batting averages, as a way to demonstrate empirical B...
11799 sym R (1618 sym/10 pcs) 8 img 2 tbl
What are the most polarizing programming languages?
Users on Stack Overflow Careers, our site for matching developers with jobs, can create customized profiles (“CVs”) to show to prospective employers. As part of these profiles, they have the option of specifying specific technologies they like or dislike. This produces an interesting and unusual opportunity for our data team to analyze the o...
5302 sym 12 img
Cleaning and visualizing genomic data: a case study in tidy analysis
I recently ran into a question looking for a case study in genomics, particularly for teaching ggplot2, dplyr, and the tidy data framework developed by Hadley Wickham. There exist many great resources for learning how to analyze genomic data using Bioconductor tools, including these workflows and package vignettes. But case studies for teaching t...
16586 sym R (3463 sym/18 pcs) 16 img
Modeling gene expression with broom: a case study in tidy analysis
Previously in this series: Cleaning and visualizing genomic data: a case study in tidy analysis. In the last post, we examined an available genomic dataset from Brauer et al. 2008 about yeast gene expression under nutrient starvation. We learned to tidy it with the dplyr and tidyr packages, and saw how useful this tidied form is for visualizing an...
11718 sym R (6100 sym/20 pcs) 20 img
The ‘lost boarding pass’ puzzle: efficient simulation in R
A family member recently sent me a puzzle: One hundred people are lined up with their boarding passes showing their seats on the 100-seat Plane. The first guy in line drops his pass as he enters the plane, and unable to pick it up with others behind him sits in a random seat. The people behind him, who have their passes, sit in their seats unti...
7929 sym R (4435 sym/13 pcs) 4 img
Analyzing networks of characters in ‘Love Actually’
Every Christmas Eve, my family watches Love Actually. Objectively it’s not a particularly, er, good movie, but it’s well-suited for a holiday tradition. (Vox has got my back here). Even on the eighth or ninth viewing, it’s impressive what an intricate network of characters it builds. This got me wondering how we could visualize the connecti...
5560 sym R (2985 sym/8 pcs) 12 img
Why I use ggplot2
If you’ve read my blog, taken one of my classes, or sat next to me on an airplane, you probably know I’m a big fan of Hadley Wickham’s ggplot2 package, especially compared to base R plotting. Not everyone agrees. Among the anti-ggplot2 crowd is JHU Professor Jeff Leek, who yesterday wrote up his thoughts on the Simply Statistics blog: …o...
20621 sym R (1274 sym/3 pcs) 10 img
How to replace a pie chart
Yesterday a family member forwarded me a Wall Street Journal interview titled What Data Scientists Do All Day At Work. The title intrigued me immediately, partly because I find myself explaining that same topic somewhat regularly. I wasn’t disappointed in the interview: General Electric’s Dr. Narasimhan gave insightful and well-communicated a...
6191 sym R (2297 sym/9 pcs) 20 img 1 tbl