Publications by David Robinson

The monetizr package: make money on your open source R packages


I’ve had the great privilege to be a small part of the R open source community, contributing packages like broom, gganimate, fuzzyjoin, and ggfreehand. In the process I’ve become friends and colleagues with brilliant statisticians and data scientists and learned to engage with data in powerful ways. But there’s one thing that my colleagues ...

3477 sym R (1084 sym/12 pcs)

The adblockr package: block ads from the monetizr package


I was horrified to learn of the existence of the monetizr package, which adds advertisements to R functions. The package goes against the entire philosophy of open source and the spirit of the R community. Luckily, I was able to construct a fix: the adblockr package. Use freely to get an ad-free experience. For example, imagine someone locked the ...

1249 sym R (363 sym/6 pcs)

Understanding Bayesian A/B testing (using baseball statistics)


Previously in this series: Understanding the beta distribution (using baseball statistics); Understanding empirical Bayes estimation (using baseball statistics); Understanding credible intervals (using baseball statistics); Understanding the Bayesian approach to false discovery rates (using baseball statistics). Who is a better batter: Mike Piazza o...

14104 sym R (6284 sym/21 pcs) 10 img 1 tbl

Understanding beta binomial regression (using baseball statistics)


Previously in this series: Understanding the beta distribution; Understanding empirical Bayes estimation; Understanding credible intervals; Understanding the Bayesian approach to false discovery rates; Understanding Bayesian A/B testing. In this series we’ve been using the empirical Bayes method to estimate batting averages of baseball players. Em...

10299 sym R (2417 sym/10 pcs) 10 img

One year as a Data Scientist at Stack Overflow


One day in January 2013 I found myself wasting time on the internet. This wasn’t a good idea: I was as busy as anyone 2.5 years into their PhD. I had to finish a presentation on some yeast genetics research, I was months behind on a paper with an NYU collaborator and even farther behind on some leftover undergraduate research. I was also busy i...

19775 sym 10 img

Releasing the StackLite dataset of Stack Overflow questions and tags


At Stack Overflow we’ve always been committed to sharing data: all content contributed to the site is CC-BY-SA licensed, and we release regular “data dumps” of our entire history of questions and answers. I’m excited to announce a new resource specially aimed at data scientists, analysts and other researchers, which we’re calling the St...

3435 sym R (2814 sym/9 pcs) 4 img

stacksurveyr: An R package with the 2016 Developer Survey Results


This year, more than fifty thousand programmers answered the Stack Overflow 2016 Developer Survey, in the largest survey of professional developers in history. Last week Stack Overflow released the full (anonymized) results of the survey. To make analysis in R even easier, today I’m also releasing the stacksurveyr ...

2551 sym R (5236 sym/15 pcs) 6 img

Does sentiment analysis work? A tidy analysis of Yelp reviews


This year Julia Silge and I released the tidytext package for text mining using tidy tools such as dplyr, tidyr, ggplot2 and broom. One of the canonical examples of tidy text mining this package makes possible is sentiment analysis. Sentiment analysis is often used by companies to quantify general social media opinion (for example, using tweets a...

7022 sym R (9906 sym/25 pcs) 10 img

Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half


I don’t normally post about politics (I’m not particularly savvy about polling, which is where data science has had the largest impact on politics). But this weekend I saw a hypothesis about Donald Trump’s Twitter account that simply begged to be investigated with data: Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic...

9691 sym R (5634 sym/15 pcs) 14 img

useR and JSM 2016 conferences: a story in tweets


I was amused by a Guardian article last month that declared “I’m a serious academic, not a professional Instagrammer,” arguing that social media is a distraction for scientific research. This attitude was, to say the least, not popular on academic Twitter, which responded with the #seriousacademic hashtag. When someone tries to claim that ...

16247 sym R (800 sym/2 pcs) 4 img