Publications by David Robinson
Are high-reputation users quitting Stack Overflow?
I spend a good amount of time on the programming Q&A site Stack Overflow (and a smaller amount of time on its statistics sister site, Cross Validated). Recently, this question on Meta Stack Overflow (the website’s discussion forum) caught my attention: it asked whether Stack Overflow had become “more negative” of late. It wasn...
7039 sym R (2770 sym/9 pcs) 14 img
Don’t teach built-in plotting to beginners (teach ggplot2)
I have some experience teaching R programming (see, for instance, my online course). One of the atypical choices I make is to start by teaching Hadley Wickham’s ggplot2 package, rather than the built-in R plotting (see these videos). Often, when I mention this choice to others involved in statistics education, they treat it like a mistake t...
3902 sym R (249 sym/2 pcs) 4 img 1 tbl
Can R and ggvis help solve Serial’s murder?
Like much of America, I followed season one of Sarah Koenig’s true-crime podcast Serial with an interest that bordered on obsession. Serial tells the story of the 1999 murder of Baltimore high-schooler Hae Min Lee, and of state prisoner Adnan Syed, who was convicted of the crime but to this day maintains his innocence. One especially gripping e...
3661 sym 2 img
K-means clustering is not a free lunch
I recently came across this question on Cross Validated, and I thought it offered a great opportunity to use R and ggplot2 to explore, in depth, the assumptions underlying the k-means algorithm. The question, and my response, follow. K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assump...
9072 sym 14 img
What kind of programmer are you? Stack Exchange can predict it, Shiny can graph it
I was impressed by Stack Exchange’s recent announcement of their machine learning system, Providence, that guesses what kind of programmer you are based on your Stack Overflow traffic. Stack Exchange uses this to choose what questions to show you on their homepage and to recommend jobs to you in their Careers ads. One admirable feature is that ...
1950 sym 2 img
Introducing stackr: An R package for querying the Stack Exchange API
There’s no end of interesting data analyses that can be performed with Stack Overflow and the Stack Exchange network of Q&A sites. Earlier this week I posted a Shiny app that visualizes the personalized prediction data from their machine learning system, Providence. I’ve also looked at whether high-reputation users were decreasing their answe...
4082 sym R (2625 sym/7 pcs) 10 img
View package downloads over time with Shiny
Almost everyone with an R package on CRAN wonders how often it’s installed and used. Two years ago RStudio kindly started offering anonymized logs of downloads from their CRAN mirror, which make it possible to graph the number of downloads over time. Much easier than downloading and processing all of the log files, however, is working with RStu...
1104 sym 2 img
broom: a package for tidying statistical models into data frames
The concept of “tidy data”, as introduced by Hadley Wickham, offers a powerful framework for data manipulation, analysis, and visualization. Popular packages like dplyr, tidyr and ggplot2 take great advantage of this framework, as explored in several recent posts by others. But there’s an important step in a tidy data workflow that so far h...
5548 sym R (3592 sym/9 pcs) 6 img
Slides from my talk on the broom package
This weekend I gave a presentation on my broom package for tidying model objects (see my introduction here) at the UP-STAT 2015 conference at SUNY Geneseo. I’m sharing the slides here, along with some highlights below. I first explained how broom fits with other tidy tools such as dplyr, tidyr and ggplot2 as part of an exploratory data analysis...
1182 sym 10 img
Is Bayesian A/B Testing Immune to Peeking? Not Exactly
Since I joined Stack Exchange as a Data Scientist in June, one of my first projects has been reconsidering the A/B testing system used to evaluate new features and changes to the site. Our current approach relies on computing a p-value to measure our confidence in a new feature. Unfortunately, this leads to a common pitfall in performing A/B test...
21107 sym 18 img 1 tbl