Publications by David Robinson

Are high-reputation users quitting Stack Overflow?

14.12.2014

I spend a good amount of time on the programming Q&A site Stack Overflow (and a smaller amount of time on its statistics sister site, Cross Validated). Recently this question on Meta Stack Overflow (the website’s discussion forum) caught my attention, asking whether Stack Overflow had become “more negative” of late. It wasn...

7039 sym R (2770 sym/9 pcs) 14 img

Don’t teach built-in plotting to beginners (teach ggplot2)

15.12.2014

I have some experience teaching R programming (see, for instance, my online course). One of the atypical choices I make is to start by teaching Hadley Wickham’s ggplot2 package, rather than the built-in R plotting (see these videos). Often when I mention this choice to others involved in statistics education, they treat it like a mistake t...

3902 sym R (249 sym/2 pcs) 4 img 1 tbl

Can R and ggvis help solve Serial’s murder?

22.12.2014

Like much of America, I followed season one of Sarah Koenig’s true-crime podcast Serial with an interest that bordered on obsession. Serial tells the story of the 1999 Baltimore murder of high-schooler Hae Min Lee, and of state prisoner Adnan Syed, who was convicted of the crime but to this day maintains his innocence. One especially gripping e...

3661 sym 2 img

K-means clustering is not a free lunch

15.01.2015

I recently came across this question on Cross Validated, and I thought it offered a great opportunity to use R and ggplot2 to explore, in depth, the assumptions underlying the k-means algorithm. The question, and my response, follow. K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assump...

9072 sym 14 img

What kind of programmer are you? Stack Exchange can predict it, Shiny can graph it

01.02.2015

I was impressed by Stack Exchange’s recent announcement of their machine learning system, Providence, that guesses what kind of programmer you are based on your Stack Overflow traffic. Stack Exchange uses this to choose what questions to show you on their homepage and to recommend jobs to you in their Careers ads. One admirable feature is that ...

1950 sym 2 img

Introducing stackr: An R package for querying the Stack Exchange API

03.02.2015

There’s no end of interesting data analyses that can be performed with Stack Overflow and the Stack Exchange network of Q&A sites. Earlier this week I posted a Shiny app that visualizes the personalized prediction data from their machine learning system, Providence. I’ve also looked at whether high-reputation users were decreasing their answe...

4082 sym R (2625 sym/7 pcs) 10 img

View package downloads over time with Shiny

05.03.2015

Almost everyone with an R package on CRAN wonders how often it’s installed and used. Two years ago RStudio kindly started offering anonymized logs of downloads from their CRAN mirror, which allow one to graph the number of downloads over time. Much easier than downloading and processing all of the log files, however, is working with RStu...

1104 sym 2 img

broom: a package for tidying statistical models into data frames

19.03.2015

The concept of “tidy data”, as introduced by Hadley Wickham, offers a powerful framework for data manipulation, analysis, and visualization. Popular packages like dplyr, tidyr and ggplot2 take great advantage of this framework, as explored in several recent posts by others. But there’s an important step in a tidy data workflow that so far h...

5548 sym R (3592 sym/9 pcs) 6 img

Slides from my talk on the broom package

13.04.2015

This weekend I gave a presentation on my broom package for tidying model objects (see my introduction here) at the UP-STAT 2015 conference at SUNY Geneseo. I’m sharing the slides here, along with some highlights below. I first explained how broom fits with other tidy tools such as dplyr, tidyr and ggplot2 as part of an exploratory data analysis...

1182 sym 10 img

Is Bayesian A/B Testing Immune to Peeking? Not Exactly

20.08.2015

Since I joined Stack Exchange as a Data Scientist in June, one of my first projects has been reconsidering the A/B testing system used to evaluate new features and changes to the site. Our current approach relies on computing a p-value to measure our confidence in a new feature. Unfortunately, this leads to a common pitfall in performing A/B test...

21107 sym 18 img 1 tbl