Publications by Edwin Chen
Tweets vs. Likes: What gets shared on Twitter vs. Facebook?
It always strikes me as curious that some posts get a lot of love on Twitter, while others get many more shares on Facebook: What accounts for this difference? Some of it is surely site-dependent: maybe one blogger has a Facebook page but not a Twitter account, while another has these roles reversed. But even on sites maintained by a single auth...
11965 sym 44 img
Introduction to Latent Dirichlet Allocation
Introduction Suppose you have the following set of sentences: I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast. Chinchillas and kittens are cute. My sister adopted a kitten yesterday. Look at this cute hamster munching on a piece of broccoli. What is latent Dirichlet allocation? It’s a way of automatically...
10674 sym 4 img
Information Transmission in a Social Network: Dissecting the Spread of a Quora Post
tl;dr See this movie visualization for a case study on how a post propagates through Quora. How does information spread through a network? Much of Quora’s appeal, after all, lies in its social graph — and when you’ve got a network of users, all broadcasting their activities to their neighbors, information can cascade in multiple ways. How d...
8095 sym 28 img
Stuff Harvard People Like
What types of students go to which schools? There are, of course, the classic stereotypes: MIT has the hacker engineers. Stanford has the laid-back, social folks. Harvard has the prestigious leaders of the world. Berkeley has the activist hippies. Caltech has the hardcore science nerds. But how well do these perceptions match reality? What are ...
25539 sym
Winning the Netflix Prize: A Summary
How was the Netflix Prize won? I went through a lot of the Netflix Prize papers a couple years ago, so I’ll try to give an overview of the techniques that went into the winning solution here. Normalization of Global Effects Suppose Alice rates Inception 4 stars. We can think of this rating as composed of several parts: A baseline rating (e.g.,...
14776 sym 2 img
Introduction to Conditional Random Fields
Imagine you have a sequence of snapshots from a day in Justin Bieber’s life, and you want to label each image with the activity it represents (eating, sleeping, driving, etc.). How can you do this? One way is to ignore the sequential nature of the snapshots, and build a per-image classifier. For example, given a month’s worth of labeled snaps...
12233 sym 2 img
Quick Introduction to ggplot2
For a much better looking version of this post (where code is actually readable!), see this GitHub repository, which also contains some of the example datasets I use and a literate programming version of this tutorial. Introduction This is a bare-bones introduction to ggplot2, a visualization package in R. It assumes no knowledge of R and teaches...
4969 sym R (3418 sym/24 pcs) 38 img
Movie Recommendations and More via MapReduce and Scalding
Scalding is an in-house MapReduce framework that Twitter recently open-sourced. Like Pig, it provides an abstraction on top of MapReduce that makes it easy to write big data jobs in a syntax that’s simple and concise. Unlike Pig, Scalding is written in pure Scala – which means all the power of Scala and the JVM is already built in. No more UD...
10773 sym R (10130 sym/30 pcs) 22 img 15 tbl
Instant Interactive Visualization with d3 + ggplot2
It’s often easier to understand a chart than a table. So why is it still so hard to make a simple data graphic, and why am I still bombarded by mind-numbing reams of raw numbers? (Yeah, I love ggplot2 to death. But sometimes I want a little more interaction, and sometimes all I want is to drag-and-drop and be done.) So I’ve been experimenting...
1433 sym 6 img
Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process
Imagine you’re a budding chef. A data-curious one, of course, so you start by taking a set of foods (pizza, salad, spaghetti, etc.) and asking 10 friends how much of each they ate in the past day. Your goal: to find natural groups of foodies, so that you can better cater to each cluster’s tastes. For example, your fratboy friends might love wing...
22024 sym Python (4368 sym/12 pcs) 62 img 6 tbl