Publications by Edwin Chen

Piiikaaachuuuuuu vs. KHAAAAAN!

13.03.2011

This is a fun image I found on Neil Kodner’s blog: [image] But I’ve never actually watched any of the Star Trek movies, so I decided to recreate the graph with Pikachu instead: [image] Here’s a smoothed version to better compare the counts between different letters: [image] Unsurprisingly, people like to elongate the “u” in “pikachu” a lot better than ...

891 sym 22 img

Hacker News Analysis

13.03.2011

I was playing around with the Hacker News database Ronnie Roller made (thanks!), so I thought I’d post some of my findings. Activity on the Site: My first question was how activity on the site has increased over time. I looked at the number of posts, points on posts, and comments on posts. Posts: This looks like a strong linear fit, with an increas...

2850 sym 48 img

Introduction to Cointegration and Pairs Trading

15.04.2011

Introduction: Suppose you see two drunks (i.e., two random walks) wandering around. The drunks don’t know each other (they’re independent), so there’s no meaningful relationship between their paths. But suppose instead you have a drunk walking with her dog. This time there is a connection. What’s the nature of this connection? Notice that ...

5498 sym 42 img

A Mathematical Introduction to Least Angle Regression

20.04.2011

(For a layman’s introduction, see here.) Least Angle Regression (aka LARS) is a model selection method for linear regression (when you’re worried about overfitting or want your model to be easily interpretable). To motivate it, let’s consider some other model selection methods: Forward selection starts with no variables in the model, and a...

4905 sym 2 img

Kickstarter Data Analysis: Success and Pricing

25.04.2011

Kickstarter is an online crowdfunding platform for launching creative projects. When starting a new project, project owners specify a deadline and the minimum amount of money they need to raise. They receive the money (less a transaction fee) only if they reach or exceed that minimum; otherwise, no money changes hands. What’s particularly fun a...

5436 sym 30 img

Choosing a Machine Learning Classifier

26.04.2011

How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “...

4785 sym

Filtering for English Tweets: Unsupervised Language Detection on Twitter

30.04.2011

(See a demo here.) While working on a Twitter sentiment analysis project, I ran into the problem of needing to filter out all non-English tweets. (Asking the Twitter API for English-only tweets doesn’t seem to work, as it nonetheless returns tweets in Spanish, Portuguese, Dutch, Russian, and a couple other languages.) Since I didn’t have any ...

5939 sym 2 img

Bayesian Confidence Intervals: Obama’s ‘That’-Addition and Informality

01.05.2011

No “That” Left Behind? I came across a post on Language Log last week giving some evidence that Obama tends to add “that” to the prepared version of his speeches. For example, in a recent speech at George Washington University, the prepared speech was written as: It’s about the kind of future we want. It’s about the kind of country we believ...

5372 sym R (1347 sym/4 pcs) 70 img

Topic Modeling the Sarah Palin Emails

27.06.2011

tl;dr: Browse through Sarah Palin’s emails, automagically organized by topic, here. LDA-based Email Browser: Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I did some topic modeling (in particular, u...

2746 sym 20 img

Introduction to Restricted Boltzmann Machines

17.07.2011

Suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical factor analysis, you could then try to explain each movie and user in terms of a set of latent factors. For example, movies like Star Wars and Lord of the Rings might have strong associations with a latent science fiction and fantasy factor, and users who like...

11819 sym R (494 sym/1 pcs) 2 img