Publications by Edwin Chen
Piiikaaachuuuuuu vs. KHAAAAAN!
This is a fun image I found on Neil Kodner’s blog: [image] But I’ve never actually watched any of the Star Trek movies, so I decided to recreate the graph with Pikachu instead: [image] Here’s a smoothed version to better compare the counts between different letters: [image] Unsurprisingly, people like to elongate the “u” in “pikachu” a lot better than ...
Hacker News Analysis
I was playing around with the Hacker News database Ronnie Roller made (thanks!), so I thought I’d post some of my findings. Activity on the Site: My first question was: how has activity on the site increased over time? I looked at the number of posts, points on posts, and comments on posts. Posts: This looks like a strong linear fit, with an increas...
Introduction to Cointegration and Pairs Trading
Introduction: Suppose you see two drunks (i.e., two random walks) wandering around. The drunks don’t know each other (they’re independent), so there’s no meaningful relationship between their paths. But suppose instead you have a drunk walking with her dog. This time there is a connection. What’s the nature of this connection? Notice that ...
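To make the drunk-and-her-dog picture concrete, here is a minimal simulation sketch. It assumes a simple error-correction model that is not taken from the excerpt: the drunk follows a plain random walk, and at each step the dog closes a fraction `pull` of the gap to her owner. The function name, `pull`, the step count, and the seed are all illustrative choices.

```python
import random

def simulate_drunk_and_dog(n_steps=2000, pull=0.2, seed=0):
    """Simulate a random walk (the drunk) and a second walk (the dog)
    that is nudged back toward the first at every step.

    The error-correction form and the `pull` value are assumptions
    made for illustration only.
    """
    rng = random.Random(seed)
    drunk, dog = [0.0], [0.0]
    for _ in range(n_steps):
        # The drunk takes an independent Gaussian step: a plain random walk.
        drunk.append(drunk[-1] + rng.gauss(0, 1))
        # The dog wanders too, but closes a fraction `pull` of the gap.
        gap = drunk[-1] - dog[-1]
        dog.append(dog[-1] + pull * gap + rng.gauss(0, 1))
    return drunk, dog

def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

drunk, dog = simulate_drunk_and_dog()
spread = [a - b for a, b in zip(drunk, dog)]
print("variance of the drunk's path:", round(variance(drunk), 1))
print("variance of the gap:         ", round(variance(spread), 1))
```

Under these assumptions each path wanders without bound, while the gap between the two keeps getting pulled back toward zero, so its variance stays small. That bounded gap is the kind of connection the two independent drunks lack.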
A Mathematical Introduction to Least Angle Regression
(For a layman’s introduction, see here.) Least Angle Regression (aka LARS) is a model selection method for linear regression (when you’re worried about overfitting or want your model to be easily interpretable). To motivate it, let’s consider some other model selection methods: Forward selection starts with no variables in the model, and a...
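As a rough illustration of the forward selection baseline mentioned above, here is a small sketch in plain Python. The excerpt is cut off, so this follows the usual textbook version (an assumption on my part): start with no variables and, at each step, add the variable that most reduces the residual sum of squares. The function names and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(col):
    """Subtract the mean, which stands in for fitting an intercept."""
    mean = sum(col) / len(col)
    return [x - mean for x in col]

def forward_selection(columns, y, n_select):
    """Greedy forward selection for least squares.

    columns: dict mapping variable name -> list of values
    y: list of responses
    Returns the names of the chosen variables, in the order added.
    """
    residual = center(y)
    remaining = {name: center(col) for name, col in columns.items()}
    basis = []   # selected columns, orthogonalized against each other
    chosen = []
    for _ in range(n_select):
        best_name, best_gain, best_q = None, 0.0, None
        for name, col in remaining.items():
            # Remove the part of this column already explained by the model.
            q = col
            for b in basis:
                coef = dot(q, b) / dot(b, b)
                q = [qi - coef * bi for qi, bi in zip(q, b)]
            norm = dot(q, q)
            if norm < 1e-12:
                continue  # column is (nearly) redundant with the model
            # Drop in residual sum of squares if this variable were added.
            gain = dot(residual, q) ** 2 / norm
            if gain > best_gain:
                best_name, best_gain, best_q = name, gain, q
        if best_name is None:
            break
        coef = dot(residual, best_q) / dot(best_q, best_q)
        residual = [r - coef * qi for r, qi in zip(residual, best_q)]
        basis.append(best_q)
        chosen.append(best_name)
        del remaining[best_name]
    return chosen

# Toy data: y depends strongly on "a", weakly on "b", and not on "c".
columns = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, -1, 1, -1, 1, -1],
    "c": [0, 1, 0, 0, 1, 0],
}
y = [2 * a + 0.5 * b for a, b in zip(columns["a"], columns["b"])]
print(forward_selection(columns, y, n_select=2))
```

Orthogonalizing each candidate against the variables already chosen avoids refitting a full regression per candidate, while picking the same variable a full refit would.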
Kickstarter Data Analysis: Success and Pricing
Kickstarter is an online crowdfunding platform for launching creative projects. When starting a new project, project owners specify a deadline and the minimum amount of money they need to raise. They receive the money (less a transaction fee) only if they reach or exceed that minimum; otherwise, no money changes hands. What’s particularly fun a...
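The all-or-nothing rule described above is easy to state as a function. This is only a sketch: the function name is hypothetical, and the 5% `fee_rate` is an illustrative assumption, not a figure from the post.

```python
def owner_payout(pledged, goal, fee_rate=0.05):
    """Money the project owner receives under all-or-nothing funding.

    fee_rate is an assumed transaction fee used purely for illustration.
    """
    if pledged >= goal:
        # Goal reached or exceeded: the owner gets the pledges less the fee.
        return pledged * (1 - fee_rate)
    # Goal missed: no money changes hands.
    return 0.0

print(owner_payout(pledged=12000, goal=10000))
print(owner_payout(pledged=9999, goal=10000))
```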
Choosing a Machine Learning Classifier
How do you know which machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple of different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “...
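Here is a minimal sketch of what "select the best one by cross-validation" can look like, in plain Python. The two candidate classifiers (a majority-class baseline and 1-nearest-neighbor), the toy data, and all function names are assumptions picked to keep the example self-contained; in practice you would swap in real algorithms and parameter settings.

```python
from collections import Counter

def majority_classifier(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbor_classifier(train):
    """Predict the label of the closest training point (squared distance)."""
    def predict(x):
        nearest = min(
            train,
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
        )
        return nearest[1]
    return predict

def cross_val_accuracy(make_classifier, data, k=5):
    """k-fold cross-validation accuracy using interleaved folds."""
    correct = 0
    for fold in range(k):
        # Every k-th point is held out; the rest are used for training.
        test = data[fold::k]
        train = [d for i, d in enumerate(data) if i % k != fold]
        predict = make_classifier(train)
        correct += sum(predict(x) == y for x, y in test)
    return correct / len(data)

# Toy 1-D data set: points below 10 are "low", the rest are "high".
data = [((float(x),), "low" if x < 10 else "high") for x in range(20)]

candidates = {
    "majority": majority_classifier,
    "1-nearest-neighbor": nearest_neighbor_classifier,
}
scores = {name: cross_val_accuracy(make, data) for name, make in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```

The same loop extends to tuning parameters within an algorithm: treat each parameter setting as another entry in `candidates`.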
Filtering for English Tweets: Unsupervised Language Detection on Twitter
(See a demo here.) While working on a Twitter sentiment analysis project, I ran into the problem of needing to filter out all non-English tweets. (Asking the Twitter API for English-only tweets doesn’t seem to work, as it nonetheless returns tweets in Spanish, Portuguese, Dutch, Russian, and a couple of other languages.) Since I didn’t have any ...
Bayesian Confidence Intervals: Obama’s ‘That’-Addition and Informality
No “That” Left Behind? I came across a post on Language Log last week giving some evidence that Obama tends to add “that” to the prepared version of his speeches. For example, in a recent speech at George Washington University, the prepared speech was written as: It’s about the kind of future we want. It’s about the kind of country we believ...
Topic Modeling the Sarah Palin Emails
tl;dr: Browse through Sarah Palin’s emails, automagically organized by topic, here. LDA-based Email Browser: Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I did some topic modeling (in particular, u...
Introduction to Restricted Boltzmann Machines
Suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical factor analysis, you could then try to explain each movie and user in terms of a set of latent factors. For example, movies like Star Wars and Lord of the Rings might have strong associations with a latent science fiction and fantasy factor, and users who like...