Publications by mjbommar
Pre-processing text: R/tm vs. python/NLTK
Let’s say that you want to take a set of documents and apply a computational linguistic technique. If your method is based on the bag-of-words model, you probably need to pre-process these documents first by segmenting, tokenizing, stripping, stopwording, and stemming each one (phew, that’s a lot of -ing’s). In the past, I’ve...
R Bloggers: The Site I Wish Existed in 2007
My first experience with R was in 2007 as a sophomore in undergrad. As part of a larger project on pricing day-ahead electricity futures, I wanted to cluster locational marginal price (LMP) data from the ISO-NE. Something like k-means is easy to plot and visualize in low dimensions, but this data was better approached by hierarchical met...
Dataset: Tweets from the Chinese Protests #cn220
Earlier this week, I posted a ~100k tweet dataset on the #25bahman protests in Iran. The corresponding figure of frequencies showed a strong presence on Twitter, with over 500 tweets per 5-minute period at peak. You can download the dataset or check out the figure in that post. I decided to take a quick snapshot of the corresponding #...
Tracking the Frequency of Twitter Hashtags with R
I’ve posted three examples of Twitter hashtag datasets in the last week: one on China, one on Iran, and one on Algeria. In order to build these datasets, I needed to obtain older tweets; this is slightly more difficult than simply filtering the streaming feed for your hashtag of choice. The original code I wrote for this task is in Pytho...
Dataset: Wisconsin Union Protester Tweets #wiunion
I’ve been playing with Twitter data over the last week, archiving Algerian, Egyptian, Iranian, and Chinese tweets. I thought I’d bring the story a little closer to home this time by archiving tweets from Wisconsin Union protesters on the #wiunion tag. Grab the dataset of 165,593 tweets here, and check out the two figures of 5-minute t...
A quick look at #march11 / #saudi tweets
Well, so much for that #march11 #Saudi day of rage. Whether it was really the “tempest in a teacup” that Prince Al-Waleed suggested on CNBC (video below, transcript here) or not, the oil complex and Saudi markets seem to have shrugged off much of the risk that was priced in after Thursday’s rumors of shots. As I’ve done before, I ...