Publications by inkhorn82
Enron Email Corpus Topic Model Analysis Part 2 – This Time with Better regex
After posting my analysis of the Enron email corpus, I realized that the regex patterns I set up to capture and filter out the cautionary/privacy messages at the bottoms of people’s emails were not working. Let’s have a look at my revised Python code for processing the corpus: As I have not changed the R code since the last post, let’s have a...
3584 sym Python (7562 sym/11 pcs) 8 img
Nuclear vs Green Energy: Share the Wealth or Get Your Own?
Thanks to Ontario Open Data, a survey dataset was recently made public containing people’s responses to questions about Ontario’s Long Term Energy Plan (LTEP). The survey did fairly well in terms of raw response numbers, with 7,889 responses in total (although who knows how many people it was sent to!). As you’ll see in later images in ...
6309 sym 20 img
Teaching a Class of Undergrads, RStudio Server, and My Ubuntu Machine
I was chatting about public speaking with my brother, who is a Lecturer in the Faculty of Pharmacy at UofT, when he offered me the opportunity to come to his class and teach about R. Always eager to spread the analytical goodness, I said yes! The class is this Friday, and I am excited. For this class I’ll be making use of RStudio Server, ra...
2142 sym 6 img
UofT R session went well. Thanks RStudio Server!
Apart from going longer than I had anticipated, very little of any significance went wrong during my R session at UofT on Friday! It took a while at the beginning for everyone to get set up. Everyone was connecting to my home RStudio Server via UofT’s wireless network. This meant that if any students weren’t set up to use wireless in th...
2307 sym 6 img
A Delicious Analysis! (aka topic modelling using recipes)
A few months ago, I saw a link on Twitter to an awesome graph charting the similarities of different foods based on their flavour compounds, in addition to their prevalence in recipes (see the whole study, The Flavor Network and the Principles of Food Pairing). I thought this was really neat and became interested in potentially using the data ...
5633 sym R (445 sym/7 pcs) 10 img
Ontario First Nations Libraries Compared Using Ontario Open Data
I recently downloaded a very cool dataset on Ontario libraries from the Ontario Open Data Catalogue. The dataset contains 142 columns of information describing 386 libraries in Ontario, representing a fantastically massive data collection effort for such important cultural institutions (although the most recent information available is as of 2...
8045 sym R (853 sym/5 pcs) 24 img
Data Until I Die: My blog title and statement of values
When I started keeping this blog, my intent was to write about and keep helpful snippets of R code that I used in my line of work. It was the start of my second job after grad school and I was really excited about getting to use R on a regular basis outside of academia! Well, time went on and the number of posts I put on here grew. After...
1844 sym 4 img
Predictive modelling fun with the caret package
I’m back! 6 months after my second child was born, I’ve finally made it back to my blog with something fun to write about. I recently read through the excellent Machine Learning with R ebook and was impressed by the caret package and how easy it made it seem to do predictive modelling that was a little more than just the basics. With tha...
4419 sym R (4408 sym/7 pcs) 4 img
Contraceptive Choice in Indonesia
I wanted yet another opportunity to use the fabulous caret package, but also to finally give plot.ly a try. To scratch both itches, I dipped into the UCI Machine Learning Repository yet again and came up with a survey data set on the topic of contraceptive choice in Indonesia. This was an interesting opportunity for me to learn about a fa...
7708 sym R (5949 sym/4 pcs) 22 img
Predicting Mobile Phone Prices
Recently a colleague of mine showed me a nauseating interactive scatterplot that plots mobile phones according to two dimensions of the user’s choice from a list of possible dimensions. Although the interactive visualization was offensive to my tastes, the JSON data behind the visualization was intriguing. It was easy enough to get the data...
7350 sym R (1499 sym/2 pcs) 22 img