Publications by inkhorn82
Access individual elements of a row while using the apply function on your dataframe (or “applying down while thinking across”)
The apply function in R is a huge workhorse for me across many projects. My usage of it is pretty stereotypical: usually, I use it to aggregate a targeted group of columns for every row in a dataframe. Those aggregations could be counts, sum totals, or possibly a binary column that flags some condition based on a count or sum to...
4696 sym R (749 sym/7 pcs) 8 img
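The row-wise aggregation described in the excerpt above can be sketched roughly as follows. This is a minimal illustration and not the code from the post itself: the dataframe `df`, the column names `q1` to `q3`, and the threshold of 2 are all hypothetical.

```r
# Hypothetical survey-style data: one id column plus three target columns.
df <- data.frame(id = 1:3,
                 q1 = c(1, 0, 1),
                 q2 = c(0, 0, 1),
                 q3 = c(1, 0, 1))

# The targeted group of columns to aggregate across (assumed names).
target <- c("q1", "q2", "q3")

# MARGIN = 1 tells apply to call the function once per row.
df$total <- apply(df[, target], 1, sum)

# Inside the function each row arrives as a named vector, so individual
# elements can be accessed by column name while "applying down".
df$q1_plus_q3 <- apply(df[, target], 1, function(row) row[["q1"]] + row[["q3"]])

# Count of positive answers per row.
df$n_positive <- apply(df[, target], 1, function(row) sum(row > 0))

# Binary flag derived from the row total (threshold is an assumption).
df$flag <- as.integer(df$total >= 2)
```

Note that apply converts the selected columns to a matrix first, so this pattern is safest when the targeted columns share one type.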
save.ffdf and load.ffdf: Save and load your big data – quickly and neatly!
I’m very indebted to the ff and ffbase packages in R. Without them, I would probably have to use some less savoury stats program for the bigger data analysis projects I do at work. Since I started using ff and ffbase, I have resorted to saving and loading my ff dataframes using ffsave and ffload. The syntax isn’t so bad, but the resul...
1814 sym 6 img
Estimate Age from First Name
Today I read a cute post from Flowing Data on the most trendy names in US history. What caught my attention was a link posted in the article to the source data, which happens to be yearly lists of baby names registered with the US Social Security Administration since 1880 (see here). I thought that it might be good to compile and use these lists at wor...
2639 sym 6 img
Estimating Ages from First Names Part 2 – Using Some Morbid Test Data
In my last post, I wrote about how I compiled a US Social Security Administration data set into something usable in R, and mentioned some issues scaling it up to be usable for bigger datasets. I also mentioned the need for data to test out the accuracy of my estimates. First, I’ll show you how I prepped the dataset so that it became more scalable (for...
3157 sym R (424 sym/2 pcs) 6 img
Package sqldf eases the multivariable sorting pain
This will be a quick one. I was trying to sort my dataframe so that it went in ascending order on one variable and descending order on another variable. This was really REALLY bothersome to try to figure out with base R functions. Then I remembered sqldf! # Assuming dataframe named 'mydf' and 'V1' and 'V2' are your variables you want to sor...
782 sym R (197 sym/1 pcs)
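A mixed ascending/descending sort like the one in the excerpt above might look like the sketch below. The names `mydf`, `V1`, and `V2` follow the excerpt; the data values are made up, and the query is an illustration, not necessarily the exact statement from the post. The sqldf part only runs if the package is installed, and a base R equivalent for numeric columns is included for comparison.

```r
# Made-up data; mydf, V1 and V2 are the names used in the excerpt.
mydf <- data.frame(V1 = c(2, 1, 2, 1),
                   V2 = c(10, 20, 30, 40))

# Base R: ascending on V1, descending on V2. Negating the key only
# works for numeric columns, which is part of what makes this awkward.
sorted_base <- mydf[order(mydf$V1, -mydf$V2), ]

# sqldf: the sort direction is stated per column in plain SQL.
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  sorted_sql <- sqldf("SELECT * FROM mydf ORDER BY V1 ASC, V2 DESC")
}
```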
sapply is my new friend!
I’ve written previously about how the apply function is a major workhorse in many of my work projects. What I didn’t know is how handy the sapply function can be! There are a couple of cases so far where I’ve found that sapply really comes in handy for me: 1) If I want to quickly see some descriptive stats for multiple columns in my datafr...
1402 sym 4 img
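The first use case in the excerpt above, quick descriptive stats for several columns, can be sketched like this. The dataframe and column names are hypothetical, and this is one plausible reading of the truncated excerpt, not the post's own code.

```r
# Hypothetical dataframe with two numeric columns and one label column.
mydf <- data.frame(height = c(1, 2, 3),
                   weight = c(4, 5, 6),
                   label  = c("a", "b", "c"))

# Columns to summarise (assumed names).
num_cols <- c("height", "weight")

# sapply calls the function once per column and simplifies the result
# to a named vector.
col_means <- sapply(mydf[, num_cols], mean)

# When the function returns several values, the result is a matrix with
# one column per dataframe column.
col_ranges <- sapply(mydf[, num_cols], range)
```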
Who uses E-Bikes in Toronto? Fun with Recursive Partitioning Trees and Toronto Open Data
I found a fun survey released to the Toronto Open Data website that investigates the travel/commuting behaviour of Torontonians, but with a special focus on E-bikes. When I opened up the file, I found various demographic information, in addition to a question asking people their most frequently used mode of transportation. Exactly 2,238 peop...
2888 sym 8 img
Big and small daycares in Toronto by building type, mapped using RGoogleMaps and Toronto Open Data
Before my daughter was born, I thought that my wife and I would have to send her to a licensed child care centre somewhere in Toronto. I had heard over and over how long a waiting list I should expect the centre to have, and so we’d better get her registered nice and early! Well, it turns out that we found an excellent unlicensed home da...
7787 sym 20 img
When did “How I Met Your Mother” become less legen.. wait for it…
…dary! Or, as you’ll see below, when did it become slightly less legendary? The analysis in this post was inspired by DiffusePrioR’s analysis of when The Simpsons became less Cromulent. When I read his post a while back, I thought it was pretty cool and told myself that I would use his method on another relevant dataset in the future. E...
6639 sym R (1794 sym/4 pcs) 8 img
A Rather Nosy Topic Model Analysis of the Enron Email Corpus
Having only ever played with Latent Dirichlet Allocation using gensim in Python, I was very interested to see a nice example of this kind of topic modelling in R. Whenever I see a really cool analysis done, I get the urge to do it myself. What better corpus to do topic modelling on than the Enron email dataset?!?!? Let me tell you, this thi...
5249 sym Python (9487 sym/10 pcs) 6 img