Publications by inkhorn82
My Goodness. What a Fat Dataset!
Recently at work we got sent a data file containing information on donations to a specific charitable organization, ranging all the way back to the 80’s. Usually, when we receive a dataset with a donation history in it, each row represents a specific gift from a specific person at a specific time. Also, each column represents some kind of i...
1838 sym 4 img
Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees
Now that I’m on my winter break, I’ve been taking a little bit of time to read up on some modelling techniques that I’ve never used before. Two such techniques are Random Forests and Conditional Trees. Since both can be used for classification, I decided to see how they compare against a simple binomial logistic regression (something I...
4621 sym R (3349 sym/6 pcs) 6 img
My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis
After the work I did for my last post, I wanted to practice doing multiple classification. I first thought of using the famous iris dataset, but felt that was a little boring. Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors. Unfortunately it was tough fo...
6239 sym Python (4128 sym/7 pcs) 10 img 1 tbl
Multiple Classification and Authorship of the Hebrew Bible
Sitting in my synagogue this past Saturday, I started thinking about the authorship analysis that I did using function word counts from texts authored by Shakespeare, Austen, etc. I started to wonder if I could do something similar with the component books of the Torah (Hebrew Bible). A very cursory reading of the Documentary Hypothesis indica...
8710 sym R (3856 sym/6 pcs) 10 img 4 tbl
Finding Patterns Amongst Binary Variables with the homals Package
It’s survey analysis season for me at work! When analyzing survey data, the one kind of analysis I have realized that I’m not used to doing is finding patterns in binary data. In other words, if I have a question to which multiple, non-mutually exclusive (checkbox) answers apply, how do I find the patterns in people’s responses to this ...
2772 sym R (967 sym/3 pcs) 4 img
Split, Apply, and Combine for ffdf
Call me incompetent, but I just can’t get ffdfdply to work with my ffdf dataframes. I’ve tried repeatedly and it just doesn’t seem to work! I’ve seen numerous examples on stackoverflow, but maybe I’m applying them incorrectly. Wanting to do some split-apply-combine on an ffdf, yet again, I finally broke down and made my own funct...
1536 sym R (305 sym/1 pcs) 4 img
Using ddply to select the first record of every group
I had a very long file of monetary transactions (about 207,000 rows) with about two handfuls of columns describing each transaction (including date). The task I needed to perform on this file was to select the value from one of the categorical descriptor columns (called “appeal”) associated with the first transaction found for every ID in t...
1361 sym 6 img
Do Torontonians Want a New Casino? Survey Analysis Part 1
Toronto City Council is in the midst of a very lengthy process of considering whether or not to allow the OLG to build a new casino in Toronto, and where. The process started in November of 2012, and set out to answer this question through many and varied consultations with the public and key stakeholders in the city. One of the methods of ...
12422 sym R (541 sym/1 pcs) 22 img
When the “reorder” function just isn’t good enough…
The reorder function, in R 3.0.0, is behaving strangely (or I’m really not understanding something). Take the following simple data frame: df = data.frame(a1 = c(4,1,1,3,2,4,2), a2 = c("h","j","j","e","c","h","c")) I expect that if I call the reorder function on the a2 vector, using the a1 vector as the vector to o...
1383 sym R (75 sym/2 pcs) 4 img
Which Torontonians Want a Casino? Survey Analysis Part 2
In my last post I said that I would try to investigate the question of who actually does want a casino, and whether place of residence is a factor in where they want the casino to be built. So, here goes something: The first line of attack in this blog post is to distinguish between people based on their responses to the third question on the s...
5427 sym R (729 sym/3 pcs) 8 img 1 tbl