Publications by Zach Mayer
Intro
This blog will show you how to build tools to survive in the modern world. I will focus on statistics and machine learning, because that’s where my strengths lie, but sometimes we may find ourselves veering far off course. My primary interest lies in using computers to solve problems, and I will spend the majority of my time discussi...
Parallelizing and cross-validating feature selection in R
This is an example piece of code for the Overfitting competition at kaggle.com. This method has an AUC score of ~0.91, which is currently good enough for about 38th place on the leaderboard. If you read the competition forums closely, you will find code that is good enough to tie for 25th place, as well as hints as to how to break into...
Kaggle Competition Walkthrough: Introduction
Kaggle is a site for participating in predictive analytics competitions. It is also a great resource for learning how to build powerful predictive models, and the Overfitting competition provides a good introduction to the common tools used by a predictive analyst. To start, you will need to download R for your platform. If you don’t live near P...
Kaggle Competition Walkthrough: Fitting a model
Now that we’ve got the data we need into R, it is very easy to fit a model using the caret package. Caret’s workhorse function is called ‘train,’ and it allows you to fit a wide variety of models using the same syntax. Furthermore, many models have ‘hyperparameters’ that require tuning, such as the number of neighbors fo...
Kaggle Competition Walkthrough: Wrapup
The Kaggle Don’t Overfit competition is over, and I took 11th place! Additionally, I tied with tks for contributing the most to the forum, so thanks to everyone who voted for me! I voted for tks, and I’m very happy to share the prize with him, as most of my code is based on his work. The top finishers in this competition did a...
Importing Google News data to R
I’ve been playing around lately with the stock market data available from Google Finance, through quantmod in R. Here’s a function I’ve written (which depends on the R Data Science Toolkit) to pull news stories related to a stock from Google, parse them, and save them as a data frame. Let me know what you think! Related To ...
Parallel random forests using foreach
There’s been some discussion on the Kaggle forums and on a few blogs about various ways to parallelize random forests, so I thought I’d add my thoughts on the issue. Here’s my version of the ‘parRF’ function, which is based on the elegant version in the foreach vignette. This function works very simply: you pass it a vector of...
Forecasting recessions
John Hussman has a Recession Warning Composite that I am attempting to replicate and improve. The underlying data seem to be easy enough to get from FRED using the quantmod package in R. I don’t quite understand the index Hussman is using for commercial paper, so I used the ‘3-month AA financial commercial paper index’ from FRED....
Scraping web data in R
In my last post, I went to a lot of effort to scrape the PMI index off the ISM website. It turns out that effort was unnecessary: commenter “senne” pointed out that this index is available from FRED under the symbol NAPM. I’ve updated my code, which now pulls all the data straight from FRED. However, it was sur...
Using the Google Prediction API from R
Google has a “black box” prediction API that they provide for creating recommender systems or filtering spam. They also provide an R package for interfacing with that API, but try as I might, I cannot get it to work under Windows. Here are the instructions for setting up the API to run in R under Linux. I haven’t tr...