Publications by Zach Mayer

Intro

22.04.2011

This blog will show you how to build tools to survive in the modern world. I will focus on statistics and machine learning, because that’s where my strengths lie, but sometimes we may find ourselves veering far off course. My primary interest lies in using computers to solve problems, and I will spend the majority of my time discussi...

875 sym 2 img

Parallelizing and cross-validating feature selection in R

29.04.2011

This is an example piece of code for the Overfitting competition at kaggle.com. This method has an AUC score of ~0.91, which is currently good enough for about 38th place on the leaderboard. If you read the competition forums closely, you will find code that is good enough to tie for 25th place, as well as hints as to how to break into...

1101 sym 2 img

Kaggle Competition Walkthrough: Introduction

03.05.2011

Kaggle is a site for participating in predictive analytics competitions. It is also a great resource for learning how to build powerful predictive models, and the Overfitting competition provides a good introduction to the common tools used by a predictive analyst. To start, you will need to download R for your platform. If you don’t live near P...

4108 sym 2 img

Kaggle Competition Walkthrough: Fitting a model

12.05.2011

Now that we’ve got the data we need into R, it is very easy to fit a model using the caret package. Caret’s workhorse function is called ‘train,’ and it allows you to fit a wide variety of models using the same syntax. Furthermore, many models have ‘hyperparameters’ that require tuning, such as the number of neighbors fo...

4039 sym 4 img

Kaggle Competition Walkthrough: Wrapup

01.06.2011

The Kaggle Don’t Overfit competition is over, and I took 11th place! Additionally, I tied with tks for contributing the most to the forum, so thanks to everyone who voted for me! I voted for tks, and I’m very happy to share the prize with him, as most of my code is based on his work. The top finishers in this competition did a...

4892 sym 4 img

Importing google news data to R

06.07.2011

I’ve been playing around lately with the stock market data available from Google Finance, through quantmod in R. Here’s a function I’ve written (which depends on the R Data Science Toolkit) to pull news stories related to a stock from Google, parse them, and save them as a data frame. Let me know what you think!

729 sym 2 img

Parallel random forests using foreach

22.07.2011

There’s been some discussion on the Kaggle forums and on a few blogs about various ways to parallelize random forests, so I thought I’d add my thoughts on the issue. Here’s my version of the ‘parRF’ function, which is based on the elegant version in the foreach vignette. This function works very simply: you pass it a vector of...

2085 sym 2 img

Forecasting recessions

09.08.2011

John Hussman has a Recession Warning Composite that I am attempting to replicate/improve. The underlying data seems to be easy enough to get from FRED using the quantmod package in R. I don’t quite understand the index Hussman is using for commercial paper, so I used the ‘3-month AA financial commercial paper index’ from FRED....

934 sym 2 img

Scraping web data in R

10.08.2011

In my last post, I went through a lot of effort to scrape the PMI index off the ISM website. It turns out that was unnecessary effort, as commenter “senne” pointed out that this index is available from FRED, with the symbol NAPM. I’ve updated my code, which now pulls all the data straight from FRED. However, it was sur...

953 sym 8 img

Using the google prediction API from R

10.08.2011

Google has a “black box” prediction API that they provide for building recommender systems or filtering spam. Furthermore, they provide an R package for interfacing with that API, but try as I might, I cannot get it to work under Windows. Here are the instructions for setting up the API to run in R under Linux. I haven’t tr...

3623 sym 8 img