Publications by Andrew Landgraf

Empirical Bayes Estimation of On Base Percentage

30.12.2010

I guess you could call this On Bayes Percentage. *cough* Fresh off learning Bayesian techniques in one of my classes last quarter, I thought it would be fun to try to apply the method. I was able to find some examples of Hierarchical Bayes being used to analyze baseball data at Wharton. Setting up the problem: On base percentage (OBP) is...

4765 sym 6 img 2 tbl

Unsupervised Image Segmentation with Spectral Clustering with R

12.02.2012

That title is quite a mouthful. This quarter, I have been reading papers on Spectral Clustering for a reading group. The basic goal of clustering is to find groups of data points that are similar to each other. Also, data points in one group should be dissimilar to data in other clusters. This way you can summarize your data by saying...

8213 sym 14 img
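
As a rough illustration of the clustering goal described in the teaser above (points within a group similar, points across groups dissimilar), here is a minimal sketch. It is not spectral clustering and not the post's code; the toy points, the two labelings, and the helper name `mean_within_and_between` are all made up for illustration.

```python
from itertools import combinations

def mean_within_and_between(points, labels):
    """Return (mean distance within groups, mean distance between groups)."""
    within, between = [], []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        distance = abs(p - q)
        if labels[i] == labels[j]:
            within.append(distance)
        else:
            between.append(distance)
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical 1-D data with two visible clumps.
points = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
good_labels = [0, 0, 0, 1, 1, 1]  # follows the clumps
bad_labels = [0, 1, 0, 1, 0, 1]   # ignores the clumps

w_good, b_good = mean_within_and_between(points, good_labels)
w_bad, b_bad = mean_within_and_between(points, bad_labels)
```

A labeling that matches the clumps gives a small within-group distance relative to the between-group distance; the alternating labeling does not.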

Visualizing the Correlations of a Matrix

17.02.2012

Correlation matrices are a common way to look at the dependence of a set of variables. When the variables have spatial relationships, the correlation matrix loses some information. Let's say you have repeated observations, each one being a matrix. For example, you could have yearly observations of health statistics for a spatial grid. L...

3282 sym 4 img

What’s Up with Albert Pujols?

05.05.2012

After signing a huge deal with the Angels, Pujols has been having a really bad year. He hasn’t hit a home run this year, breaking a career-long streak. So I thought it would be a good idea to use some statistics to tell how good or bad we think Pujols will actually be this year. Coming into the year, he had a career .328/.420/.617 ca...

2846 sym 4 img 1 tbl

Cleveland Indians’ Attendance

20.05.2012

Recently, Chris Perez, the closer for the Indians, displayed some frustration with the fans for not supporting the team. Currently, they have the lowest attendance in the majors — by a decent margin. The Indians are averaging about 15,000 fans per home game, while the next closest team, the Oakland A’s, is averaging 19,000. It see...

2972 sym R (153 sym/1 pcs) 2 img

Sending a Text in R

25.05.2012

Don’t you hate it when you are running a long piece of code and you keep checking the results every 15 minutes, hoping it will finish? There is a better way. I got the idea from here. He uses a Python script and the text interface is not free. I thought someone must have already thought of this for R. There is an easy solution. You c...

1158 sym R (182 sym/1 pcs) 2 img

Space Time Swing Probability Plot for Ichiro

30.05.2012

I was having some fun with PITCHf/x data and generalized additive models. PITCHf/x keeps track of the trajectory, path, and location of every pitch in the MLB. It is pretty accurate and opens up baseball to more analyses than ever before. Generalized additive models (GAMs) are statistical models that put minimal assumptions on the type of ...

2088 sym

Rounding in R

15.06.2012

Forgive me if you are already aware of this, but I found it quite alarming. I know that most code is interpreted by the computer in binary and we input in decimal, so problems can arise in conversion and with floating point. But the example I have below is so simple that it really surprised me. I was converting a function from R into M...

2031 sym R (147 sym/1 pcs)
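
The decimal-to-binary issue mentioned in the teaser above can be seen in a few lines. This is a minimal sketch in Python, not the post's R example (which is not shown in this listing); the specific numbers are chosen only for illustration.

```python
from decimal import Decimal
import math

# 0.1 cannot be stored exactly in binary floating point.
# Decimal(0.1) exposes the value the machine actually holds.
stored_tenth = Decimal(0.1)

# Because of that, a sum that is exact in decimal is not exact in binary.
sum_is_exact = (0.1 + 0.2 == 0.3)

# Comparing with a tolerance is the usual workaround.
sum_is_close = math.isclose(0.1 + 0.2, 0.3)

# 2.675 is stored slightly below 2.675, so rounding to two places goes down.
rounded = round(2.675, 2)

print(stored_tenth)
print(sum_is_exact, sum_is_close, rounded)
```

The same representation issue exists in any language that uses binary floating point, although the rounding rules applied on top of it differ between languages.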

Random Forest Variable Importance

19.07.2012

Random forests™ are great. They are one of the best “black-box” supervised learning methods. If you have lots of data and lots of predictor variables, you can do worse than random forests. They can deal with messy, real data. If there are lots of extraneous predictors, they have no problem. It automatically does a good job of finding interact...

5610 sym R (458 sym/1 pcs)

Finding the Best Subset of a GAM using Tabu Search and Visualizing It in R

24.08.2012

Finding the best subset of variables for a regression is a very common task in statistics and machine learning. There are statistical methods based on asymptotic normal theory that can help you decide whether to add or remove a variable at a time. The problem with this is that it is a greedy approach and you can easily get stuck in a ...

4458 sym R (2633 sym/4 pcs) 4 img
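
To make the teaser's point about greedy one-variable-at-a-time selection concrete, here is a minimal sketch. It is not tabu search, not a GAM, and not the post's code: the three predictors `a`, `b`, `c` and every score in `SCORES` are invented for illustration (higher is better).

```python
# Made-up scores for every subset of three hypothetical predictors.
SCORES = {
    frozenset(): 0.0,
    frozenset("a"): 3.0,
    frozenset("b"): 2.0,
    frozenset("c"): 1.0,
    frozenset("ab"): 3.5,
    frozenset("ac"): 3.2,
    frozenset("bc"): 10.0,
    frozenset("abc"): 3.0,
}
VARIABLES = "abc"

def greedy_stepwise(scores, variables):
    """Add or remove one variable at a time while the score improves."""
    current = frozenset()
    while True:
        # Each neighbor differs from the current subset by exactly one variable.
        neighbors = [current ^ {v} for v in variables]
        best = max(neighbors, key=scores.get)
        if scores[best] <= scores[current]:
            return current  # no single add or remove helps
        current = best

def exhaustive(scores):
    """Check every subset; only feasible for a handful of variables."""
    return max(scores, key=scores.get)

greedy_result = greedy_stepwise(SCORES, VARIABLES)
best_result = exhaustive(SCORES)
print(sorted(greedy_result), sorted(best_result))
```

With these invented scores the greedy search stops at a subset that no single addition or removal improves, while checking all subsets finds a better one. Exhaustive search grows as 2^p in the number of predictors, which is why a smarter search is attractive.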