Publications by Max Kuhn

Reproducible Research at ENAR

11.03.2013

I gave a talk at the Spring ENAR meetings this morning on some of the technical aspects of creating the book. The session was on reproducible research and the slides are here. I was dinged for not using Git for version control (we used Dropbox for simplicity), but overall the comments were good. There was a small panel at the end for answering que...

975 sym

Benchmarking Machine Learning Models Using Simulation

13.04.2013

What is the objective of most data analysis? One way I think about it is that we are trying to discover or approximate what is really going on in our data (and in general, nature). However, I occasionally run into people who think that if one model fulfills our expectations (e.g. higher number of significant p-values or accuracy) then it must be bett...

5744 sym R (5525 sym/13 pcs) 4 img

Feature Selection Strikes Back (Part 1)

29.04.2013

In the feature selection chapter, we describe several search procedures (“wrappers”) that can be used to optimize the number of predictors. Some techniques were described in more detail than others. Although we do describe genetic algorithms and how they can be used for reducing the dimensions of the data, this is the first of a series of blog ...

12005 sym R (4727 sym/7 pcs) 16 img

Feature Selection 2 – Genetic Boogaloo

08.05.2013

Previously, I talked about genetic algorithms (GA) for feature selection and illustrated the algorithm using a modified version of the GA R package and simulated data. The data were simulated with 200 non-informative predictors, 12 linear effects, and three non-linear effects. Quadratic discriminant analysis (QDA) was used to model the data. Th...

11117 sym R (6197 sym/9 pcs) 18 img

Projection Pursuit Classification Trees

14.05.2013

I’ve been looking at this article for a new tree-based method. It uses other classification methods (e.g. LDA) to find a single variable to use in the split and builds a tree in that manner. The subtleties of the model are: the model does not prune but keeps splitting until achieving purity; with more than two classes, it treats the data as a two-...

3704 sym

Recent Changes to caret

18.05.2013

Here is a summary of some recent changes to caret. Feature updates: train was updated to utilize recent changes in the gbm package that allow for boosting with three or more classes (via the multinomial distribution). The Yeo-Johnson power transformation was added. This is very similar to the Box-Cox transformation, but it does not require the d...

1590 sym

Feature Selection 3 – Swarm Mentality

06.06.2013

“Bees don’t swarm in a mango grove for nothing. Where can you see a wisp of smoke without a fire?” – Hla Stavhana

In the last two posts, genetic algorithms were used as feature wrappers to search for more effective subsets of predictors. Here, I will do the same with another type of search algorithm: particle swarm optimization. Like gen...

6185 sym R (3957 sym/5 pcs) 10 img

type = “what?”

13.06.2013

One great thing about R is that it has a wide diversity of packages written by many different people with many different viewpoints on how software should be designed. However, this does tend to bite us periodically. When I teach newcomers about R and predictive modeling, I have a slide that illustrates one of the weaknesses of this system: heteroge...

2543 sym 4 img

Measuring Associations

20.06.2013

In Chapter 18, we discuss a relatively new method for measuring predictor importance called the maximal information coefficient (MIC). The original paper is by Reshef et al. (2011). Initial reactions to the MIC include those of Speed and Tibshirani (others can be found here). My (minor) beef with it is the lack of a probabilistic motivati...

5015 sym R (1746 sym/5 pcs) 4 img

UseR! 2013 Highlights

13.07.2013

The conference was excellent this year. My highlights: Bojan Mihaljevic gave a great presentation on machine learning models built from network models. Their package isn’t on CRAN yet, but I’m really looking forward to it. Jim Harner’s presentation on K-NN models with feature selection was also very interesting, especially the computation...

2082 sym