Publications by Max Kuhn
Reproducible Research at ENAR
I gave a talk at the Spring ENAR meetings this morning on some of the technical aspects of creating the book. The session was on reproducible research and the slides are here. I was dinged for not using Git for version control (we used Dropbox for simplicity), but overall the comments were good. There was a small panel at the end for answering que...
Benchmarking Machine Learning Models Using Simulation
What is the objective of most data analysis? One way I think about it is that we are trying to discover or approximate what is really going on in our data (and, in general, nature). However, I occasionally run into people who think that if one model fulfills our expectations (e.g. a higher number of significant p-values or accuracy) then it must be bett...
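The appeal of simulation is that the true signal is known, so models can be scored against the noiseless truth rather than against noisy observed outcomes. A minimal base-R sketch of that idea (not the post's actual code, and a deliberately simple linear setting):

```r
# Simulate data where the true function is known: only two of five
# predictors carry signal.
set.seed(101)
n <- 200
x <- matrix(runif(n * 5), ncol = 5)
truth <- 2 * x[, 1] - 3 * x[, 2]
y <- truth + rnorm(n, sd = 0.5)
dat <- data.frame(y = y, x)          # columns become X1 ... X5

# Two candidate models: all predictors vs. only the informative ones
fit_full  <- lm(y ~ ., data = dat)
fit_small <- lm(y ~ X1 + X2, data = dat)

# Because the data are simulated, each model can be scored against the
# *noiseless* truth on fresh data -- impossible with real data.
x_new <- matrix(runif(1000 * 5), ncol = 5)
colnames(x_new) <- paste0("X", 1:5)
truth_new <- 2 * x_new[, 1] - 3 * x_new[, 2]
new_dat <- as.data.frame(x_new)

rmse <- function(fit) sqrt(mean((predict(fit, new_dat) - truth_new)^2))
c(full = rmse(fit_full), small = rmse(fit_small))
```

Repeating the simulation many times would give a distribution of these error estimates for each model, which is the basis for a proper benchmark.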
Feature Selection Strikes Back (Part 1)
In the feature selection chapter, we describe several search procedures (“wrappers”) that can be used to optimize the number of predictors. Some techniques were described in more detail than others. Although we do describe genetic algorithms and how they can be used for reducing the dimensions of the data, this is the first of a series of blog ...
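The wrapper idea can be illustrated with a toy genetic algorithm in base R. This is a simplified sketch, not the GA package code used in the post: subsets are encoded as bit strings, fitness is the (negated) AIC of a linear model on the selected predictors, and better subsets are bred via tournament selection, single-point crossover, and bit-flip mutation.

```r
set.seed(42)
n <- 150; p <- 10
x <- matrix(rnorm(n * p), ncol = p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)   # only predictors 1 and 2 matter
dat <- data.frame(y = y, x)

# Fitness: lower AIC of a linear model on the selected subset is better
fitness <- function(bits) {
  if (!any(bits)) return(-Inf)
  vars <- paste0("X", which(bits == 1), collapse = " + ")
  -AIC(lm(as.formula(paste("y ~", vars)), data = dat))
}

pop_size <- 20; generations <- 15; mut_rate <- 0.1
pop <- matrix(rbinom(pop_size * p, 1, 0.5), nrow = pop_size)

for (g in seq_len(generations)) {
  scores <- apply(pop, 1, fitness)
  # Tournament selection: take the better of two random parents
  pick <- function() {
    i <- sample(pop_size, 2)
    pop[i[which.max(scores[i])], ]
  }
  pop <- t(replicate(pop_size, {
    p1 <- pick(); p2 <- pick()
    cut <- sample(p - 1, 1)            # single-point crossover
    child <- c(p1[1:cut], p2[(cut + 1):p])
    flip <- runif(p) < mut_rate        # bit-flip mutation
    child[flip] <- 1 - child[flip]
    child
  }))
}
best <- pop[which.max(apply(pop, 1, fitness)), ]
which(best == 1)   # indices of the selected predictors
```

In practice the fitness function would be a resampled estimate of model performance (as in the post), not an in-sample AIC, but the search mechanics are the same.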
Feature Selection 2 – Genetic Boogaloo
Previously, I talked about genetic algorithms (GA) for feature selection and illustrated the algorithm using a modified version of the GA R package and simulated data. The data were simulated with 200 non-informative predictors and 12 linear effects and three non-linear effects. Quadratic discriminant analysis (QDA) was used to model the data. Th...
Projection Pursuit Classification Trees
I’ve been looking at this article for a new tree-based method. It uses other classification methods (e.g. LDA) to find a single variable to use in the split and builds a tree in that manner. The subtleties of the model are:
- The model does not prune but keeps splitting until achieving purity
- With more than two classes, it treats the data as a two-...
Recent Changes to caret
Here is a summary of some recent changes to caret. Feature updates:
- train was updated to utilize recent changes in the gbm package that allow for boosting with three or more classes (via the multinomial distribution)
- The Yeo-Johnson power transformation was added. This is very similar to the Box-Cox transformation, but it does not require the d...
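The key property of the Yeo-Johnson transformation is that, unlike Box-Cox, it is defined for zero and negative values. A base-R sketch of the transformation itself (in caret it is applied via `preProcess(..., method = "YeoJohnson")`, which also estimates lambda for you):

```r
# Yeo-Johnson transformation for a fixed lambda.
# For y >= 0 it matches a shifted Box-Cox; for y < 0 it applies the
# analogous power transform with exponent (2 - lambda).
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log(-y[!pos] + 1)
  }
  out
}

# Works on data containing zeros and negatives, where Box-Cox would fail:
yeo_johnson(c(-2, 0, 3), lambda = 0.5)
```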
Feature Selection 3 – Swarm Mentality
“Bees don’t swarm in a mango grove for nothing. Where can you see a wisp of smoke without a fire?” – Hla Stavhana In the last two posts, genetic algorithms were used as feature wrappers to search for more effective subsets of predictors. Here, I will do the same with another type of search algorithm: particle swarm optimization. Like gen...
type = “what?”
One great thing about R is that it has a wide diversity of packages written by many different people with many different viewpoints on how software should be designed. However, this does tend to bite us periodically. When I teach newcomers about R and predictive modeling, I have a slide that illustrates one of the weaknesses of this system: heteroge...
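The inconsistency shows up even in base R: the meaning of predict()'s `type` argument depends entirely on the fitting function. For a logistic regression from glm(), class probabilities come from `type = "response"`, while many other packages use `type = "prob"` or `type = "raw"` for the same quantity. A small illustration:

```r
# Logistic regression on a built-in data set
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

head(predict(fit, type = "link"))       # linear predictor (log-odds)
head(predict(fit, type = "response"))   # class probabilities in [0, 1]
```

Every modeling package picks its own vocabulary here, which is exactly the kind of heterogeneity the slide in the post illustrates.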
Measuring Associations
In Chapter 18, we discuss a relatively new method for measuring predictor importance called the maximal information coefficient (MIC). The original paper is by Reshef et al. (2011). Summaries of the initial reactions to the MIC are by Speed and Tibshirani (and others can be found here). My (minor) beef with it is the lack of a probabilistic motivati...
UseR! 2013 Highlights
The conference was excellent this year. My highlights:
- Bojan Mihaljevic gave a great presentation on machine learning models built from network models. Their package isn’t on CRAN yet, but I’m really looking forward to it.
- Jim Harner’s presentation on K-NN models with feature selection was also very interesting, especially the computation...