Publications by Max Kuhn
UseR! Slides for “Classification Using C5.0”
I’ve had a lot of requests, so here they are. Hopefully, all of the slides will be posted on the conference website. Related To leave a comment for the author, please follow the link and comment on their blog: Blog - Applied Predictive Modeling. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and man...
534 sym
Equivocal Zones
In Chapter 11, equivocal zones were briefly discussed. The idea is that some classification errors are close to the probability boundary (i.e., 50% for two-class outcomes). If this is the case, we can create a zone where the samples are predicted as “equivocal” or “indeterminate” instead of one of the class levels. This only works if th...
2033 sym R (2940 sym/7 pcs) 2 img
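The equivocal-zone idea described in the excerpt above can be sketched in a few lines. This is a minimal illustration in Python, not the R code from the post: the function name, the `half_width` parameter, its default of 0.10, and the class labels are all hypothetical choices made for the example.

```python
def classify_with_equivocal_zone(prob_first_class, half_width=0.10,
                                 labels=("Class1", "Class2")):
    """Label one two-class prediction, allowing an "equivocal" outcome.

    prob_first_class: predicted probability of the first class (0 to 1).
    half_width: hypothetical zone half-width around the 50% boundary;
                probabilities strictly inside 0.5 +/- half_width are
                reported as "equivocal" rather than as a class.
    """
    if not 0.0 <= prob_first_class <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Predictions close to the 50% boundary are not assigned to a class.
    if abs(prob_first_class - 0.5) < half_width:
        return "equivocal"
    return labels[0] if prob_first_class > 0.5 else labels[1]


# Made-up probabilities, purely for illustration.
probs = [0.97, 0.58, 0.45, 0.12]
predictions = [classify_with_equivocal_zone(p) for p in probs]
print(predictions)
```

With these made-up inputs, the two probabilities near 50% fall inside the zone and are withheld, while the confident ones keep their class labels. Widening `half_width` trades a higher rate of equivocal calls for fewer errors among the samples that are still classified.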
The Basics of Encoding Categorical Data for Predictive Models
Thomas Yokota asked a very straightforward question about encodings for categorical predictors: “Is it bad to feed it non-numerical data such as factors?” As usual, I will try to make my answer as complex as possible. (I’ve heard the old wives’ tale that Eskimos have 180 different words in their language for snow. I’m starting to think th...
5224 sym R (983 sym/5 pcs) 2 img
Down-Sampling Using Random Forests
We discuss dealing with large class imbalances in Chapter 16. One approach is to sample the training set to coerce a more balanced class distribution. We discuss: down-sampling: sample the majority class to make their frequencies closer to the rarest class; up-sampling: the minority class is resampled to increase the corresponding frequencies; hyb...
3929 sym R (3078 sym/8 pcs) 4 img
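The down-sampling and up-sampling definitions in the excerpt above can be illustrated with a small standard-library sketch. It is written in Python and is not the post's R code; the helper names, the fixed seed, and the 90/10 toy data are assumptions made for the example, and it does not involve random forests.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(rows, labels):
    """Collect rows into one list per class label (hypothetical helper)."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def down_sample(rows, labels, seed=0):
    """Sample every class, without replacement, down to the rarest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


def up_sample(rows, labels, seed=0):
    """Resample smaller classes, with replacement, up to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        # Keep the original rows and add random duplicates to reach the target.
        extra = rng.choices(members, k=target - len(members))
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy training set: 90 majority samples and 10 minority samples.
rows = list(range(100))
labels = ["majority"] * 90 + ["minority"] * 10

down_rows, down_labels = down_sample(rows, labels)
up_rows, up_labels = up_sample(rows, labels)
print(Counter(down_labels))
print(Counter(up_labels))
```

Down-sampling discards majority-class rows, so the balanced set here has 10 rows per class; up-sampling keeps every row and repeats minority rows, giving 90 per class.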
Calibration Affirmation
In the book, we discuss the notion of a probability model being “well calibrated”. There are many different mathematical techniques that classification models use to produce class probabilities. Some of these values are “probability-like” in that they are between zero and one and sum to one. This doesn’t necessarily mean that the probability ...
3504 sym R (2160 sym/5 pcs) 8 img
caret webinar on Feb 25
I’ll be doing a webinar with the Orange County R User Group on the caret package on Tue, Feb 25, 2014 1:00 PM – 2:00 PM EST. Here is the URL in case you are interested: https://www3.gotomeeting.com/register/673845982 Thanks to Ray DiGiacomo for setting this up.
676 sym
Optimizing Probability Thresholds for Class Imbalances
One of the toughest problems in predictive modeling occurs when the classes have a severe imbalance. We spend an entire chapter on this subject. One consequence of this is that the performance is generally very biased against the class with the smallest frequencies. For example, if the data have a majority of samples belonging to the first cl...
5619 sym R (2227 sym/7 pcs) 4 img
caret webinar materials
The webinar was recorded (thanks to Ray DiGiacomo and the Orange County RUG). The slides are here minus a few typos.
534 sym
Bay Area RUG Talk on 3/17 (updated)
I’m making my yearly pilgrimage to San Francisco to teach at PAW. I’ll also be giving a short talk at the Bay Area R Users Group on model tags in the caret package and the code that produced this interactive plot. It is at 7:00 PM on Monday, March 17th at the San Francisco Marriott Marquis. The slide deck is here.
728 sym
Bay Area RUG Talk on 3/17
I’m making my yearly pilgrimage to San Francisco to teach at PAW. I’ll also be giving a short talk at the Bay Area R Users Group on model tags in the caret package and the code that produced this interactive plot. It is at 7:00 PM on Monday, March 17th at the San Francisco Marriott Marquis.
707 sym