Publications by matloff
More on the Heteroscedasticity Issue
In my last post, I discussed R software, including mine, that handles heteroscedastic settings for linear and nonlinear regression models. Several readers had interesting comments and questions, which I will address here. To review: Though most books and software assume homoscedasticity, i.e. constancy of the variance of the response variable at ...
Unbalanced Data Is a Problem? No, BALANCED Data Is Worse
Say we are doing classification analysis with classes labeled 0 through m-1. Let Ni be the number of observations in class i. There is much handwringing in the machine learning literature over situations in which there is a wide variation among the Ni. I will argue here, though, that the problem is much worse in the case in which there is — art...
A New Method for Statistical Disclosure Limitation, I
The Statistical Disclosure Limitation (SDL) problem involves modifying a data set in such a manner that statistical analysis on the modified data is reasonably close to that performed on the original data, while preserving the privacy of individuals in the data set. For instance, we might have a medical data set on which we want to allow research...
Partools, Recommender Systems and More
Recently I attended a talk by Stanford’s Art Owen, presenting work done with his student, Katelyn Gao. This talk touched on a number of my interests, both mathematical and computational. What particularly struck me was that Art and Katelyn are applying a very old — many would say very boring — method to a very modern, trendy application: re...
Back to the BLAS Issue
A few days ago, I wrote here about how some researchers, such as Art Owen and Katelyn Gao at Stanford and Patrick Perry at NYU, have been using an old, old statistical technique (random effects models) for a new, new application (recommender systems). In addition to describing their approach to that problem, I also used this setting as an ex...
OVA vs. AVA in Classification Problems, via regtools
OVA and AVA? Huh? These stand for One vs. All and All vs. All, in classification problems with more than 2 classes. To illustrate the idea, I’ll use the UCI Vertebral Column data and Letter Recognition Data, and analyze them using my regtools package. As some of you know, I’m developing the latter in conjunction with a book I’m writing on ...
The Method of Boosting
One of the techniques that has caused the most excitement in the machine learning community is boosting, which in essence is a process of iterative refinement, e.g. by reweighting, of estimated regression and classification functions (though it has primarily been applied to the latter), in order to improve predictive ability. Much has been made o...
The Generalized Method of Moments and the gmm package
An almost-as-famous alternative to the famous Maximum Likelihood Estimation is the Method of Moments. MM has always been a favorite of mine because it often requires fewer distributional assumptions than MLE, and also because MM is much easier to explain than MLE to students and consulting clients. CRAN has a package gmm that does MM, actually th...
Some Comments on Donoho’s “50 Years of Data Science”
An old friend recently called my attention to a thoughtful essay by Stanford statistics professor David Donoho, titled “50 Years of Data Science.” Given the keen interest these days in data science, the essay is quite timely. The work clearly shows that Donoho is not only a grandmaster theoretician, but also a statistical philosopher. The pap...
50% Draft of Forthcoming Book Available
As I’ve mentioned here a couple of times, I am in the midst of writing a book, From Linear Models to Machine Learning: Regression and Classification, with Examples in R. As has been my practice with past books, I have now placed a 50% rough draft of the book on the Web. You will see even from this partial version that I take a very different ap...