Publications by matloff
Update on Snowdoop, a MapReduce Alternative
In blog posts a few months ago, I proposed an alternative to MapReduce, e.g. to Hadoop, which I called “Snowdoop.” I pointed out that systems like Hadoop and Spark are very difficult to install and configure, are either too primitive (Hadoop) or too abstract (Spark) to program, and above all, are SLOW. Spark is of course a great improvement...
Discovered Two Great Web Sites Today
Today is my lucky day. I learned of two very interesting Web pages, both of them quite informative and the first of them rather provocative (yay!). I have some comments on both, in some cases consisting of mild disagreement, which I may post later, but in any event, I highly recommend both. Here they are: Drew Schmidt’s take on parallel co...
Macros in R
In programming, sometimes it’s useful to write a macro rather than a function. (Don’t worry if you’ve never heard the term before.) In this post, I’ll give an example of the use of macros in R, using the gtools package on CRAN. I wanted to write some utility code to help me reuse my earlier R commands during an interactive R session. Most (...
Heteroscedasticity in Regression — It Matters!
R’s main linear and nonlinear regression functions, lm() and nls(), report standard errors for parameter estimates under the assumption of homoscedasticity, a fancy word for a situation that rarely occurs in practice. The assumption is that the (conditional) variance of the response variable is the same at any set of values of the predictor var...
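To make the point about lm() concrete, here is a minimal sketch in base R. It is not taken from the post itself: the simulated data, the variable names, and the choice of the Eicker-White (HC0) sandwich estimator as the comparison are all assumptions made for illustration.

```r
# Simulate data whose conditional variance grows with the predictor
# (hypothetical example; the noise sd is proportional to x).
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Standard errors as reported by lm(), which assume constant variance.
se_classical <- sqrt(diag(vcov(fit)))

# Eicker-White (HC0) sandwich estimate, computed by hand:
# (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
X <- model.matrix(fit)
e <- resid(fit)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)   # row i of X scaled by e[i], so this is X' diag(e^2) X
se_robust <- sqrt(diag(bread %*% meat %*% bread))

print(rbind(classical = se_classical, robust = se_robust))
```

Comparing the two rows shows how far the default standard errors can drift from a variance estimate that does not assume homoscedasticity. In practice one would probably use an existing package rather than the hand-rolled formula above.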
CACM Highlights R
The Association for Computing Machinery is the main professional organization for computer science, largely for academia but still with a broad membership. ACM publishes a number of journals, most of them for research but its flagship publication is a magazine, the Communications of the ACM. The current issue of the CACM includes an article, “B...
partools: a Sensible R Package for Large Data Sets
As I mentioned recently, the new, greatly extended version of my partools package is now on CRAN. (The current version on CRAN is 1.1.3, whereas at the time of my previous announcement it was only 1.1.1. Note that Unix is NOT required.) It is my contention that for most R users who work with large data, partools — or methods like it — is a...
Partools 1.1.4
Partools 1.1.4 is now on GitHub. The main change this time is enhancement of the debugging facilities (which work not only for partools but also the cluster-based portion of R’s parallel package in general). As some of you know, I place huge importance on debugging, so much so that I wrote a book on it (The Art of Debugging with GDB, DDD, and E...
Exciting useR! 2016 Conference
The 2016 meeting of the annual useR! conference will be held in June at Stanford University. This is a fantastic venue, and we believe it may be the largest useR! meeting to date. See the above link for details!
New R Software/Methodology for Handling Missing Data
I’ve added some missing-data software to my regtools package on GitHub. In this post, I’ll give an overview of missing-data methodology, and explain what the software does. For details, see my JSM paper, jointly authored with my student Xiao (Max) Gu. There is a long history of development of techniques for handling missing data. See the fam...
Can You Say “Heteroscedasticity” 3 Times Fast?
Most books on regression analysis assume homoscedasticity, the situation in which Var(Y | X = t), for a response variable Y and vector of predictor variables X, is the same for all t. Yet, needless to say, almost all data in real life is heteroscedastic. For Y = human weight and X = height, say, we know that the assumption of homoscedasticity can...
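The definition above, Var(Y | X = t) being the same for all t, can be checked informally by estimating the conditional variance within intervals of X. The following base R sketch does this on simulated data; the numbers are a hypothetical stand-in for height and weight, not real measurements, and the binning approach is an illustration rather than the method used in the post.

```r
# Simulated stand-in for height/weight-style data (hypothetical numbers):
# the spread of y around the regression line increases with x.
set.seed(2)
n <- 2000
x <- runif(n, 150, 200)
y <- -100 + x + rnorm(n, sd = 0.1 * (x - 100))

# Remove the linear trend first, so that within-interval variation
# reflects the noise rather than the changing mean.
res <- resid(lm(y ~ x))

# Rough estimate of Var(Y | X = t): group x into 10-unit intervals and
# take the sample variance of the residuals within each interval.
bins <- cut(x, breaks = seq(150, 200, by = 10), include.lowest = TRUE)
cond_var <- tapply(res, bins, var)
print(cond_var)
```

If the data were homoscedastic, the printed variances would be roughly equal across intervals; a clear trend in them is a sign of heteroscedasticity.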