Publications by matloff
The “Secret Sauce” Used in Many qeML Functions
In writing an R package, it is often useful to build up some function call in string form, then “execute” the string. To give a really simple example:

> s <- '1+1'
> eval(parse(text=s))
[1] 2

Quite a lot of trouble to go to just to find that 1+1 = 2? Yes, but this trick can be extremely useful, as we’ll see here.

data(svcensus)
z <- qePCA(svc...
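As a rough illustration of the string-building idea in the excerpt above, here is a minimal sketch in base R. The helper name `callByName` and its arguments are hypothetical, made up for this example; this is not code from qeML, only one way the `eval(parse(text=...))` pattern might be wrapped.

```r
# Hypothetical helper (not part of qeML): paste together a call such as
# "mean(x)" in string form, then parse and evaluate it in the caller's frame.
callByName <- function(fname, argString, env = parent.frame()) {
  cmd <- paste0(fname, '(', argString, ')')  # e.g. 'mean(x)'
  eval(parse(text = cmd), envir = env)
}

x <- c(1, 2, 3, 100)
callByName('mean', 'x')    # 26.5
callByName('median', 'x')  # 2.5
```

The point of the pattern is that the function name arrives as data, so a single wrapper can dispatch to many different functions without a separate code path for each.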
qeML Example: Issues of Overfitting, Dimension Reduction Etc.
What about variable selection? Which predictor variables/features should we use? No matter what anyone tells you, this is an unsolved problem. But there are lots of useful methods. See the qeML vignettes on feature selection and overfitting for detailed background on the issues involved. We note at the outset what our concluding statement will be: ...
New Package, New Book!
Sorry I haven’t been very active on this blog lately, but now that I have more time, that will change. I’ve got myriad things to say. To begin with, then, I’ll announce a major new R package, and my new book. qeML package (“quick and easy machine learning”) Featured aspects: Now on CRAN, https://cran.r-project.org/package=qeML. See GitH...
Just How Good Is ChatGPT in Data Science?
Many of you may have heard of ChatGPT, a dazzling new AI tool. We are hearing lots of gushing praise for the tool. Well, how well does it do in data science contexts? I tried a few queries here, and found interesting results. I first requested, “Write an R function that returns every other element of a vector x, starting with the third.” I wo...
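For reference, the quoted request can be met in a few lines of base R. The sketch below is one possible answer written for this listing; it is not ChatGPT's output and not the author's solution, and the function name `everyOtherFromThird` is made up here.

```r
# One possible answer to the quoted request (name is hypothetical):
# return elements 3, 5, 7, ... of x.
everyOtherFromThird <- function(x) {
  # seq(3, n, by = 2) fails when n < 3, so handle short vectors first
  if (length(x) < 3) return(x[0])
  x[seq(3, length(x), by = 2)]
}

everyOtherFromThird(c(5, 12, 13, 8, 88, 6))  # 13 88
```

The guard for short vectors is the kind of edge case worth checking in any generated answer.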
New Statistics Tutorial
I’ve recently completed fastStat, https://github.com/matloff/fastStat, a quick introduction to statistics for those who’ve had a calculus-based probability course. Many such people later need to do statistics, and this will give them quick access. It is modeled after my R tutorial, https://github.com/matloff/fasteR, a quick introduction to R....
A New Approach to Fairness in Machine Learning
During the last year or so, I’ve been quite interested in the issue of fairness in machine learning. This area is more personal for me, as it is the confluence of several interests of mine: My lifelong activity in probability theory, math stat and stat methodology (in which I include ML). My lifelong activism aimed at achieving social justice. My...
Base-R and Tidyverse Code, Side-by-Side
I have a new short writeup, showing common R design patterns, implemented side-by-side in base-R and Tidy. As readers of this blog know, I strongly believe that Tidy is a poor tool for teaching R learners who have no coding background. Relative to learning in a base-R environment, learners using Tidy take longer to become proficient, and once pro...
Base-R Is Alive and Well
As many readers of this blog know, I strongly believe that R learners should be taught base-R, not the tidyverse. Eventually the students may settle on using a mix of the two paradigms, but at the learning stage they will benefit from the fact that base-R is simpler and more powerful. I’ve written my thoughts in a detailed essay. One of the most...
Valuable Webinar in Parallel Computing in R
George Ostrouchov, one of R’s top parallel computing experts, will run a workshop on cluster parallel computing in R next week. BTW, even a multicore laptop is a “cluster,” so anyone can apply this material to their own work even if they don’t have access to a larger multimachine cluster.
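To make the “laptop as cluster” remark concrete, here is a minimal sketch using the `parallel` package that ships with R. This is an assumption about tooling made for illustration only; the workshop itself may well use different packages. The worker count of 2 is arbitrary.

```r
library(parallel)

# Start 2 worker processes on the local machine; to R they form a "cluster"
cls <- makeCluster(2)

# Farm out a simple computation across the workers
res <- parSapply(cls, 1:8, function(i) i^2)

# Always shut the workers down when finished
stopCluster(cls)

res  # 1 4 9 16 25 36 49 64
```

The same code pattern carries over to a multimachine setting by changing how the cluster is created, which is why practicing on a laptop is worthwhile.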
Use of Differential Privacy in the US Census–All for Nothing?
The field of data privacy has long been of broad interest. In a medical database, for instance, how can administrators enable statistical analysis by medical researchers, while at the same time protecting the privacy of individual patients? Over the years, many methods have been proposed and used. I’ve done some work in the area myself. But in ...