Publications by Ron Pearson (aka TheNoodleDoodler)
The pros and cons of robust data characterizations
Over the years, I have looked at a lot of data contaminated with outliers, the subject of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine. That chapter adopts the definition of an outlier presented by Barnett and Lewis in their book Outliers in Statistical Data 2nd Edition, that outliers are “data points inconsistent wi...
Fitting mixture distributions with the R package mixtools
My last two posts have been about mixture models, with examples to illustrate what they are and how they can be useful. Further discussion and more examples can be found in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine. One important topic I haven’t covered is how to fit mixture models to datasets like the Old ...
When are averages useless?
Of all possible single-number characterizations of a data sequence, the average is probably the best known. It is also easy to compute and in favorable cases, it provides a useful characterization of “the typical value” of a sequence of numbers. It is not the only such “typical value,” however, nor is it always the most useful one: tw...
Some Additional Thoughts on Useless Averages
In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions. The post generated three interesting comments that I want to respond to here. Fir...
The Long Tail of the Pareto Distribution
In my last two posts, I have discussed cases where the mean is of little or no use as a data characterization. One of the specific examples I discussed last time was the case of the Pareto type I distribution, for which the density is given by: $p(x) = a k^{a} / x^{a+1}$, defined for all x > k, where k and a are...
Is the “Long Tail” a Useless Concept?
In response to my last post, “The Long Tail of the Pareto Distribution,” Neil Gunther had the following comment: “Unfortunately, you’ve fallen into the trap of using the ‘long tail’ misnomer. If you think about it, it can’t possibly be the length of the tail that sets distributions like Pareto and Zipf apart; e...
The Zipf and Zipf-Mandelbrot distributions
In my last few posts, I have been discussing some of the consequences of the slow decay rate of the tail of the Pareto type I distribution, along with some other, closely related notions, all in the context of continuously distributed data. Today’s post considers the Zipf distribution for discrete data, which has come to be extremely popular ...
Harmonic means, reciprocals, and ratios of random variables
In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than standard distributions like the Gaussian. For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generall...
Cleaning time-series and other data streams
The need to analyze time-series or other forms of streaming data arises frequently in many different application areas. Examples include economic time-series like stock prices, exchange rates, or unemployment figures, biomedical data sequences like electrocardiograms or electroencephalograms, or industrial process operating data sequences like ...
Moving window filters and the pracma package
In my last post, I discussed the Hampel filter, a useful moving window nonlinear data cleaning filter that is available in the R package pracma. In this post, I briefly discuss this moving window filter in a little more detail, focusing on two important practical points: the choice of the filter’s local outlier detection threshold, and the qu...