Publications by Ron Pearson (aka TheNoodleDoodler)
Measuring associations between non-numeric variables
It is often useful to know how strongly or weakly two variables are associated: do they vary together or are they essentially unrelated? In the case of numerical variables, the best-known measure of association is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century. For variables that are...
Gastwirth’s location estimator
The problem of outliers – data points that are substantially inconsistent with the majority of the other points in a dataset – arises frequently in the analysis of numerical data. The practical importance of outliers lies in the fact that even a few of these points can badly distort the results of an otherwise reasonable data analysis. Th...
David Olive’s median confidence interval
As I have discussed in a number of previous posts, the median represents a well-known and widely-used estimate of the “center” of a data sequence. Relative to the better-known mean, the primary advantage of the median is its much reduced outlier sensitivity. This post briefly describes a simple confidence interval for the median that is d...
Interestingness comparisons
In three previous posts (April 3, 2011, April 12, 2011, and May 21, 2011), I have discussed interestingness measures, which characterize the distributional heterogeneity of categorical variables. Four specific measures are discussed in Chapter 3 of Exploring Data in Engineering, the Sciences and Medicine: the Bray measure, the Gini measure...
Classifying the UCI mushrooms
In my last post, I considered the shifts in two interestingness measures as possible tools for selecting variables in classification problems. Specifically, I considered the Gini and Shannon interestingness measures applied to the 22 categorical mushroom characteristics from the UCI mushroom dataset. The proposed variable selection strategy w...
Graphical insights from the 2012 UseR! Meeting
About this time last month, I attended the 2012 UseR! Meeting. Now an annual event, this series of conferences started in Europe in 2004 as an every-other-year gathering that now seems to alternate between the U.S. and Europe. This year’s meeting was held on the Vanderbilt University campus in Nashville, TN, and it was attended by about 500...
Base versus grid graphics
In a comment in response to my latest post, Robert Young took issue with my characterization of grid as an R graphics package. Perhaps grid is better described as a “graphics support package,” but my primary point – and the main point of this post – is that to generate the display you want, it is sometimes necessary to use c...
Implementing the CountSummary Procedure
In my last post, I described and demonstrated the CountSummary procedure to be included in the ExploringData package that I am in the process of developing. This procedure generates a collection of graphical data summaries for a count data sequence, based on the distplot, Ord_plot, and Ord_estimate functions from the vcd package. The distplot...
Spacing measures: heterogeneity in numerical distributions
Numerically coded data sequences can exhibit a very wide range of distributional characteristics, including near-Gaussian (historically, the most popular working assumption), strongly asymmetric, light- or heavy-tailed, multi-modal, or discrete (e.g., count data). In addition, numerically coded values can be effectively categorical, either orde...
Characterizing a new dataset
In my last post, I promised a further examination of the spacing measures I described there, and I still promise to do that, but I am changing the order of topics slightly. So, instead of spacing measures, today’s post is about the DataframeSummary procedure to be included in the ExploringData package, which I also mentioned in my last post a...