Publications by Ron Pearson (aka TheNoodleDoodler)

The Art of Exploratory Data Analysis

22.01.2011

This blog is about the art of exploratory data analysis, which is also the subject of my new book, Exploring Data in Engineering, the Sciences, and Medicine (http://www.oup.com/us/ExploringData). This art is appropriate in situations where you are faced with an existing dataset that you want to understand better. As Stanford University stati...

Boxplots and Beyond – Part I

29.01.2011

Boxplots are a simple and reasonably popular way of summarizing the range of variation of a real-valued variable across different subsets of data. Typical examples might include diastolic blood pressure across a group of patients, broken down by gender and smoking status, or the breaking strength of material samples broken down by ...

Boxplots and Beyond – Part II: Asymmetry

06.02.2011

In my last post, I discussed boxplots in their simplest forms, illustrating some of the useful options available with the boxplot command in the open-source statistical software package R. As I noted in that post, the basic boxplot is both useful and popular, but it does have its limitations. One of those limitations is that the standard boxp...

Boxplots and Beyond III: Violin Plots

15.02.2011

This post is the third in a series of four on boxplots and closely related data visualization techniques for comparing subsets of a dataset, or comparing different datasets that we hope or expect to be similarly distributed. The previous two posts in this series have dealt with the basic boxplot, simple variations like log transformations and v...

Boxplots & Beyond IV: Beanplots

05.03.2011

This post is the last in a series of four on boxplots and some of their extensions. Previous posts in this series have discussed basic boxplots, modified boxplots based on a robust asymmetry measure, and violin plots, an alternative that essentially combines boxplots with nonparametric density estimates. This post introduces beanplots, a boxp...

Interestingness Measures

03.04.2011

Probably because I first encountered them somewhat late in my professional life, I am fascinated by categorical data types. Without question, my favorite book on the subject is Alan Agresti’s Categorical Data Analysis (Wiley Series in Probability and Statistics), which provides a well-integrated, comprehensive treatment of the analysis of cat...

Screening for predictive characteristics … and a mea culpa

12.04.2011

In my last post, I considered the UCI mushroom dataset and characterized the variables included there using four different interestingness measures. When I began drafting this post, my intention was to consider the question of how the different mushroom characteristics included in this dataset relate to each mushroom’s classification as edibl...

Measuring association using odds ratios

23.04.2011

In my last two posts, I have used the UCI mushroom dataset to illustrate two things. The first was the use of interestingness measures to characterize categorical variables, and the second was the use of binomial confidence intervals to visualize the relationship between a categorical predictor variable and a binary response variable. This ...

Computing Odds Ratios in R

07.05.2011

In my last post, I discussed the use of odds ratios to characterize the association between edibility and binary mushroom characteristics for the mushrooms characterized in the UCI mushroom dataset. I did not, however, describe those computations in detail, and the purpose of this post is to give a brief discussion of how they were done. Th...

The distribution of interestingness

21.05.2011

On April 22, David Landy posed a question about the distribution of interestingness values in response to my April 3rd post on “Interestingness Measures.” He noted that the survey paper by Hilderman and Hamilton that I cited there makes the following comment: “Our belief is that a useful measure of interestingness should generate index val...
