Publications by Eric Cai - The Chemical Statistician

Exploratory Data Analysis – Kernel Density Estimation and Rug Plots on Ozone Data in New York and Ozonopolis

30.06.2013

For the sake of brevity, this post has been created from the second half of a previous long post on kernel density estimation.  This second half focuses on constructing kernel density plots and rug plots in R.  The first half focused on the conceptual foundations of kernel density estimation. Introduction: This post follows the recent introducti...

3663 sym R (2022 sym/2 pcs) 10 img

Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

09.07.2013

Introduction: Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on histograms, which are very useful plots for visualizing the distribution of a data set.  I will discuss how histograms are constructed and use histograms to assess the distribution of the “Ozone” data from the built-in “airquality” data...

6520 sym R (1033 sym/1 pcs) 44 img

Exploratory Data Analysis: Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

29.07.2013

Introduction: This is a follow-up post to my recent introduction of histograms.  Previously, I presented the conceptual foundations of histograms and used a histogram to approximate the distribution of the “Ozone” data from the built-in data set “airquality” in R.  Today, I will examine this distribution in more detail by overlaying the ...

6214 sym R (2099 sym/4 pcs) 34 img

Exploratory Data Analysis: The 5-Number Summary – Two Different Methods in R

12.08.2013

Introduction: Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on 5-number summaries, which were previously mentioned in the post on descriptive statistics in this series.  I will define and calculate the 5-number summary in 2 different ways that are commonly used in R.  (It turns out that different methods ...

7855 sym R (660 sym/4 pcs) 40 img
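
A minimal sketch of how two 5-number summaries can disagree in R. The excerpt does not name the two methods, so this assumes they are `fivenum()` (Tukey's hinges) and `quantile()` with its default algorithm; the vector `x` is made-up toy data, not taken from the post.

```r
# Hypothetical toy data (an assumption for illustration only).
x <- c(1, 2, 3, 4, 5, 6)

# Tukey's five-number summary: the lower and upper quartiles are "hinges".
tukey <- fivenum(x)

# Quantile-based summary using R's default quantile algorithm (type 7).
quant <- unname(quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)))

# Show both summaries side by side.
print(rbind(tukey, quant))
```

For this vector the minimum, median, and maximum agree, but the quartiles do not: the hinges are 2 and 5, while the type-7 quantiles are 2.25 and 4.75.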

Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame

19.08.2013

Introduction: Data in R are often stored in data frames, because data frames can store multiple types of data.  (In R, data frames are more general than matrices, because matrices can only store one type of data.)  Today’s post highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis. ...

4312 sym R (1095 sym/9 pcs) 4 img

Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data

22.09.2013

Introduction: Continuing my recent series on exploratory data analysis, today’s post focuses on quantile-quantile (Q-Q) plots, which are very useful plots for assessing how closely a data set fits a particular distribution.  I will discuss how Q-Q plots are constructed and use Q-Q plots to assess the distribution of the “Ozone” data from th...

8150 sym R (1765 sym/5 pcs) 22 img

Detecting an Unfair Die with Bayes’ Theorem

30.10.2013

Introduction: While reading a bioinformatics textbook, I saw an interesting problem that requires Bayes’ Theorem and some simple R programming.  I will discuss the math behind solving this problem in detail, and I will illustrate some very useful plotting functions for generating a plot in R that visualizes the solution effectively. The Problem...

4405 sym R (1669 sym/1 pcs) 60 img

Trapezoidal Integration – Conceptual Foundations and a Statistical Application in R

14.12.2013

Introduction: Today, I will begin a series of posts on numerical integration, which has a wide range of applications in many fields, including statistics.  I will introduce trapezoidal integration by discussing its conceptual foundations, write my own R function to implement trapezoidal integration, and use it to check that the Beta(2, 5) pr...

4223 sym R (1634 sym/2 pcs) 26 img
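
A rough sketch of the kind of check the excerpt above describes, not the post's own function. The name `trapezoid()` and its arguments are hypothetical; it applies the composite trapezoidal rule on an evenly spaced grid and uses base R's `dbeta()` for the Beta(2, 5) density.

```r
# Composite trapezoidal rule on n equal-width intervals over [a, b].
# Function name and signature are illustrative assumptions.
trapezoid <- function(f, a, b, n = 1000) {
  x <- seq(a, b, length.out = n + 1)  # n + 1 grid points
  y <- f(x)
  h <- (b - a) / n                    # width of each interval
  # Interior points get weight h; the two end points get weight h / 2.
  h * (sum(y) - (y[1] + y[n + 1]) / 2)
}

# Density of Beta(2, 5) on its support [0, 1].
beta_pdf <- function(x) dbeta(x, shape1 = 2, shape2 = 5)

area <- trapezoid(beta_pdf, 0, 1)
print(area)
```

With 1000 intervals the result should be very close to 1, as expected for a probability density integrated over its whole support.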

Rectangular Integration (a.k.a. The Midpoint Rule)

20.01.2014

Introduction: Continuing the recently started series on numerical integration, this post will introduce rectangular integration.  I will describe the concept behind rectangular integration, show an R function that implements it, and use it to check that the distribution actually integrates to 1 over its support set.  This post follows from my p...

5205 sym R (1799 sym/1 pcs) 54 img
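
A minimal sketch of the midpoint rule in R, under stated assumptions. The name `midpoint()` is hypothetical, and because the excerpt does not say which distribution the post checks, the Beta(2, 5) density is used here purely as a stand-in.

```r
# Midpoint (rectangular) rule on n equal-width intervals over [a, b].
# Function name and signature are illustrative assumptions.
midpoint <- function(f, a, b, n = 1000) {
  h <- (b - a) / n                     # width of each rectangle
  mids <- a + h * (seq_len(n) - 0.5)   # midpoint of each interval
  h * sum(f(mids))                     # total area of the rectangles
}

# Stand-in density (an assumption): Beta(2, 5) on its support [0, 1].
stand_in_pdf <- function(x) dbeta(x, shape1 = 2, shape2 = 5)

area_mid <- midpoint(stand_in_pdf, 0, 1)
print(area_mid)
```

Each rectangle's height is the function value at the centre of its interval, which is what distinguishes this from the left- or right-endpoint versions of rectangular integration.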

Useful Functions in R for Manipulating Text Data

27.02.2014

Introduction: In my current job, I study HIV at the genetic and biochemical levels.  Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text.  (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-...

3696 sym R (829 sym/8 pcs) 22 img