Publications by Christopher Bare
Using R for Introductory Statistics, Chapter 5, hypergeometric distribution
This is a little digression from Chapter 5 of Using R for Introductory Statistics that led me to the hypergeometric distribution. Question 5.13: A sample of 100 people is drawn from a population of 600,000. If it is known that 40% of the population has a specific attribute, what is the probability that 35 or fewer in the sample have that attribut...
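As a rough sketch of how Question 5.13 could be set up in R (the excerpt is cut off, so this is not necessarily the approach the post takes), base R's `phyper` gives the exact probability for sampling without replacement, and `pbinom` gives the binomial approximation. The variable names are made up for illustration.

```r
# Population of 600,000, of which 40% have the attribute.
N <- 600000
m <- round(0.40 * N)   # number in the population with the attribute
n_without <- N - m     # number in the population without it
k <- 100               # sample size

# P(X <= 35), sampling without replacement (hypergeometric)
p_hyper <- phyper(35, m, n_without, k)

# Binomial approximation: treats the 100 draws as independent with p = 0.4
p_binom <- pbinom(35, size = k, prob = 0.40)

print(c(hypergeometric = p_hyper, binomial = p_binom))
```

Because the sample is tiny relative to the population, the two answers should agree to several decimal places.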
Using R for Introductory Statistics, The Geometric distribution
We’ve already seen two discrete probability distributions, the binomial and the hypergeometric. The binomial distribution describes the number of successes in a series of independent trials with replacement. The hypergeometric distribution describes the number of successes in a series of draws without replacement, where the draws are no longer independent. Chapter 6 of Usin...
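To make the with/without replacement distinction concrete, here is a small simulation sketch. The urn of 20 successes and 30 failures, the sample size of 10, and the seed are invented values for illustration, not taken from the book.

```r
set.seed(1)

# Hypothetical urn: 20 successes (1) and 30 failures (0)
urn <- c(rep(1, 20), rep(0, 30))

# Count successes in 10 draws, repeated many times
with_repl    <- replicate(10000, sum(sample(urn, 10, replace = TRUE)))   # binomial
without_repl <- replicate(10000, sum(sample(urn, 10, replace = FALSE)))  # hypergeometric

# Both have mean near 10 * 0.4 = 4, but drawing without replacement
# shrinks the variance by the factor (N - n) / (N - 1) = 40 / 49.
print(c(mean_with = mean(with_repl), mean_without = mean(without_repl)))
print(c(var_with = var(with_repl), var_without = var(without_repl)))
```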
Using R for Introductory Statistics 6, Simulations
R can easily generate random samples from a whole library of probability distributions. We might want to do this to gain insight into the distribution’s shape and properties. A tricky aspect of statistics is that results like the central limit theorem come with caveats, such as “…for sufficiently large n…”. Getting a feel for how large ...
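One way to get that feel for "sufficiently large n" is to simulate sample means at different sample sizes. The sketch below assumes an exponential population with rate 1 and compares n = 5 with n = 100; the helper names `sample_means` and `skewness`, the sample sizes, and the seed are all illustrative choices.

```r
set.seed(42)

# Draw `reps` samples of size n from Exp(1) and return the mean of each
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Simple moment-based skewness; close to 0 for a symmetric, normal-looking shape
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

means_small <- sample_means(5)
means_large <- sample_means(100)

# The spread shrinks roughly like 1 / sqrt(n), and the skew fades as n grows
print(c(sd_n5 = sd(means_small), sd_n100 = sd(means_large)))
print(c(skew_n5 = skewness(means_small), skew_n100 = skewness(means_large)))
```

Calling `hist(means_small)` and `hist(means_large)` afterwards shows the same thing visually.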
Environments in R
One interesting thing about R is that you can get down into the insides fairly easily. You’re allowed to see more of how things are put together than in most languages. One of the ways R does this is by having first-class environments. At first glance, environments are simple enough. An environment is just a place to store variables – a set o...
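A minimal sketch of the "place to store variables" idea, using only base R functions. The names `e`, `x`, `f`, and `y` are placeholders.

```r
# Create an empty environment and store a variable in it
e <- new.env()
assign("x", 42, envir = e)

# Look the variable up again and list what the environment holds
x_value <- get("x", envir = e)
e_names <- ls(e)

# Every function call gets its own environment for local variables
f <- function() {
  y <- 1
  environment()   # return the environment of this call
}
fe <- f()
fe_names <- ls(fe)

print(x_value)
print(e_names)
print(fe_names)
```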
Drawing heatmaps in R
A while back, while reading chapter 4 of Using R for Introductory Statistics, I fooled around with the mtcars dataset giving mechanical and performance properties of cars from the early 70’s. Let’s plot this data as a hierarchically clustered heatmap.
# scale data to mean=0, sd=1 and convert to matrix
mtscaled <- as.matrix(scale(mtcars))
# c...
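The excerpt is cut off after the scaling step, so the following is only a sketch of one way to finish the job with base R's `hclust` and `heatmap`, not necessarily what the post does. The choice of Euclidean distance and the default linkage is an assumption.

```r
# Scale each column to mean 0, sd 1 (as in the excerpt)
mtscaled <- as.matrix(scale(mtcars))

# Hierarchically cluster the cars (rows) on Euclidean distance
hc <- hclust(dist(mtscaled))

# Draw the heatmap with rows ordered by that clustering;
# scale = "none" because the data are already standardized
heatmap(mtscaled, Rowv = as.dendrogram(hc), scale = "none")
```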
Notes on Engineering Data Analysis (with R and ggplot2)
Hadley Wickham gave a Google Tech Talk a couple weeks back titled Engineering Data Analysis (with R and ggplot2). These are my notes. The data analysis cycle is to iteratively transform, visualize and model. Leading into the cycle is data access and the output of the process is knowledge, insight and understanding which can be communicated to oth...
MySQL and R
Using MySQL with R is pretty easy, with RMySQL. Here are a few notes to keep me straight on a few things I always get snagged on. Typically, most folks are going to want to analyze data that’s already in a MySQL database. Being a little bass-ackwards, I often want to go the other way. One reason to do this is to do some analysis in R and make t...
String functions in R
Here’s a quick cheat-sheet on string manipulation functions in R, mostly cribbed from Quick-R’s list of String Functions with a few additional links.
substr(x, start=n1, stop=n2)
grep(pattern, x, value=FALSE, ignore.case=FALSE, fixed=FALSE)
gsub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
gregexpr(pattern, text, ignore.case=FA...
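A few quick calls show what these signatures do in practice. The input strings are made up for illustration.

```r
title <- "Using R for Introductory Statistics"

# substr: extract characters by position
seventh <- substr(title, start = 7, stop = 7)

# grep: which elements match a pattern (value = TRUE returns the strings)
matches <- grep("R", c("R", "Python", "Ruby"), value = TRUE)

# gsub: replace every occurrence of a pattern
swapped <- gsub("a", "A", "banana")

# gregexpr: start positions of all matches within a string
positions <- as.integer(gregexpr("an", "banana")[[1]])

print(seventh)
print(matches)
print(swapped)
print(positions)
```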
Hipster programming languages
If you look at the programming languages that are popular these days, a few patterns emerge. I’m not talking about languages that have the most hits on the job sites. I’m talking about what the cool kids are coding in – the folks that hang out on hacker-news or at Strange Loop. Languages like Clojure, Scala and CoffeeScript. What do these d...
International Open Data Hackathon
This past Saturday, I hung out at the Seattle branch of the International Open Data Hackathon. The event was hosted at the Pioneer Square office of Socrata, a small company that helps governments provide public open data. A pair of data analysts from Tableau were showing off a visualization for the Washington Post’s FactChecker blog called Comp...