Publications by C
Iris Data Set Visualization Web App in < 100 LOC
The iris data set pops up pretty regularly in statistical literature. It consists of 50 records from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). I came across it recently while reading Introduction to Data Mining. It comes up in several places in the book to demonstrate techniques for visualization a...
9551 sym 10 img 1 tbl
GitHub Stats on Programming Languages
GitHub has become a popular site for open source developers to stash code and collaborate on projects. The following are some stats and analysis of the programming languages in use, based upon the number of users and repositories. The data was obtained from GitHub’s searches. It and the R code are available on GitHub as ...
5917 sym 10 img
Programming Language Popularity: StackOverflow and Ohloh
In the following example, programming language popularity is measured based upon two data sets. The first is the number of contributors associated with a language on ohloh.net. The second is tag usage at stackoverflow.com. SQL with no DDL. I admit it… in an age of NoSQL… I like SQL. I agree that fixed table schemas can be a real...
6460 sym 6 img
Map of Upcoming Ruby Conferences
One of the top searches on rubyflow is “conference”. A recent post showed how to create a map with the location of the 2010 R User Conference. So why not expand on the subject and create a map with numerous conference locations throughout the world? This post shows how to create a map of locations (upcoming Ruby conferences) them straig...
4602 sym 10 img 1 tbl
How Safe is Your Money?
The FDIC regularly publishes a Failed Bank List and related statistics. This post uses R to produce the charts below from the original XLS on the FDIC web site, which is formatted for human consumption. Note that 2010 data below is incomplete. The chart above is sort of a theme for the analysis. It is interesting to note that the...
7908 sym 38 img
Fractals in R
Atte Tenkanen had a blog on fractals using R for a time. Much of his source code is still available online. To produce his version of the Mandelbrot set: source('http://users.utu.fi/attenka/mandelbrot_set.R') Fractals (such as the Mandelbrot Set pictured above) are objects that display self-similarity on all scales. Fractals are mat...
1761 sym 4 img
Better than Average
NIST’s Engineering Statistics Handbook includes an Introduction to Time Series Analysis which provides a great way of demonstrating how R can be used to make such calculations. This post replicates the analysis of the data set introduced under Averaging Methods using R. As you might expect, Time Series Analysis is a broad subj...
4621 sym 2 img 1 tbl
Bot Botany – K-Means and ggplot2
So if you had a robot that was an expert at botany – would you have a bot botanist? Among other things, it would need to distinguish flowers through vision and image processing, and be able to classify various kinds of plants based upon specific characteristics. What do both of these requirements have in common? They can be done using...
4246 sym 6 img
Ah Bach…
As announced by David Smith over at Revolution Analytics, a ggplot2 Case Study Competition is on… Rather than blogging for the last few days, I cobbled together an entry. It is not a particularly mind-bending use of ggplot2, but the subject matter is relatively original. It is a brief analysis and visualization of a J.S...
537 sym 2 img 1 tbl
Elder Research Two Day Course
… or what I did on my summer vacation… Just got back from the Elder Research Two Day Course “Tools for Discovering Patterns in Data”. It was a great course that (while not R specific) provides a great overview of Data Mining tools and techniques and insight into current applications in a wide variety of industries. Dr. El...
897 sym 3 tbl