Publications by David Smith
Homer, not Bart, is the star of the Simpsons
It's been a long time since I watched The Simpsons, but I was always under the impression that Bart was the primary character. Perhaps it was all the “Do the Bartman” and “Cowabunga!” nonsense from the 90s. Anyway, data scientist Todd W. Schneider used R to analyze the scripts of the first 26 seasons and found that Homer speaks twice as much...
Statcheck: an R package to check statistical results in psychology papers
The results of many scientific papers are wrong. There are many reasons for this, including p-hacking, publication bias, and the general inability to replicate results. But there's another, more mundane cause: incorrect calculation of p-values in statistical tests. This could be caused by simple transcription errors when plugging numbers into...
Import data to R from SAS, SPSS and Stata with Haven
Regardless of the tool you use to analyse data, you'll often have to access data living in file formats generated by other tools. The “haven” package from RStudio allows you to import and export data in SAS, SPSS and Stata formats. Version 1.0 was released on October 4, and is now available on CRAN. Haven is also installed as part of the ...
In case you missed it: September 2016 roundup
In case you missed them, here are some articles from September of particular interest to R users. The R-Ladies meetups and the Women in R Taskforce support gender diversity in the R community. Highlights from the Microsoft Data Science Summit include recordings of many presentations about R, and the keynote “The Future of Data Analysis” by ...
Make ggplot2 graphics interactive with ggiraph
R's ggplot2 package is a well-known tool for producing beautiful static data visualizations that you can include in a printed report. But what if you want to include a ggplot2 graphic on a webpage and let the user interact with the data? That is the purpose of the ggiraph package by David Gohel (available for installation via CRAN). With ggirap...
Watch the world warm with this animated globe, created with R
Due to anthropogenic climate change, the average global temperature has increased steadily over the past decade or so. While we're all familiar with the hockey-stick line chart of rising temperature, the change is even more dramatic on this animated globe showing the local effects of climate change. The first half of the animation shows the mon...
Tutorial: Scalable R on Spark with SparkR, sparklyr and RevoScaleR
If you'd like to manipulate and analyze very large data sets with the R language, one option is to use R and Apache Spark together. R provides the simple, data-oriented language for specifying transformations and models; Spark provides the storage and computation engine to handle data much larger than R alone can handle. At the KDD 2016 conferenc...
Upcoming Practical Data Science courses in London, Chicago, Zurich, Oslo and Stockholm
If you'd like to learn how to run R within Azure Machine Learning and SQL Server, you may be interested in these upcoming 4-day Practical Data Science courses, presented by Rafal Lukawiecki from Project Botticelli. In this classroom-based course, you will learn machine learning, data mining, some statistics, data preparation, and how to interpr...
Make tilegrams in R with tilegramsR
In this busy election season (here in the US, at least), we're seeing a lot of maps. Some states are red, some states are blue. But there's a problem: voters are not evenly distributed throughout the United States. In this map (the fivethirtyeight.com US election forecast on October 13) Montana (MT) is a large state shaded red, but only represent...
The Team Data Science Process
As more and more organizations are setting up teams of data scientists to make sense of the massive amounts of data they collect, the need grows for a standardized process for managing the work of those teams. To help with this, the data science team at Microsoft has drawn on their experience with large-scale data science projects to develop the ...