Publications by free range statistics - R
Analysing large data on your laptop with a database and R by @ellis2013nz
The New York City Taxi & Limousine Commission’s open data of taxi trip records is rightly a go-to test piece for analytical methods for largish data. Check out, for example: The benchmark of interesting analysis is set by Todd W Schneider’s well-written and rightly famous blog post ‘Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Ven...
19512 sym R (10469 sym/5 pcs) 20 img
Analysing the effectiveness of tennis tournament seeding by @ellis2013nz
So, an exploration of how tennis tournament seeding and bracketing affect the end result has percolated to the top of my to-do list, inspired by the Melbourne Open currently in play. This is the first of what will probably be two posts on this topic. Today I wanted to look at the impact of seeding on the chance of the best players finishing f...
7401 sym R (10971 sym/7 pcs) 6 img 1 tbl
Lewis Carroll’s proposed rules for tennis tournaments by @ellis2013nz
Last week I wrote about the impact of seeding the draw in a tennis tournament. Seeding is one way to increase the chance of the top players making it to the final rounds of a single elimination tournament, leading to fairer outcomes and to a higher chance of the best matchups happening in the finals. Basically it’s to overcome this problem: ...
11540 sym R (16125 sym/1 pcs) 10 img
Body Mass Index by @ellis2013nz
BMI has an expectations management problem. Body Mass Index (BMI) is an attempt to give a quick back-of-envelope answer to the question “if someone weighs W kg, is that a lot or not very much?” Clearly the answer to that question has to take into account at a minimum the person’s height; in general, whatever may constitute a healthy weight, ...
14298 sym R (8795 sym/3 pcs) 6 img
Log transform or log link? And confounding variables. by @ellis2013nz
Last week I wrote about the relationship between weight and height in US adults, as seen in the US Centers for Disease Control and Prevention (CDC) Behavioral Risk Factor Surveillance System, an annual telephone survey of around 400,000 interviews. In particular, I tested the widely-circulated claim that Body Mass Index (BMI) exaggerates...
9858 sym R (5962 sym/3 pcs) 8 img 1 tbl
New Zealand Election Study webtool by @ellis2013nz
I’ve just finished updating and deploying a webtool that helps explore data from the New Zealand Election Study. I first built a version of this a few years back with just the 2014 wave of the study; today I’ve added the data from the time of the 2017 election and made a number of small improvements (e.g. getting the macrons back into the ‘M...
4181 sym 18 img
COVID-19 cumulative observed case fatality rate over time by @ellis2013nz
Preamble: I was slightly reluctant to add to the deluge of charts about the COVID-19 outbreak, but on the other hand making charts is one of the ways I relax and try to understand what’s going on around me. So first, to get out of the way my only advice at this point: wash hands frequently, for 20 seconds at a time, with plenty of soap; work at ...
3849 sym R (3876 sym/1 pcs) 4 img
Impact of a country’s age breakdown on COVID-19 case fatality rate by @ellis2013nz
Italy is routinely and correctly described as particularly vulnerable to COVID-19 because of its older age profile. I set out to understand for myself how important this factor is. What would happen if the case fatality rates observed in Italy were applied to demographic profiles of other countries? Fatality rates by age and sex so far in Italy: T...
4997 sym R (5712 sym/1 pcs) 8 img
How to make that crazy Fox News y axis chart with ggplot2 and scales by @ellis2013nz
Possibly you have seen this graphic circulating on social media. At first it doesn’t look too remarkable, but then you notice the vertical axis. Oh. The gridlines are equally spaced on the page, but sometimes the same space represents 30 people, sometimes 10, and sometimes 50. It isn’t even strictly increasing – which might be expected if s...
4434 sym R (2754 sym/4 pcs) 8 img
Pragmatic prediction intervals from a quasi-likelihood GLM by @ellis2013nz
Today’s blog comes with two lessons: a statistical one, and one on troubleshooting. Rubber-duck debugging: Troubleshooting lesson first. The problem described below has been causing me grief for several days. I have an important and time-critical use case at work where I need to impute lots of complex missing data using the sort of model I’m a...
9234 sym R (1841 sym/4 pcs) 4 img