Publications by Peter's stats stuff - R

New Zealand Data & APIs on GitHub

31.07.2015

New data listing for New Zealand. Wellington’s PrototypeAlex has created a new GitHub repository aiming to list data about New Zealand. My first reaction was “hmm, will this be an improvement?” After all, we already have data.govt.nz which is meant to be the definitive aggregator of government datasets; we have the Figure.NZ (recently rebra...

7905 sym R (2399 sym/2 pcs) 4 img

Simulating backgammon players’ Elo ratings

06.08.2015

Probabilities of winning with a given rating. Backgammon clubs and on-line forums use a modified form of the Elo rating system to keep track of how well individuals have played and draw inferences about their underlying strength. The higher the rating, the stronger the player. Players with higher ratings are inferred to be stronger than those wi...

5100 sym R (5320 sym/3 pcs) 4 img

Importing the New Zealand Income Survey SURF

14.08.2015

The quest for income microdata. For a separate project, I’ve been looking for source data on income and wealth inequality. Not aggregate data like Gini coefficients or the percentage of income earned by the bottom 20% or top 1%, but the sources used to calculate those things. Because it’s sensitive personal financial data either from surveys ...

6407 sym R (10993 sym/10 pcs) 6 img 6 tbl

A better way of visualising income distributions with zeroes and negatives

20.08.2015

I wasn’t happy with my visualisation of individual incomes from the New Zealand income survey. Because it used a logarithmic scale to improve readability, in effect all zero and negative values are excluded from the data. Whenever I throw out data, my tail goes bushy… there has to be a better way. Those zero and negative values are an impo...

3717 sym R (1216 sym/2 pcs) 4 img

Getting started in applied statistics / datascience

29.08.2015

Where to start to start? I was recently asked by a colleague manager from another organisation what direction they could give to a staff member interested in building skills in the whole “big data” thing. A search of the web shows hundreds if not thousands of sites and blog posts aimed at budding data scientists, but most of them seem (to my...

10704 sym R (2026 sym/1 pcs) 2 img

Creating a scale transformation

04.09.2015

A better transformation than my better transformation. In an earlier post I put forward the idea of a modulus power transform – basically the square root (or other similar power transformation) of the absolute value of a variable like income, followed by restoring the sign to it. The idea is to avoid throwing away values of zero or less, which ...

3585 sym R (1627 sym/4 pcs) 8 img
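
The modulus power transform described in the excerpt above (a power of the absolute value, with the sign restored afterwards) can be sketched in a few lines of R. This is a minimal illustration of that idea only, not the code from the post and not the later method the post goes on to develop. The function names `modulus_power` and `modulus_power_inverse`, the parameter `lambda`, and the income values are all assumptions made up for this example.

```r
# Modulus power transform: raise the absolute value to a power, then
# restore the sign, so zero and negative values are kept.
# Function and parameter names are illustrative, not from the original post.
modulus_power <- function(x, lambda = 0.5) {
  sign(x) * abs(x)^lambda
}

# Inverse transform, useful for mapping transformed values back to
# the original units (for example when labelling an axis).
modulus_power_inverse <- function(y, lambda = 0.5) {
  sign(y) * abs(y)^(1 / lambda)
}

# Made-up values for illustration only; includes negatives and zero.
incomes <- c(-400, -1, 0, 1, 100, 10000)

transformed <- modulus_power(incomes)
print(transformed)

# Round trip back to the original values
print(modulus_power_inverse(transformed))
```

With `lambda = 0.5` this is the signed square root. Unlike a logarithmic scale, it is defined for zero and negative inputs, so no observations have to be dropped before plotting.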

Transforming the breaks to match a scale

06.09.2015

Something missing. In my last post I developed a new scale transformation for R using the approach and platform from the {ggplot2} and {scales} packages. I implemented a method proposed in 1980 by John and Draper that does some of the job of a logarithmic transform in reducing the dominance on the page of the large values, but is also continuous through ze...

4517 sym R (2595 sym/6 pcs) 10 img

Sampling distribution of Gini coefficient

11.09.2015

Inequality measures. Part of my motivation for importing the New Zealand Income Survey (NZIS) simulated unit record file provided by Statistics New Zealand was to explore the characteristics of various measures of inequality. In particular, I’m interested in what happens, as sample size changes, to the sampling distributions of the following summa...

9867 sym R (2627 sym/3 pcs) 8 img

Autocorrelation functions of materially different time series

18.09.2015

Comparing two timeseries-generating blackboxes. This question on Cross-Validated got me interested. I gave a fairly inadequate answer and want to explore a few of the issues. Actually, I have a plan for an effective technique which is what I think the original post was asking for, but I need to check out a few things first. The challenge, if I u...

5073 sym R (1917 sym/2 pcs) 6 img

How to compare two blackbox timeseries generators?

19.09.2015

Comparing two timeseries-generating blackboxes. In my last post I talked about how this question on Cross-Validated got me interested. Basically the challenge is to compare two data generating models to see if they are essentially the same. Since then I’ve noticed that this problem comes up in a number of other contexts too; for example, this N...

6867 sym R (2858 sym/5 pcs) 10 img