Publications by Peter's stats stuff - R
Bootstrap and cross-validation for evaluating modelling strategies
Modelling strategies: I’ve been re-reading Frank Harrell’s Regression Modeling Strategies, a must-read for anyone who ever fits a regression model. Be prepared, though: depending on your background, you might get 30 pages in and suddenly become convinced you’ve been doing nearly everything wrong, which can be disturbing. I wanted...
8644 sym R (8873 sym/1 pcs) 4 img
Presentation slides on using graphics
Last week I gave a seminar for around 40 analysts from another government agency on using graphics to represent data. In such presentations, I usually focus on the different purposes of graphics: exploratory; as part of the analysis workflow (eg as diagnosis for statistical models); and for presenting results. Exactly what the purpose is makes quit...
2690 sym 2 img
Monthly Regional Tourism Estimates
A big 18-month project at work culminated today in the release of new Monthly Regional Tourism Estimates for New Zealand. Great work by the team in an area where we’ve pioneered the way, using administrative data from electronic transactions to supplement traditional sources in producing official statistics. Here’s a screenshot from one of ...
1668 sym 2 img
International Household Income Inequality data
I’m at the New Zealand Association of Economists annual conference in Auckland. The opening keynote speech was from James K. Galbraith on a global view of inequality. He showed a variety of results from the University of Texas Inequality Project’s Estimated Household Income Inequality dataset, which I hadn’t realised existed before. It’...
2026 sym R (1673 sym/3 pcs) 4 img
Animated world inequality map
In my last post I had a first look (for me) at Estimated Household Income Inequality data from the University of Texas Inequality Project. These data came to my attention when Professor James K. Galbraith used them in his keynote presentation to the 2016 New Zealand Association of Economists conference. Some of the slides associated with these...
3651 sym R (6682 sym/1 pcs) 2 img
nzelect 0.2.0 on CRAN
Introduction: The nzelect R package, which I first introduced in a blog post in April, is now available on CRAN. The version number is 0.2.0. The difference from version 0.1.0 is sizeable – all the 2013 census data has been removed and is now in a companion package, nzcensus. This is for ease of development and maintenance, and to allow organisa...
2959 sym R (1657 sym/1 pcs) 2 img
nzcensus on GitHub
Introduction: A few months back, the first, pre-CRAN versions of my nzelect package included some data from the New Zealand Census 2013. As noted in my last post, I’ve now split this into a separate nzcensus package, for ease of development and maintenance and to allow nzelect to fit within CRAN size restrictions. The nzcensus package has a set ...
15283 sym R (9545 sym/10 pcs) 10 img
Elastic net regularization of a model of burned calories
Deal with feature selection and collinearity: Recently I’ve been making more use of elastic net regularization as a way of fitting linear models to data when I have more candidate explanatory variables than I know what to do with and some of them are collinear, ie their information doubles up on what is in other variables. Elastic net regularizat...
13430 sym R (7148 sym/12 pcs) 12 img
Dual axes time series plots may be ok sometimes after all
Are they really as bad as all that? I’ve been mulling over time series charts with two different y axes, which are widely deprecated in the world of people who see ourselves as professional data workers. Looking down on dual axis time series charts is one of the things that mark one as a serious data visualiser – after shaking ou...
15058 sym R (5009 sym/2 pcs) 20 img
Dual axes time series plots with various more awkward data
In my most recent blog post I introduced the dualplot() R function, which allows you to create time series plots with two different scales on the vertical axes in a way that minimises the potential problems of misinterpretation. See that earlier post for a discussion of the pros and cons of the whole approach, which I won’t repeat here. I’ve...
9005 sym R (3686 sym/7 pcs) 14 img