Publications by Daniel Oehm
Alone R package: Datasets from the survival TV series
I have been watching the survival TV series ‘Alone,’ where 10 survivalists are dropped in an extremely remote area and must fend for themselves. I am super impressed by their skills, endurance, and mental fortitude. To last 100 days in the Arctic winter living off the land is truly impressive. True to form, I’ve collected the data and I am ...
4515 sym R (1064 sym/4 pcs) 6 img
How to crop an image to a circle in R with {cropcircles}
I hadn’t found a super-easy-you-don’t-have-to-think-about-it way to crop an image to a circle in R with a transparent background. There’s this Stack Overflow but I wouldn’t call it straightforward. So, I wrote this small package to do what I wanted. In a nutshell, you pass a vector of image paths, which can be either local or from a URL l...
1840 sym R (747 sym/2 pcs) 4 img
survivoR v1.0 is now on CRAN
I’m happy to announce that survivoR v1.0 is now on CRAN. The package now contains all the features intended for the first major release. A big thank you to Carly Levitz for helping collate and test the data. This post details the major updates since v0.9.12. For a complete list of tables and features of the package, please visit the GitHub page...
5452 sym R (5656 sym/8 pcs) 6 img
Survivor Advantages: Dataset showcase for {survivoR}
Advantages were introduced to Survivor to give players an edge and to shake up the strategy. A successful play can help advance the player further in the game but can also make the player a target if others know about it. Advantages build uncertainty into the game and prompt players to adapt. Advantages, particularly hidden immunity idols, are now...
8234 sym R (1118 sym/2 pcs) 16 img
How to use multiple colour scales in ggplot with {ggnewscale}
For week 23 of Tidy Tuesday, the chart I wanted to make required two colour scales. For context, the dataset detailed Pride sponsors that also contributed to anti-LGBTQ+ politicians. TL;DR: I wanted to make some rainbows with rainbow colours if the company made the HRC business pledge and a neutral colour for the companies that hadn’t. I could us...
2427 sym R (1218 sym/3 pcs) 8 img
Survivor Confessionals Data: Dataset showcase for {survivoR}
Confessionals loosely represent a player’s screen time where they talk strategy and replay events. It is an imperfect measure but can indicate success in the game. It’s often used to show balance or imbalance in the editing. This is a high-level summary of confessionals, a showcase of the dataset, and an analysis of the edit for key demographic...
8254 sym 16 img
Advanced Survey Design and Application to Big Data
I like to describe official statistics as the All-Bran of statistics: it’s bland and a bit boring, but it is good for you. It is key for any government to manage the economy, provide services where they are needed, and monitor the growth of the nation. There are many facets to official statistics but where statistics plays a major role is in adva...
14977 sym R (9393 sym/51 pcs) 56 img
Improve Your Training Set with Unsupervised Learning
In my previous post, Advanced Survey Design and Application to Big Data, I mentioned that unsupervised learning can be used to generate a stratification variable. In this post I want to elaborate on this point and how they can work together to improve estimates and training data for predictive models. SRS and stratified samples: Consider the estimators o...
3137 sym 36 img
Confidentialise Your Data with the randomNames Package
Sensitive data has its restrictions for good reason. Personal data such as names and other identifiable information should be protected. Policies are in place to prevent any accidental data breach by governments and businesses. This can be a hurdle for data projects, particularly when socialising your work. A common technique is to strip the ind...
2533 sym R (3543 sym/12 pcs) 2 img
PCA vs Autoencoders for Dimensionality Reduction
There are a few ways to reduce the dimensions of large data sets to ensure computational efficiency, such as backwards selection or removing variables that exhibit high correlation or a high number of missing values, but by far the most popular is principal components analysis. A relatively new method of dimensionality reduction is the autoencoder. Autoenc...
7612 sym R (6260 sym/17 pcs) 42 img