Publications by That’s so Random

Designing our bathroom with R

10.08.2016

R has been an indispensable tool since I started working with it about five years ago. Of course, in my day job as a data scientist I couldn’t live without it, but it has also proved to be a great aid in private life. Recently we bought our first house and R came to the rescue several times in the process. We compared the impact of different mortgag...

2209 sym R (3794 sym/1 pcs) 6 img

Building a column selecter

27.11.2016

Maybe the following sounds familiar. You have a large data set with many, many columns, most of which are irrelevant to you. Typically, this is a dump from a database or the full set extracted from an API. Several times I found myself spending the better part of an afternoon going back and forth between a view of the data where I tried to figure out which colu...

2515 sym R (1216 sym/3 pcs) 2 img

Introducing padr

17.01.2017

I am happy to introduce the padr package, which is now available on CRAN. If you frequently work with data containing a timestamp, especially automatically created data, you might find this package helpful. It solves two problems that you can be confronted with when preparing datetime data for analysis. First, data is often recorded on too low a ...

3436 sym R (934 sym/7 pcs) 2 img

A wrapper around nested ifelse

07.02.2017

The ifelse function is the way to do vectorised if-then-else in R. It was one of the first cool things I learned to do in R a few years back, picked up from Norman Matloff’s The Art of R Programming. When you have more than one if-then statement, you just nest multiple ifelse calls before you reach the else. set.seed(0310) x <- runif(1000, 1, 20) y <...

2046 sym R (1536 sym/7 pcs)

padr::pad does now do group padding

18.02.2017

A few weeks ago padr was introduced on CRAN, allowing you to quickly get datetime data ready for analysis. If you missed this, see the introduction blog or vignette("padr") for a general introduction. In v0.2.0 the pad function is extended with a group argument, which makes your life a lot easier when you want to do padding within groups. In...

2935 sym R (868 sym/2 pcs)

Tree-based univariate testing

26.02.2017

When building a predictive model it is a good idea to do a univariate analysis before throwing the whole bunch into a complex algorithm. This way we get a feel for the potential contribution of each predictor. When a lot of predictors are available one can often make a first selection and only use predictors that show univariate predictive power. ...

5809 sym R (3536 sym/11 pcs) 8 img

Preparing Datetime Data for Analysis with padr and dplyr

19.03.2017

Two months ago padr was introduced, followed by an improved version that allowed for applying pad at the group level. See the introduction blogs or the vignette("padr") for more package information. In this blog I give four more elaborate examples of how to go from raw data to insight with padr, dplyr and ggplot2. They might serve as recipes for tim...

3212 sym R (1636 sym/8 pcs) 8 img

Binning Outliers in a Histogram

26.04.2017

I guess we all use it, the good old histogram. It is one of the first things we are taught in Introduction to Statistics and is routinely applied whenever we come across a new continuous variable. However, it easily gets messed up by outliers, putting most of the data into a single bin or a few bins, and scattering the outliers barely visible over the x-ax...

2158 sym R (4612 sym/5 pcs) 8 img

Here is the new padr

16.05.2017

I am very happy to announce v0.3.0 of the padr package, which was introduced in January. As requested by many, you are now able to use intervals whose unit is different from 1. In earlier versions the eight interval values only allowed for a single unit (e.g. year, day, hour). Now you can use any time period that is accepted by seq.Date or ...

5461 sym R (874 sym/8 pcs) 2 img

Check Data Quality with padr

26.06.2017

The padr package was designed to prepare datetime data for analysis. That is, to take raw, timestamped data, and quickly convert it into a tidy format that can be analyzed with all the tidyverse tools. Recently, a colleague and I discovered a second use for the package that I had not anticipated: checking data quality. Every analysis should conta...

2466 sym R (1503 sym/16 pcs) 4 img