Publications by John Mount

Campaign Response Testing no longer published on Udemy

08.06.2017

Our free video course Campaign Response Testing is no longer published on Udemy. It remains available for free on YouTube with all source code available from GitHub. I’ll try to correct bad links as I find them. Please read on for the reasons. Udemy recently unilaterally instituted a new policy on free courses: “When a free course has a ...

2419 sym

Managing intermediate results when using R/sparklyr

09.06.2017

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr. Handle management: Many sparklyr tasks involve creating intermediate or temporary tables. This can happen through dplyr::copy_to() and through dplyr::compute(). These handles can represen...

1509 sym R (2357 sym/6 pcs) 2 img

Use a Join Controller to Document Your Work

13.06.2017

This note describes a useful replyr tool we call a “join controller” (it is part of our “R and Big Data” series; please see here for the introduction, and here for one of our big data courses). When working on real world predictive modeling tasks in production, the ability to join data and document how you join data is paramount. There are...

11147 sym R (7640 sym/29 pcs) 4 img

An easy way to accidentally inflate reported R-squared in linear regression models

15.06.2017

Here is an absolutely horrible way to confuse yourself and get an inflated reported R-squared on a simple linear regression model in R. We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here. First let’s set up our problem with a data set ...

3263 sym R (3050 sym/22 pcs)

Non-Standard Evaluation and Function Composition in R

16.06.2017

In this article we will discuss composing standard-evaluation interfaces (SE) and composing non-standard-evaluation interfaces (NSE) in R. In R the package tidyeval/rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces. To use it you must know some of its structure and notation. Here are som...

5199 sym R (3238 sym/33 pcs)

wrapr Implementation Update

18.06.2017

Introduction: The development version of our R helper function wrapr::let() has switched from string-based substitution to abstract syntax tree based substitution (AST-based substitution, or language-based substitution). I am looking for some feedback from wrapr::let() users already doing substantial work with wrapr::let(). If you are already usin...

5961 sym R (1791 sym/17 pcs) 2 img

Please Consider Using wrapr::let() for Replacement Tasks

26.06.2017

From dplyr issue 2916. The following appears to work.
suppressPackageStartupMessages(library("dplyr"))
COL <- "homeworld"
starwars %>% group_by(.data[[COL]]) %>% head(n=1)
## # A tibble: 1 x 14
## # Groups: COL [1]
## name height mass hair_color skin_color eye_color birth_year
## <chr> <int> <dbl> <ch...

951 sym R (1769 sym/8 pcs)

Using wrapr::let() with tidyeval

28.06.2017

While going over some of the discussion related to my last post I came up with a really neat way to use wrapr::let() and rlang/tidyeval together. Please read on to see the situation and example. Suppose we want to parameterize over a couple of names, one denoting a variable coming from the current environment and one denoting a column name. Furth...

1065 sym R (810 sym/1 pcs)

Join Dependency Sorting

01.07.2017

In our latest installment of “R and big data” let’s again discuss the task of left joining many tables from a data warehouse using R and a system called “a join controller” (last discussed here). One of the great advantages to specifying complicated sequences of operations in data (rather than in code) is: it is often easier to transfor...

6272 sym R (6337 sym/16 pcs) 4 img

Working With R and Big Data: Use Replyr

06.07.2017

In our latest “R and Big Data” article we discuss replyr. Why replyr: replyr stands for REmote PLYing of big data for R. Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark). replyr allows users to work with Spark or database data similar to how they w...

5255 sym R (10838 sym/52 pcs)