Publications by John Mount

How to Avoid the dplyr Dependency Driven Result Corruption

06.12.2017

In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases. To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any value in the same mutate it is formed...

981 sym

Getting started with seplyr

14.12.2017

A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data, I think the seplyr package can be a valuable tool. For how and why, please check out our new introductory article.

632 sym 2 img

More Pipes in R

16.12.2017

Was enjoying Gabriel’s article Pipes in R Tutorial For Beginners and wanted to call attention to a few more pipes in R (not all for beginners). data.table has essentially used the square bracket sequence “][” in a manner equivalent to piping in R since about 2006. Here is an example. The Bizarro Pipe “->.;” has always been possible in R,...

1250 sym 2 img

How to Greatly Speed Up Your Spark Queries

20.12.2017

For some time we have been teaching R users “when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis.” The idea behind the advice is: working with fewer columns makes for quicker queries. The issue arises because wide tables (200 to ...

2081 sym R (2492 sym/8 pcs) 4 img

Plotting Deep Learning Model Performance Trajectories

23.12.2017

I am excited to share a new deep learning model performance trajectory graph. Here is an example produced based on Keras in R using ggplot2: The ideas include: We plot model performance as a function of training epoch, data set (training and validation), and metric. For legibility we facet on metric, and facets are adjusted so all facets have t...

1461 sym 2 img

Announcing rquery

28.12.2017

We are excited to announce the rquery R package. rquery is Win-Vector LLC’s currently in development big data query tool for R. rquery supplies a set of operators inspired by Edgar F. Codd’s relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production). As an example: rquery operato...

1510 sym R (704 sym/2 pcs)

Big cdata News

04.01.2018

I have some big news about our R package cdata. We have greatly improved the calling interface and Nina Zumel has just written the definitive introduction to cdata. cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago. However, cdat...

2012 sym 2 img

New wrapr R pipeline feature: wrapr_applicable

06.01.2018

The R package wrapr now has a neat new feature: “wrapr_applicable”. This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery. A small effort in making a package “wrapr aware” app...

778 sym 2 img

rquery: Fast Data Manipulation in R

09.01.2018

Win-Vector LLC recently announced the rquery R package, an operator-based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings. Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr’s new “wrapr_applicable” feature). This is where rquery works on in-m...

3991 sym R (1059 sym/1 pcs) 4 img

Setting up RStudio Server quickly on Amazon EC2

13.01.2018

I have recently been working on projects using Amazon EC2 (Elastic Compute Cloud) and RStudio Server. I thought I would share some of my working notes. Amazon EC2 supplies near instant access to on-demand disposable computing in a variety of sizes (billed in hours). RStudio Server supplies an interactive user interface to your remote R environme...

4486 sym R (144 sym/1 pcs) 10 img