Publications by John Mount
How to Avoid the dplyr Dependency Driven Result Corruption
In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases. To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any value in the same mutate it is formed...
981 sym
Getting started with seplyr
A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working R and big data I think the seplyr package can be a valuable tool. For how and why, please check out our new introductory article. Related To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Bl...
632 sym 2 img
More Pipes in R
Was enjoying Gabriel’s article Pipes in R Tutorial For Beginners and wanted call attention to a few more pipes in R (not all for beginners). data.table has essentially used the square bracket sequence “][” in a manner equivalent to piping in R since about 2006. Here is an example. The Bizarro Pipe “->.;” has always been possible in R,...
1250 sym 2 img
How to Greatly Speed Up Your Spark Queries
For some time we have been teaching R users “when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis.” The idea behind the advice is: working with fewer columns makes for quicker queries. photo: Jacques Henri Lartigue 1912 The issue arises because wide tables (200 to ...
2081 sym R (2492 sym/8 pcs) 4 img
Plotting Deep Learning Model Performance Trajectories
I am excited to share a new deep learning model performance trajectory graph. Here is an example produced based on Keras in R using ggplot2: The ideas include: We plot model performance as a function of training epoch, data set (training and validation), and metric. For legibility we facet on metric, and facets are adjusted so all facets have t...
1461 sym 2 img
Announcing rquery
We are excited to announce the rquery R package. rquery is Win-Vector LLC‘s currently in development big data query tool for R. rquery supplies set of operators inspired by Edgar F. Codd‘s relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production). As an example: rquery operato...
1510 sym R (704 sym/2 pcs)
Big cdata News
I have some big news about our R package cdata. We have greatly improved the calling interface and Nina Zumel has just written the definitive introduction to cdata. cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago. However, cdat...
2012 sym 2 img
New wrapr R pipeline feature: wrapr_applicable
The R package wrapr now has a neat new feature: “wrapr_applicable”. This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery. A small effort in making a package “wrapr aware” app...
778 sym 2 img
rquery: Fast Data Manipulation in R
Win-Vector LLC recently announced the rquery R package, an operator based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings. Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr‘s new “wrapr_applicable” feature). This is where rquery works on in-m...
3991 sym R (1059 sym/1 pcs) 4 img
Setting up RStudio Server quickly on Amazon EC2
I have recently been working on projects using Amazon EC2 (elastic compute cloud), and RStudio Server. I thought I would share some of my working notes. Amazon EC2 supplies near instant access to on-demand disposable computing in a variety of sizes (billed in hours). RStudio Server supplies an interactive user interface to your remote R environme...
4486 sym R (144 sym/1 pcs) 10 img