Publications by John Mount

Some Announcements

24.10.2017

Some Announcements: Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”, Sunday, October 29, 2017 10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area). ODSC West 2017 is soon. It is our favorite conference and we will be giving both a workshop and a talk. Thursday Nov 2 2017, 2:00 PM...

1466 sym

Big Data Transforms

29.10.2017

As part of our consulting practice, Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using R and substantial data stores (relational databases such as PostgreSQL, or big data systems such as Spark). Often we come to a point where we or a partner realize: “the design would be a ...

2273 sym R (3459 sym/19 pcs) 2 img

Let X=X in R

03.11.2017

Our article “Let’s Have Some Sympathy For The Part-time R User” includes two points: (1) sometimes you have to write parameterized or re-usable code; (2) the methods for doing this should be easy and legible. The first point feels abstract, until you find yourself wanting to re-use code on new projects. As for the second point: I feel the wrapr p...

4266 sym R (4222 sym/11 pcs) 2 img

Update on coordinatized or fluid data

12.11.2017

We have just released a major update of the cdata R package to CRAN. If you work with R and data, now is the time to check out the cdata package. Among the changes in the 0.5.* version of the cdata package: all coordinatized data or fluid data operations are now in the cdata package (no longer split between the cdata and replyr packages). The tran...

1946 sym 2 img

Data Wrangling at Scale

15.11.2017

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template). Please check it out.

520 sym 2 img

RStudio Keyboard Shortcuts for Pipes

18.11.2017

I have just released some simple RStudio add-ins that are great for creating keyboard shortcuts when working with pipes in R. You can install the add-ins from here (which also includes both installation instructions and use instructions/examples).

652 sym 6 img

Arbitrary Data Transforms Using cdata

22.11.2017

We have been writing a lot on higher-order data transforms lately: “Coordinatized Data: A Fluid Data Specification”, “Data Wrangling at Scale”, “Fluid Data”, and “Big Data Transforms”. What I want to do now is “write a bit more, so I finally feel I have been concise.” The cdata R package supplies general data transform operators. The whole system is bas...

4133 sym R (1199 sym/7 pcs) 2 img 2 tbl

Vectorized Block ifelse in R

27.11.2017

Win-Vector LLC has been working on porting some significant large-scale production systems from SAS to R. From this experience we want to share how to simulate, in R with Apache Spark (via sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure. When porting code from one language to another you hope the expressive powe...

2498 sym

Win-Vector LLC announces new “big data in R” tools

29.11.2017

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN): partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache ...

2201 sym 2 img

Please inspect your dplyr+database code

02.12.2017

A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements. If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not using any values created inside a dplyr::mutate() sta...

2140 sym