Publications by John Mount

What is vtreat?

14.08.2019

vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner. vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other ...

2425 sym 2 img

Introducing data_algebra

26.08.2019

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable)...

12441 sym R (11535 sym/36 pcs) 2 img 4 tbl

It is Time for CRAN to Ban Package Ads

30.08.2019

NPM (a popular JavaScript package repository) just banned package advertisements. I feel the CRAN repository should do the same. Not all R users are fully aware of package advertisements. But they clutter up work, interfere with reproducibility, and frankly are just wrong. Here is an example which could be considered to contain advertisements: ...

2538 sym

Why R?

30.08.2019

I was working with our copy editor on Appendix A of Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019, and ran into this little point (unfortunately) buried in the back of the book. In our opinion the R ecosystem is the fastest path to substantial data science, statistical, and machine learning accomplishment. This is why w...

811 sym

Advanced Data Reshaping in Python and R

04.09.2019

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through the use of coordinatized data concepts (relying heavily on Codd’s “rule of access”). The advantages of data_algebra and cdata are: The user specifies their desired t...

7438 sym R (6478 sym/26 pcs) 5 tbl

Practical Data Science with R update

15.09.2019

Just got the following note from a new reader: “Thank you for writing Practical Data Science with R. It’s challenging for me, but I am learning a lot by following your steps and entering the commands.” Wow, this is exactly what Nina Zumel and I hoped for. We wish we could make everything easy, but an appropriate amount of challenge is required...

995 sym

The Advantages of Record Transform Specifications

18.09.2019

Nina Zumel had a really great article on how to prepare a nice Keras performance plot using R. I will use this example to show some of the advantages of cdata record transform specifications. The model performance data from Keras is in the following format: # R code library(wrapr) df <- wrapr::build_frame( "val_loss" , "val_acc", "loss" ,...

2515 sym R (6281 sym/19 pcs) 2 img 2 tbl

Preparing Data for Supervised Classification

24.09.2019

Nina Zumel has been polishing up new vtreat for Python documentation and tutorials. They are coming out so well that, to be fair to the R community, I find I must start back-porting this new documentation to vtreat for R. vtreat is a package for systematically preparing data for supervised machine learning tasks such as classification or regressi...

2178 sym

How to Prepare Data

26.09.2019

Real-world data can present a number of challenges to data science workflows. Even properly structured data (each interesting measurement already landed in distinct columns) can present problems, such as missing values and high-cardinality categorical variables. In this note we describe some great tools for working with such data. For an examp...

3678 sym R (312 sym/1 pcs)

New vtreat Documentation (Starting with Multinomial Classification)

01.10.2019

Nina Zumel finished some great new documentation showing how to use Python vtreat to prepare data for multinomial classification mode. And I have finally finished porting the documentation to R vtreat. So we now have good introductions on how to use vtreat to prepare data for the common tasks of: Regression: R regression example, Python regres...

1550 sym