Publications by John Mount
What is vtreat?
vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner. vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other ...
2425 sym 2 img
Introducing data_algebra
This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable)...
12441 sym R (11535 sym/36 pcs) 2 img 4 tbl
It is Time for CRAN to Ban Package Ads
NPM (a popular Javascript package repository) just banned package advertisements. I feel the CRAN repository should do the same. Not all R-users are fully aware of package advertisements. But they clutter up work, interfere with reproducibility, and frankly are just wrong. Here is an example which could be considered to contain advertisements: ...
2538 sym
Why R?
I was working with our copy editor on Appendix A of Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019, and ran into this little point (unfortunately) buried in the back of the book. In our opinion the R ecosystem is the fastest path to substantial data science, statistical, and machine learning accomplishment. This is why w...
811 sym
Advanced Data Reshaping in Python and R
This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”). The advantages of data_algebra and cdata are: The user specifies their desired t...
7438 sym R (6478 sym/26 pcs) 5 tbl
Practical Data Science with R update
Just got the following note from a new reader: Thank you for writing Practical Data Science with R. It’s challenging for me, but I am learning a lot by following your steps and entering the commands. Wow, this is exactly what Nina Zumel and I hoped for. We wish we could make everything easy, but an appropriate amount of challenge is required...
995 sym
The Advantages of Record Transform Specifications
Nina Zumel had a really great article on how to prepare a nice Keras performance plot using R. I will use this example to show some of the advantages of cdata record transform specifications. The model performance data from Keras is in the following format: # R code library(wrapr) df <- wrapr::build_frame( "val_loss" , "val_acc", "loss" ,...
2515 sym R (6281 sym/19 pcs) 2 img 2 tbl
Preparing Data for Supervised Classification
Nina Zumel has been polishing up new vtreat for Python documentation and tutorials. They are coming out so good that I find to be fair to the R community I must start to back-port this new documentation to vtreat for R. vtreat is a package for systematically preparing data for supervised machine learning tasks such as classification or regressi...
2178 sym
How to Prepare Data
Real world data can present a number of challenges to data science workflows. Even properly structured data (each interesting measurement already landed in distinct columns), can present problems, such as missing values and high cardinality categorical variables. In this note we describe some great tools for working with such data. For an examp...
3678 sym R (312 sym/1 pcs)
New vtreat Documentation (Starting with Multinomial Classification)
Nina Zumel finished some great new documentation showing how to use Python vtreat to prepare data for multinomial classification mode. And I have finally finished porting the documentation to R vtreat. So we now have good introductions on how to use vtreat to prepare data for the common tasks of: Regression: R regression example, Python regres...
1550 sym