Publications by John Mount
Returning to Tides
Fred Viole shared a great “data only” R solution to the forecasting tides problem. The methodology comes from a finance perspective, and has some great associated notes and articles. This gives me a chance to comment on the odd relation between prediction and profit in finance. If there really was a trade-able item with low trade costs and ...
1623 sym 2 img
vtreat up on PyPi
I am excited to announce vtreat is now available for Python on PyPi, in addition for R on CRAN. vtreat is: A data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. vtreat prepares variables so that data has fewer exceptional cases, making it easier to safely use models in producti...
1276 sym 6 img
Speaking at BARUG
We will be speaking at the Tuesday, September 3, 2019 BARUG. If you are in the Bay Area, please come see us. Nina Zumel & John Mount Practical Data Science with R Practical Data Science with R (Zumel and Mount) was one of the first, and most widely-read books on the practice of doing Data Science using R. We have been working hard on an improved ...
1036 sym
What is vtreat?
vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner. vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other ...
2425 sym 2 img
Introducing data_algebra
This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable)...
12441 sym R (11535 sym/36 pcs) 2 img 4 tbl
It is Time for CRAN to Ban Package Ads
NPM (a popular Javascript package repository) just banned package advertisements. I feel the CRAN repository should do the same. Not all R-users are fully aware of package advertisements. But they clutter up work, interfere with reproducibility, and frankly are just wrong. Here is an example which could be considered to contain advertisements: ...
2538 sym
Why R?
I was working with our copy editor on Appendix A of Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019, and ran into this little point (unfortunately) buried in the back of the book. In our opinion the R ecosystem is the fastest path to substantial data science, statistical, and machine learning accomplishment. This is why w...
811 sym
Advanced Data Reshaping in Python and R
This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”). The advantages of data_algebra and cdata are: The user specifies their desired t...
7438 sym R (6478 sym/26 pcs) 5 tbl
Practical Data Science with R update
Just got the following note from a new reader: Thank you for writing Practical Data Science with R. It’s challenging for me, but I am learning a lot by following your steps and entering the commands. Wow, this is exactly what Nina Zumel and I hoped for. We wish we could make everything easy, but an appropriate amount of challenge is required...
995 sym
The Advantages of Record Transform Specifications
Nina Zumel had a really great article on how to prepare a nice Keras performance plot using R. I will use this example to show some of the advantages of cdata record transform specifications. The model performance data from Keras is in the following format: # R code library(wrapr) df <- wrapr::build_frame( "val_loss" , "val_acc", "loss" ,...
2515 sym R (6281 sym/19 pcs) 2 img 2 tbl