Publications by John Mount
data.table is Much Better Than You Have Been Told
There is interest in converting relational query languages (that work both over SQL databases and on local data) into data.table commands, to take advantage of data.table‘s superior performance. Obviously if one wants to use data.table it is best to learn data.table. But if we want code that can run multiple places a translation layer may be ...
4025 sym R (607 sym/2 pcs) 4 img
My Favorite data.table Feature
My favorite R data.table feature is the “by” grouping notation when combined with the := notation. Let’s take a look at this powerful notation. First, let’s build an example data.frame. d <- wrapr::build_frame( "group" , "value" | "a" , 1L | "a" , 2L | "b" , 3L | "b" , 4L ) knitr::...
2125 sym R (1232 sym/7 pcs) 1 tbl
Replicating a Linear Model
For a few of my commercial projects I have been in the seemingly strange place being asked to port a linear model from one data science system to another. Now I try to emphasize that it is better going forward to port procedures and build new models with training data. But sometimes that is not possible. Solving this problem for linear and logist...
3893 sym R (1268 sym/18 pcs) 4 tbl
Programming Over lm() in R
Here is simple modeling problem in R. We want to fit a linear model where the names of the data columns carrying the outcome to predict (y), the explanatory variables (x1, x2), and per-example row weights (wt) are given to us as strings. Lets start with our example data and parameters. The point is: we are assuming the data and parameters come t...
4285 sym R (1964 sym/18 pcs) 1 tbl
Some Details on Running xgboost
While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation). In doing that I ran into one more avoidable but strange issue in using xgboost: when run for a small number ...
4931 sym R (1734 sym/6 pcs) 4 tbl
Big News: Porting vtreat to Python
We at Win-Vector LLC have some big news. We are finally porting a streamlined version of our R vtreat variable preparation package to Python. vtreat is a great system for preparing messy data for suprevised machine learning. The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to thei...
1693 sym
R Books Discount!
We, the community of Manning R and data science authors, have talked Manning into offering a catalog-wide 40% discount on all books. Please take a look at some great deals on some great technical books here: http://mng.bz/adRj ! Related To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog....
632 sym 2 img
A Kind Note That We Really Appreciate
The following really made my day. I tell every data scientist I know about vtreat and urge them to read the paper. Jason Wolosonovich Jason, thanks for your support and thank you so much for taking the time to say this (and for your permission to quote you on this). For those interested the R version of vtreat can be found here, the paper can b...
1034 sym
A Comment on Data Science Integrated Development Environments
A point that differs from our experience struck us in the recent note: A development environment specifically tailored to the data science sector on the level of RStudio, for example, does not (yet) exist. “What’s the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA” Amit Ghosh Actually, Python has a large number...
2381 sym
Some Notes on GNU Licenses in R Packages
I was recently asked if Win-Vector LLC would move the R wrapr package from a GPL-3 license to an LGPL license. In the end I decided to move wrapr distribution to a “GPL-2 | GPL-3” license. This means the package is now available under both GPL-2 and GPL-3 licensing, allowing the user to pick which of these two licenses they wish to accept th...
4431 sym 2 img