Publications by kjytay
All the (NBA) box scores you ever wanted
In a previous post, I showed how one can scrape top-level NBA game data from BasketballReference.com. In the post after that, I demonstrated how to scrape play-by-play data for one game. After writing those posts, I thought to myself: why not do both? And that is what I did: scrape all the box scores for the 2017-18 NBA season and save them to...
8054 sym R (15606 sym/16 pcs) 22 img
Using emojis as scatterplot points
Recently I wanted to learn how to use emojis as points in a scatterplot. It seems like the emojifont package is a popular way to do it. However, I couldn’t seem to get it to work on my machine (perhaps I need to install the font manually?). The other package I found was emoGG; this post shows how to use this package. (For another example...
2083 sym R (2067 sym/8 pcs) 8 img
A deep dive into glmnet: offset
I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation. In this post, we will look at the offset option. For reference, here is the full signature of the glmnet function: glmnet(x, y, family=c("gaussian","binomial","p...
4415 sym R (946 sym/3 pcs) 92 img
pcLasso: a new method for sparse regression
I’m excited to announce that my first package has been accepted to CRAN! The package pcLasso implements principal components lasso, a new method for sparse regression which I’ve developed with Rob Tibshirani and Jerry Friedman. In this post, I will give a brief overview of the method and some starter code. (For an in-depth description and ela...
3866 sym R (490 sym/5 pcs) 82 img
Quantile regression in R
Quantile regression: what is it? Let y be some response variable of interest, and let x be a vector of features or predictors that we want to use to model the response. In linear regression, we are trying to estimate the conditional mean function, E[y | x], by a linear combination of the features. While the conditional mean function is often what we want t...
5713 sym R (3286 sym/8 pcs) 37 img
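As a rough illustration of how a quantile differs from the mean as an estimation target, here is a minimal base R sketch. It is not taken from the post: the check (pinball) loss, the intercept-only setup with no features, the simulated data, and the names check_loss, obj and q_hat are all assumptions made for illustration.

```r
# Check (pinball) loss: positive residuals are weighted by tau,
# negative residuals by (1 - tau).
check_loss <- function(u, tau) u * (tau - (u < 0))

set.seed(1)
y <- rexp(1001)   # skewed simulated response; no features (intercept-only)
tau <- 0.9

# Total check loss as a function of a candidate value q.
obj <- function(q) sum(check_loss(y - q, tau))

# The objective is convex in q, so a one-dimensional search is enough.
q_hat <- optimize(obj, interval = range(y))$minimum

# Compare the minimiser with the sample quantile and the sample mean.
c(check_loss_minimiser = q_hat,
  sample_quantile = unname(quantile(y, tau)),
  sample_mean = mean(y))
```

The minimiser of the check loss should land close to the sample 0.9-quantile and well above the sample mean, which is the kind of gap that motivates modelling quantiles rather than only the mean.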
Plots within plots with ggplot2 and ggmap
Once in a while, you might find yourself wanting to embed one plot within another plot. ggplot2 makes this really easy with the annotation_custom function. The following example illustrates how you can achieve this. (For all the code in one R file, click here.) Let’s generate some random data and make a scatterplot along with a smoothed estimat...
2152 sym R (1634 sym/6 pcs) 16 img
Many ways to do the same thing: linear regression
One feature of R (could be positive, could be negative) is that there are many ways to do the same thing. In this post, I list out the different ways we can get certain results from a linear regression model. Feel free to comment if you know more ways other than those listed! In what follows, we will use the linear regression object lmfit: data(m...
891 sym R (539 sym/4 pcs)
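To give a flavour of the "many ways" idea, here is a small sketch in base R. The excerpt cuts off before naming the dataset and model, so the use of mtcars and the formula mpg ~ hp + wt are assumptions for illustration only, not the post's actual setup.

```r
# Assumed example data and model (the excerpt is truncated here).
data(mtcars)
lmfit <- lm(mpg ~ hp + wt, data = mtcars)

# Three ways to extract the fitted coefficients.
b1 <- coef(lmfit)
b2 <- coefficients(lmfit)
b3 <- lmfit$coefficients

# Three ways to extract the residuals.
r1 <- resid(lmfit)
r2 <- residuals(lmfit)
r3 <- lmfit$residuals

# Each group of extractors should agree.
all.equal(b1, b2)
all.equal(b1, b3)
all.equal(r1, r2)
all.equal(r1, r3)
```

The accessor functions (coef, resid) are generally the safer choice, since they work across model classes, whereas the $ form depends on how a particular object happens to be stored.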
Testing numeric variables for NA/NaN/Inf
In R, a numeric variable is either a number (like 0, 42, or -3.14), or one of 4 special values: NA, NaN, Inf or -Inf. It can be hard to remember how the is.x functions treat each of the special values, especially NA and NaN! The table below summarizes how each of these values is treated by different base R functions. Functions are listed in alpha...
1089 sym 1 tbl
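A quick way to check this behaviour yourself is sketched below in base R. It covers only four of the is.x functions, and the variable names vals and checks are hypothetical; the post's own table may list more functions.

```r
# One ordinary number plus the four special numeric values.
vals <- c(0, NA, NaN, Inf, -Inf)

# Apply each test function to the whole vector; the result is a
# logical matrix with one row per value and one column per function.
checks <- sapply(
  list(is.finite = is.finite, is.infinite = is.infinite,
       is.na = is.na, is.nan = is.nan),
  function(f) f(vals)
)
rownames(checks) <- c("0", "NA", "NaN", "Inf", "-Inf")

checks
```

The row for NaN is the one most worth looking at: is.na() returns TRUE for NaN, while is.nan() returns FALSE for NA.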
The sinh-arcsinh normal distribution
This month’s issue of Significance magazine has a very nice summary article on the sinh-arcsinh normal distribution. (Unfortunately, the article seems to be behind a paywall.) This distribution was first introduced by Chris Jones and Arthur Pewsey in 2009 as a generalization of the normal distribution. While the normal distribution is symmetric...
2991 sym R (886 sym/4 pcs) 61 img
Two interesting facts about high-dimensional random projections
John Cook recently wrote an interesting blog post on random vectors and random projections. In the post, he states two surprising facts of high-dimensional geometry and gives some intuition for the second fact. In this post, I will provide R code to demonstrate both of them. Fact 1: Two randomly chosen vectors in a high-dimensional space are very...
3371 sym R (927 sym/5 pcs) 28 img