Publications by kjytay

Two interesting facts about high-dimensional random projections

16.04.2019

John Cook recently wrote an interesting blog post on random vectors and random projections. In the post, he states two surprising facts of high-dimensional geometry and gives some intuition for the second fact. In this post, I will provide R code to demonstrate both of them. Fact 1: Two randomly chosen vectors in a high-dimensional space are very...

3371 sym R (927 sym/5 pcs) 28 img
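
The excerpt above is cut off before it states Fact 1 in full. Assuming it refers to the well-known result that two random vectors in high dimensions are very likely to be nearly orthogonal, here is a minimal sketch of how one might check that in R. The helper names `cosine_sim` and `random_cosines`, and the choice of standard normal entries, are my own assumptions and not taken from the post.

```r
# Sketch: cosine similarity between pairs of random Gaussian vectors.
# Assumption: "randomly chosen" means i.i.d. standard normal coordinates.
set.seed(1)

cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Draw n_pairs pairs of random vectors in dimension d and
# return the cosine similarity of each pair.
random_cosines <- function(d, n_pairs = 500) {
  replicate(n_pairs, cosine_sim(rnorm(d), rnorm(d)))
}

cos_low  <- random_cosines(d = 3)
cos_high <- random_cosines(d = 1000)

# The cosines concentrate around 0 as the dimension grows
sd(cos_low)
sd(cos_high)
```

A cosine similarity close to 0 means the two vectors are close to orthogonal, so a much smaller spread in `cos_high` than in `cos_low` is what the fact would predict.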

Probability of winning a best-of-7 series

22.04.2019

The NBA playoffs are in full swing! A total of 16 teams are competing in a playoff-format competition, with the winner of each best-of-7 series moving on to the next round. In each matchup, two teams play 7 basketball games against each other, and the team that wins more games progresses. Of course, we often don’t have to play all 7 games: we c...

3931 sym R (1493 sym/5 pcs) 26 img
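
As a rough illustration of the series calculation described above, here is a minimal sketch that assumes games are independent and that the team wins each game with the same probability `p`. The function name `series_win_prob` is hypothetical. It relies on the fact that winning a best-of-7 series is equivalent to winning at least 4 games if all 7 were played.

```r
# Probability of winning a best-of-n series (n odd), assuming each game is
# won independently with the same probability p.
series_win_prob <- function(p, n = 7) {
  wins_needed <- (n + 1) / 2
  # P(at least wins_needed wins out of n games) = P(X > wins_needed - 1)
  pbinom(wins_needed - 1, size = n, prob = p, lower.tail = FALSE)
}

series_win_prob(0.5)
series_win_prob(0.6)
```

Playing out the unnecessary games does not change who wins the series, which is why the binomial tail probability gives the right answer even though real series often end early.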

Probability of winning a best-of-7-series (part 2)

25.04.2019

In this previous post, I explored the probability that a team wins a best-of-n series, given that its win probability for any one game is some constant p. As one commenter pointed out, most sports models consider the home team to have an advantage, and this home advantage should affect the probability of winning a series. In this post, I will expl...

4650 sym R (4242 sym/8 pcs) 39 img
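
One way to sketch the home-advantage idea is to let the per-game win probability depend on the venue and enumerate every possible sequence of results. Everything named here is an assumption for illustration: the function `series_win_prob_home`, the parameters `p_home` and `p_away`, and the 2-2-1-1-1 home/away schedule. It is not necessarily how the post does it.

```r
# Series win probability when the per-game win probability depends on venue.
# Assumptions: games are independent; home_games marks which games the team
# plays at home (default is a 2-2-1-1-1 schedule for the higher seed).
series_win_prob_home <- function(p_home, p_away,
                                 home_games = c(TRUE, TRUE, FALSE, FALSE,
                                                TRUE, FALSE, TRUE)) {
  n <- length(home_games)
  p_game <- ifelse(home_games, p_home, p_away)

  # All 2^n win/loss patterns (1 = win, 0 = loss), one pattern per row
  outcomes <- as.matrix(expand.grid(rep(list(0:1), n)))

  # Probability of each pattern under independent games
  probs <- apply(outcomes, 1, function(w) {
    prod(ifelse(w == 1, p_game, 1 - p_game))
  })

  # The team wins the series if it wins more than half of the games
  sum(probs[rowSums(outcomes) > n / 2])
}

series_win_prob_home(p_home = 0.7, p_away = 0.5)
```

Setting `p_home` equal to `p_away` recovers the constant-probability case, which is a handy sanity check.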

Sampling paths from a Gaussian process

07.07.2019

Gaussian processes are a widely employed statistical tool because of their flexibility and computational tractability. (For instance, one recent area where Gaussian processes are used is in machine learning for hyperparameter optimization.) A stochastic process is a Gaussian process if (and only if) any finite subcollection of random variables ...

5061 sym R (4137 sym/9 pcs) 102 img
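
Because a Gaussian process restricted to finitely many points is just a multivariate normal vector, sampling a path on a grid reduces to one multivariate normal draw. The sketch below assumes a zero mean function and a squared exponential kernel. The names `sq_exp_kernel` and `sample_gp_paths`, the length scale and the jitter term are my own choices, not taken from the post.

```r
set.seed(42)

# Squared exponential kernel (assumed; other kernels work the same way)
sq_exp_kernel <- function(x1, x2, length_scale = 1) {
  exp(-outer(x1, x2, "-")^2 / (2 * length_scale^2))
}

# Sample n_paths paths of a zero-mean GP at the points in x
sample_gp_paths <- function(x, n_paths = 3, length_scale = 1, jitter = 1e-6) {
  # Small diagonal jitter keeps the covariance numerically positive definite
  K <- sq_exp_kernel(x, x, length_scale) + diag(jitter, length(x))
  L <- t(chol(K))  # lower-triangular factor with K = L %*% t(L)
  z <- matrix(rnorm(length(x) * n_paths), nrow = length(x))
  L %*% z          # each column is one sampled path
}

x <- seq(0, 5, length.out = 100)
paths <- sample_gp_paths(x)
matplot(x, paths, type = "l", lty = 1, ylab = "f(x)")
```

If `z` has independent standard normal entries, then `L %*% z` has covariance `K`, which is exactly the finite-dimensional distribution the definition asks for.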

Looking at flood insurance claims with choroplethr

14.07.2019

I recently learned how to use the choroplethr package through a short tutorial by the package author Ari Lamstein (YouTube link here). To cement what I learned, I thought I would use this package to visualize flood insurance claims. I am using the FIMA NFIP redacted claims dataset from FEMA, and it contains more than 2 million claims transactions...

7870 sym R (5852 sym/17 pcs) 32 img

Be careful of NA/NaN/Inf values when using base R’s plotting functions!

13.08.2019

I was recently working on a supervised learning problem (i.e. building a model using some features to predict some response variable) with a fairly large dataset. I used base R’s plot and hist functions for exploratory data analysis and all looked well. However, when I started building my models, I began to run into errors. For example, when tr...

1890 sym R (550 sym/3 pcs) 10 img
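
A simple guard against the situation described above is to count the non-finite values in each numeric column before plotting or modelling. This is a minimal sketch. The helper `count_non_finite` and the toy data frame are hypothetical and not from the post.

```r
# Count NA / NaN / Inf / -Inf values in each numeric column of a data frame
count_non_finite <- function(df) {
  num_cols <- df[vapply(df, is.numeric, logical(1))]
  vapply(num_cols, function(col) sum(!is.finite(col)), integer(1))
}

# Toy data with one NA, one Inf and one NaN
df <- data.frame(
  x = c(1, 2, NA, 4, Inf),
  y = c(0.5, NaN, 1.5, 2, 2.5)
)

count_non_finite(df)
```

`is.finite()` returns `FALSE` for `NA`, `NaN`, `Inf` and `-Inf` alike, so a single check covers all of the problem values named in the title.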

Changing the variable inside an R formula

23.08.2019

I recently encountered a situation where I wanted to run several linear models, but where the response variables would depend on previous steps in the data analysis pipeline. Let me illustrate using the mtcars dataset:

data(mtcars)
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 ...

2095 sym R (1837 sym/7 pcs)
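
One possible way to fit the same model for several response variables chosen at run time is to build each formula from strings with `stats::reformulate()`. This is only a sketch and not necessarily the approach the post takes. The responses (`mpg`, `hp`) and predictors (`wt`, `cyl`) are arbitrary choices for illustration.

```r
data(mtcars)

# Response variables that might only be known at run time (assumed here)
responses  <- c("mpg", "hp")
predictors <- c("wt", "cyl")

# Build a formula such as mpg ~ wt + cyl for each response and fit a model
fits <- lapply(responses, function(resp) {
  f <- reformulate(predictors, response = resp)
  lm(f, data = mtcars)
})
names(fits) <- responses

coef(fits$mpg)
```

An alternative with the same effect is `as.formula(paste(resp, "~", paste(predictors, collapse = " + ")))`, and `reformulate()` mostly saves the string pasting.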

Visualizing the relationship between multiple variables

24.08.2019

Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package does this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical. For all the code in this post in one file, click here. The GGally::ggpairs()...

3676 sym R (6267 sym/8 pcs) 14 img

Mixing up R markdown shortcut keys in RStudio, or how to unfold all chunks

25.08.2019

When using R markdown in RStudio, I like to insert a new chunk using the shortcut Cmd+Option+I. Unfortunately I often press the wrong key instead of “I” and end up folding all the chunks, getting something like this: It often takes me a while (on Google) to figure out what I did and how to undo it. With this note to remind me, no longer!! The shortc...

930 sym 2 img

Lesser known dplyr functions

30.08.2019

The dplyr package is an essential tool for manipulating data in R. The “Introduction to dplyr” vignette gives a good overview of the common dplyr functions (list taken from the vignette itself): filter() to select cases based on their values. arrange() to reorder the cases. select() and rename() to select variables based on their names....

7054 sym R (3791 sym/12 pcs)