Publications by kjytay
A simple probabilistic algorithm for estimating the number of distinct elements in a data stream
I just came across a really interesting and simple algorithm for estimating the number of distinct elements in a stream of data. The paper (Chakraborty et al. 2023) is available on arXiv; see this Quanta article (Reference 2) for a layman’s explanation. Problem statement: Let’s state the problem formally. Suppose we are given a stream where...
5960 sym R (1044 sym/3 pcs) 87 img
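The excerpt above stops before describing the algorithm, so here is a minimal sketch of a buffer-and-halve estimator in the spirit of Chakraborty et al. (2023), written from a general understanding of that paper rather than from the post itself. The function name `estimate_distinct` and the parameter `thresh` (the buffer size) are hypothetical names chosen for illustration.

```r
# Sketch only: a buffer-and-halve estimator for the number of distinct
# elements in a stream. `estimate_distinct` and `thresh` are illustrative
# names, not taken from the post.
estimate_distinct <- function(stream, thresh) {
  p <- 1                    # current sampling probability
  buffer <- character(0)    # sampled elements seen so far
  for (a in as.character(stream)) {
    # Forget any earlier copy of `a`, then re-admit it with probability p
    buffer <- buffer[buffer != a]
    if (runif(1) < p) buffer <- c(buffer, a)
    # When the buffer fills up, drop each element with probability 1/2
    # and halve the sampling probability
    if (length(buffer) >= thresh) {
      buffer <- buffer[runif(length(buffer)) < 0.5]
      p <- p / 2
      if (length(buffer) >= thresh) stop("buffer still full after halving")
    }
  }
  length(buffer) / p        # scale the sample size back up
}

# When the buffer never fills, p stays at 1 and the count is exact
estimate_distinct(c("a", "b", "a", "c", "b"), thresh = 10)  # 3
```

With a small `thresh` relative to the number of distinct elements, the output becomes a random estimate rather than an exact count, and the sketch can stop with an error in the unlucky case where halving removes nothing.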
Understanding leaf node numbers when using rpart and rpart.rules
I recently ran into an issue with matching rules from a decision tree (output of rpart.plot::rpart.rules()) with leaf node numbers from the tree object itself (output of rpart::rpart()). This post explains the issue and how to solve it. First, let’s build a decision tree model and print its tree representation: library(rpart.plot) data(ptitan...
3416 sym R (2980 sym/8 pcs) 8 img
A short note on the startsWith function
The startsWith function comes with base R, and determines whether entries of an input start with a given prefix. (The endsWith function does the same thing but for suffixes.) The following code checks if each of “ant”, “banana” and “balloon” starts with “a”: startsWith(c("ant", "banana", "balloon"), "a") # [1] TRUE FALSE FALSE ...
1676 sym R (465 sym/4 pcs)
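As a small companion to the excerpt above, the following sketch runs `startsWith` and `endsWith` on the same three words. The variable names are illustrative, and the multi-character prefix case is an addition not shown in the excerpt.

```r
words <- c("ant", "banana", "balloon")

# Prefix check from the excerpt: only "ant" starts with "a"
starts_a <- startsWith(words, "a")    # TRUE FALSE FALSE

# Prefixes can be longer than one character
starts_ba <- startsWith(words, "ba")  # FALSE TRUE TRUE

# endsWith is the suffix counterpart: only "balloon" ends with "n"
ends_n <- endsWith(words, "n")        # FALSE FALSE TRUE

# The logical vector can be used directly for subsetting
words[starts_ba]                      # "banana" "balloon"
```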
A quirk when using data.table?
I recently came across this quirk in using data.table that I don’t really have a clean solution for. I outline the issue below as well as my current way around it. Appreciate any better solutions! The problem surfaces quite generally, but I’ll illustrate it by trying to achieve the following task: write a function that takes a data table and ...
1818 sym R (892 sym/7 pcs) 4 img
Slight inconsistency between forcats’ fct_lump_min and fct_lump_prop
I recently noticed a slight inconsistency between the forcats package’s fct_lump_min and fct_lump_prop functions. (I’m working with v0.5.1, which is the latest version at the time of writing.) These functions lump levels that meet a certain criterion into an “other” level. According to the documentation, fct_lump_min “lumps levels that ...
1999 sym R (422 sym/4 pcs)
Draft position for players in the NBA for the 2020-21 season
When the 2022 NBA draft happened almost a month ago, I thought to myself: do players picked earlier in the draft (i.e. higher-ranked) actually end up having better/longer careers? If data wasn’t an issue, the way I would do it would be to look at players chosen in the draft (top 60 picks) in the past 10/20 years. For each player, I woul...
5297 sym R (3881 sym/11 pcs) 14 img
Math expressions in R plots
Did you know that you can include math expressions in your text annotations of figures in R? This webpage has an excellent tutorial on how to do so. In essence, you can put your plot label (with some syntax) in the function expression(). Here’s a little example of that: par(mar = c(5, 5, 4, 2)) x <- seq(0, 5, length.out = 500) plot(x, sin(x^2...
682 sym R (183 sym/1 pcs) 2 img
Two common mistakes with the colon operator in R
R has a colon operator which makes it really easy to define a sequence of integers. For example, the code 1:10 generates a vector consisting of the integers from 1 to 10 (inclusive). However, using the colon operator is not without its pitfalls! I will highlight two common mistakes here. First, imagine that you have a variable n which has valu...
2498 sym R (218 sym/6 pcs) 8 img
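The excerpt above is cut off before it names the two mistakes, so the sketch below shows two widely known colon-operator pitfalls in R. They may or may not be the ones the post covers. The variable names are illustrative.

```r
# Pitfall A: 1:n counts DOWN when n is 0, instead of giving an empty vector
n <- 0
down <- 1:n            # 1 0
safe <- seq_len(n)     # integer(0), so a for loop over it runs zero times

# Pitfall B: the colon binds more tightly than binary minus
m <- 5
shifted  <- 1:m - 1    # parsed as (1:m) - 1, giving 0 1 2 3 4
intended <- 1:(m - 1)  # 1 2 3 4
```

Using `seq_len()` (or `seq_along()` for an existing vector) instead of `1:n` avoids the first problem, and explicit parentheses avoid the second.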
Is soccer more a game of chance, or a game of skill?
With the FIFA World Cup in my recent memory and the English Premier League (EPL) kicking off this Friday (see here for match schedules), I’ve been thinking a bit about the mathematics/statistics of the beautiful game. In this post, I want to answer the following question: Is soccer more a game of chance, or a game of skill? I’m interested in...
4453 sym 8 img
Clustering EPL teams using k-means clustering
I recently got hold of team rankings for the English Premier League (EPL) for the last 10 years (data was manually recorded from this Google sheet, available here in .csv format). I thought this would be a good opportunity to test out clustering of EPL teams and to answer the question: is there a group of teams which is a cut above the rest? A...
3545 sym 14 img