Publications by r on Tony ElHabr
Comparing Variable Importance Functions (For Modeling)
I’ve been doing some machine learning recently, and one thing that keeps popping up is the need to explain the models and their components. There are a variety of ways to go about explaining model features, but probably the most common approach is to use variable (or feature) importance scores. Unfortunately, computing variable importance score...
14041 sym R (16346 sym/1 pcs) 10 img
S3 Classes and {vctrs} to Create a Soccer Pitch Control Model
Intro There’s never been a better time to be involved in sports analytics. There is a wealth of open-sourced data and code (not to mention well-researched and public analysis) to digest and use. Both people working for teams and people just doing it as a hobby are publishing new and interesting analyses every day. In particular, the FriendsOfTr...
17763 sym R (6101 sym/11 pcs) 14 img
Decomposition and Smoothing with data.table, reticulate, and spatstat
While reading up on modern soccer analytics (I’ve had an itch for soccer and tracking data recently), I stumbled upon an excellent set of tutorials written by Devin Pleuler. In particular, his notebook on non-negative matrix factorization (NNMF) caught my eye. I hadn’t really heard of the concept before, but it turned out to be much less daunt...
11930 sym R (7982 sym/11 pcs) 10 img
Fantasy Football and the Classical Scheduling Problem
Introduction Every year I play in several fantasy football (American) leagues. For those who are unaware, it’s a game that occurs every year in sync with the National Football League (NFL) where participants play in weekly head-to-head games as general managers of virtual football teams. (Yes, it’s very silly.) The winner at the end of the se...
7635 sym R (5586 sym/7 pcs) 4 img 2 tbl
Quantifying Relative Soccer League Strength
Introduction Arguing about domestic league strength is something that soccer fans seem to never tire of. (“Could Messi do it on a cold rainy night in Stoke?”) Many of these conversations are anecdotal, leading to “hot takes” that are unfalsifiable. While we’ll probably never move away from these kinds of discussions, we can at least tr...
12171 sym R (9068 sym/9 pcs) 10 img
Tired: PCA + kmeans, Wired: UMAP + GMM
Introduction Combining principal component analysis (PCA) and kmeans clustering seems to be a pretty popular 1-2 punch in data science. While there is some debate about whether combining dimensionality reduction and clustering is something we should ever do[1], I’m not here to debate that. I’m here to illustrate the potential advantages of upgr...
11822 sym R (5912 sym/8 pcs) 26 img