Publications by David Lindelöf
How Are P-values Distributed Under The Null?
I sometimes use this fun interview question for aspiring data scientists: How are p-values distributed assuming the null hypothesis is true? I’ve heard a lot of reasonable answers, including: “It should be centered towards large values”; “It should have almost zero mass below 0.05”; “It depends on the model”; “It depends on the null hypothesis”. All very...
3997 sym R (402 sym/2 pcs) 6 img
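The interview question above can be explored with a small simulation. What follows is only a sketch, not the post's own code: it assumes a two-sample Student t-test on normal data with equal variances, and the seed, `n_sims`, and `n_obs` are arbitrary illustrative choices. For a continuous test statistic and a simple null hypothesis, the p-values should come out uniform on [0, 1].

```r
# Simulate many two-sample t-tests in which the null hypothesis is true
# (both groups are drawn from the same normal distribution).
set.seed(42)      # arbitrary seed, for reproducibility only
n_sims <- 10000   # hypothetical number of simulated experiments
n_obs  <- 30      # hypothetical sample size per group

p_values <- replicate(n_sims, {
  x <- rnorm(n_obs)
  y <- rnorm(n_obs)
  t.test(x, y, var.equal = TRUE)$p.value
})

# Under a true null, roughly 5% of p-values should fall below 0.05
frac_below_005 <- mean(p_values < 0.05)
print(frac_below_005)

# Compare the empirical distribution with Uniform(0, 1)
print(ks.test(p_values, "punif"))

# A flat histogram is what a uniform distribution looks like
hist(p_values, breaks = 20, main = "p-values under the null", xlab = "p-value")
```

The fraction below 0.05 is the false positive rate of the test, which is why it should sit close to the nominal 5%.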
Your Classifier Is Broken, But It Is Still Useful
When you run a binary classifier over a population you get an estimate of the proportion of true positives in that population. This is known as the prevalence. But that estimate is biased, because no classifier is perfect. For example, if your classifier tells you that 20% of cases are positive, but its precision is known to be only 50%, you w...
3958 sym R (670 sym/1 pcs) 2 img
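One way to see how such a correction might work is the sketch below. It is not necessarily the method the post uses: `adjust_prevalence` is a hypothetical helper, and it assumes that both the precision and the recall of the classifier are known from a representative labelled sample. The predicted-positive rate times the precision gives the share of true positives in the population; dividing by the recall accounts for the positives the classifier missed.

```r
# Hypothetical helper: turn a classifier's predicted-positive rate into a
# prevalence estimate, given its precision and recall (both assumed known).
adjust_prevalence <- function(predicted_rate, precision, recall) {
  stopifnot(predicted_rate >= 0, predicted_rate <= 1,
            precision >= 0, precision <= 1,
            recall > 0, recall <= 1)
  # Share of the population that is both flagged and truly positive
  true_positive_share <- predicted_rate * precision
  # Scale up to include the true positives the classifier did not flag
  true_positive_share / recall
}

# 20% flagged, 50% precision, and an assumed (illustrative) recall of 80%
adjust_prevalence(predicted_rate = 0.20, precision = 0.50, recall = 0.80)
```

Precision and recall themselves depend on the population they were measured on, so this sketch is only reasonable when that population resembles the one being scored.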
Is The Ratio of Normal Variables Normal?
In Trustworthy Online Controlled Experiments I came across this quote, referring to a ratio metric $M = \frac{X}{Y}$, which states that: Because $X$ and $Y$ are jointly bivariate normal in the limit, $M$, as the ratio of the two averages, is also normally distributed. That’s only partially true. According to https://en.wikipedia.org/wiki/Ratio_...
1982 sym R (405 sym/7 pcs) 8 img
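A quick simulation can illustrate why the claim is only partially true. This is a sketch under assumed parameters, not the post's own analysis: the means, standard deviations, and the `tail_fraction` helper are all illustrative. The ratio of two independent zero-mean normals is Cauchy distributed, with very heavy tails, whereas the ratio behaves much more like a normal variable when the denominator stays far from zero.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 100000

# Case 1: both variables centered at zero, so the ratio is standard Cauchy
ratio_zero_mean <- rnorm(n) / rnorm(n)

# Case 2: denominator many standard deviations away from zero
x <- rnorm(n, mean = 10, sd = 1)
y <- rnorm(n, mean = 20, sd = 1)
ratio_far_from_zero <- x / y

# Hypothetical helper: share of observations more than k robust standard
# deviations (MAD-based) away from the median. Near zero for normal data.
tail_fraction <- function(v, k = 5) {
  mean(abs(v - median(v)) > k * mad(v))
}

tail_zero_mean     <- tail_fraction(ratio_zero_mean)
tail_far_from_zero <- tail_fraction(ratio_far_from_zero)
print(c(zero_mean = tail_zero_mean, far_from_zero = tail_far_from_zero))
```

A large tail fraction in the first case and a negligible one in the second suggests that normality of the ratio depends on how close the denominator can get to zero.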
Controlling for covariates is not the same as “slicing”
To detect small effects in experiments you need to reduce the experimental noise as much as possible. You can do it by working with larger sample sizes, but that doesn’t scale well. A far better approach consists in controlling for covariates that are correlated with your response. I recently gave a talk at our company on the design of online exp...
3839 sym R (1850 sym/6 pcs) 4 img
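The noise reduction from controlling for a covariate can be sketched with simulated data. The effect size, the covariate strength, and the variable names below are assumptions made for illustration and do not come from the post. The example compares the standard error of the treatment effect with and without the covariate in the regression.

```r
set.seed(123)   # arbitrary seed, for reproducibility only
n <- 2000       # hypothetical number of experimental units

treatment <- rbinom(n, 1, 0.5)   # random assignment to control (0) or treatment (1)
covariate <- rnorm(n)            # e.g. a pre-experiment measurement
# Assumed data-generating process: small treatment effect, strong covariate
response <- 0.1 * treatment + 2 * covariate + rnorm(n)

fit_naive    <- lm(response ~ treatment)
fit_adjusted <- lm(response ~ treatment + covariate)

# Standard error of the estimated treatment effect under each model
se_naive    <- coef(summary(fit_naive))["treatment", "Std. Error"]
se_adjusted <- coef(summary(fit_adjusted))["treatment", "Std. Error"]
print(c(naive = se_naive, adjusted = se_adjusted))
```

Because the covariate explains most of the variance in the response here, the adjusted model should report a noticeably smaller standard error for the same sample size.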
Bayesian tanks
The frequentist vs Bayesian debate has plagued the scientific community for almost a century now, yet most of the arguments I’ve seen seem to involve philosophical considerations instead of hard data. Instead of letting the sun explode, I propose a simpler experiment to assess the performance of each approach. The problem reads as follows (tak...
3345 sym R (679 sym/4 pcs) 4 img
Book review: Advanced R
I would like to call this the best second book on R, except that I wouldn’t know what the first one would be. I learned R from classes and tutorials about 10 years ago, used it on my PhD and four articles, and use it today on a daily basis at work; yet only now, after reading this book, do I feel like I could possibly be called an R programmer ...
1944 sym 4 img
Biblical kings and boxplots
When you read through the biblical books of Kings, you may have been struck by a phrase that repeats itself for every monarch: In the Xth year of (king of kingdom B), (name of king) became king of (kingdom A). He reigned N years, and did (evil|good) in the sight of the Lord. If you’ve read through these books several times, you will probably ...
2291 sym R (985 sym/4 pcs) 2 img
The opinionated estimator
You have been lied to. By me. I once taught a programming class and introduced my students to the notion of an unbiased estimator of the variance of a population. The problem can be stated as follows: given a set of observations $(x_1, x_2, \ldots, x_n)$, what can you say about the variance of the population from which this sample is drawn? Classic...
3488 sym R (688 sym/5 pcs) 4 img
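The notion of an unbiased variance estimator can be checked numerically. The sketch below is not taken from the post; the sample size, true variance, and number of simulations are arbitrary assumptions. It repeatedly draws small normal samples and averages two estimators: the sum of squared deviations divided by $n$, and the same sum divided by $n - 1$.

```r
set.seed(2024)      # arbitrary seed, for reproducibility only
n        <- 5       # hypothetical sample size
true_var <- 4       # variance of the simulated population
n_sims   <- 20000   # hypothetical number of repeated samples

estimates <- replicate(n_sims, {
  x  <- rnorm(n, mean = 0, sd = sqrt(true_var))
  ss <- sum((x - mean(x))^2)   # sum of squared deviations from the sample mean
  c(divide_by_n = ss / n, divide_by_n_minus_1 = ss / (n - 1))
})

# Average of each estimator across all simulated samples
average_estimate <- rowMeans(estimates)
print(average_estimate)
```

Dividing by $n - 1$ should average close to the true variance, while dividing by $n$ should underestimate it by a factor of $(n - 1)/n$.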
Connecting to SQL Server from R on a Mac with a Windows domain user
Connecting to an SQL Server instance as a Windows domain user is relatively straightforward when you run R on Windows, you have the right ODBC driver installed, and your network is set up properly. You normally don’t need to supply credentials, because the ODBC driver uses the built-in Windows authentication scheme. Assuming your odbcinst.ini fi...
2175 sym R (457 sym/3 pcs)
R collation order
You need to declare generic functions in S4 before you can define methods for them. If no definition exists you will see the following error: > setClass("track", slots = c(x="numeric", y = "numeric")) > setMethod("foo", signature(x = "track")) Error in setMethod("foo", signature(x = "track"), definition = function(x) cat(x)) : no existing defi...
2360 sym R (503 sym/4 pcs)
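The declare-before-define rule for S4 can be illustrated with a minimal sketch. It reuses the `track` class and the generic name `foo` from the excerpt, but the body of the method and the object `tr` are assumptions for illustration. Calling `setGeneric()` first creates the generic function that `setMethod()` then attaches a method to.

```r
library(methods)

setClass("track", slots = c(x = "numeric", y = "numeric"))

# Declare the generic first; setMethod() has nothing to attach to otherwise
setGeneric("foo", function(x) standardGeneric("foo"))

# With the generic in place, defining the method succeeds.
# The method body is illustrative: it reports the length of the x slot.
setMethod("foo", signature(x = "track"),
          definition = function(x) cat("track with", length(x@x), "points\n"))

tr <- new("track", x = c(1, 2, 3), y = c(4, 5, 6))
foo(tr)
```

In a package, this ordering also has to hold across files, so the file that declares the generic must be loaded before the files that define its methods.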