Publications by Nina Zumel

Simpson’s Paradox in a Logistic Regression

06.02.2025

Simpson’s paradox is when a trend that is present in various groups of data seems to disappear or even reverse when those groups are combined. One sees examples of this often in things like medical trials, and the phenomenon is generally due to one or more unmodelled confounding variables, or perhaps differing causal assumptions. As part of a pro...

7591 sym R (5130 sym/12 pcs) 14 img 7 tbl

Dyson’s Algorithm: The General Case

23.01.2025

Photo by Marco Verch, CC-2.0. In a previous post, we looked at Dyson’s algorithm (Dyson 1946) for solving the Coins in Weighings problem: You have coins, to appearance exactly identical; but possibly one is counterfeit (we’ll call it a “dud”). You do not know if a dud is present, nor whether it is heavier or lighter than the go...

7880 sym R (2469 sym/21 pcs) 68 img 4 tbl

Dyson’s Algorithm for the Twelve Coins Problem

09.01.2025

Photo by Marco Verch, CC-2.0. The Twelve Coins Problem is a notoriously hard problem that comes in many flavors. I don’t know where it comes from originally, but it garnered quite a bit of attention from mathematicians in the mid-twentieth century. Apparently some versions of it may have even distracted scientists away from their defense ...

9637 sym R (1802 sym/16 pcs) 100 img 3 tbl

Post-hoc Adjustment for Zero-Thresholded Linear Models

16.08.2024

Suppose you are modeling a process that you believe is well approximated as being linear in its inputs, but only within a certain range. Outside that range, the output might saturate or threshold: for example, if you are modeling a count or a physical process, you likely can never get a negative outcome. Similarly, a process can saturate to an upper ...

7276 sym R (7540 sym/14 pcs) 12 img 1 tbl

My Favorite Graphs

05.12.2011

The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all. – William Cleveland, The Elements of Graphing Data, Chapter 2 In this article, I will discuss some graphs tha...

6435 sym R (1691 sym/10 pcs) 12 img

Modeling Trick: Impact Coding of Categorical Variables with Many Levels

23.07.2012

One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — name...

5484 sym R (6163 sym/12 pcs) 4 img

Error Handling in R

09.10.2012

It’s often the case that I want to write an R script that loops over multiple datasets, or different subsets of a large dataset, running the same procedure over them: generating plots, or fitting a model, perhaps. I set the script running and turn to another task, only to come back later and find the loop has crashed partway through, on an unan...

11058 sym 2 img

Revisiting Cleveland’s The Elements of Graphing Data in ggplot2

18.02.2013

I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) g...

9462 sym R (4547 sym/12 pcs) 20 img

Bayesian and Frequentist Approaches: Ask the Right Question

06.05.2013

It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you have the right question, t...

12592 sym Python (3658 sym/6 pcs) 14 img

Big News! Practical Data Science with R is content complete!

19.12.2013

The last appendix has gone to the editors; the book is now content complete. What a relief! We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, if you haven’t yet. The link is here. We look forward to sharing the fin...

950 sym 2 img