Publications by Nina Zumel

Post-hoc Adjustment for Zero-Thresholded Linear Models

16.08.2024

Suppose you are modeling a process that you believe is well approximated as being linear in its inputs, but only within a certain range. Outside that range, the output might saturate or threshold: for example, if you are modeling a count or a physical process, you likely can never get a negative outcome. Similarly, a process can saturate to an upper ...

7276 sym R (7540 sym/14 pcs) 12 img 1 tbl

My Favorite Graphs

05.12.2011

“The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.” – William Cleveland, The Elements of Graphing Data, Chapter 2

In this article, I will discuss some graphs tha...

6435 sym R (1691 sym/10 pcs) 12 img

Modeling Trick: Impact Coding of Categorical Variables with Many Levels

23.07.2012

One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — name...

5484 sym R (6163 sym/12 pcs) 4 img

Error Handling in R

09.10.2012

It’s often the case that I want to write an R script that loops over multiple datasets, or different subsets of a large dataset, running the same procedure over them: generating plots, or fitting a model, perhaps. I set the script running and turn to another task, only to come back later and find the loop has crashed partway through, on an unan...

11058 sym 2 img

Revisiting Cleveland’s The Elements of Graphing Data in ggplot2

18.02.2013

I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) g...

9462 sym R (4547 sym/12 pcs) 20 img

Bayesian and Frequentist Approaches: Ask the Right Question

06.05.2013

It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you have the right question, t...

12592 sym Python (3658 sym/6 pcs) 14 img

Big News! Practical Data Science with R is content complete!

19.12.2013

The last appendix has gone to the editors; the book is now content complete. What a relief! We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, if you haven’t yet. The link is here. We look forward to sharing the fin...

950 sym 2 img

The Extra Step: Graphs for Communication versus Exploration

12.01.2014

Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical. One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of ...

8740 sym R (5436 sym/9 pcs) 18 img

The Statistics behind “Verification by Multiplicity”

01.03.2014

There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope. We normally don’t write about science here at Win-Vector, but we do somet...

2934 sym 2 img

Practical Data Science with R: Release date announced

25.03.2014

It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in a...

1012 sym 4 img