Publications by Nina Zumel
Post-hoc Adjustment for Zero-Thresholded Linear Models
Suppose you are modeling a process that you believe is well approximated as being linear in its inputs, but only within a certain range. Outside that range, the output might saturate or threshold: for example, if you are modeling a count or a physical process, you likely can never get a negative outcome. Similarly, a process can saturate to an upper ...
7276 sym R (7540 sym/14 pcs) 12 img 1 tbl
My Favorite Graphs
“The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.” – William Cleveland, The Elements of Graphing Data, Chapter 2
In this article, I will discuss some graphs tha...
6435 sym R (1691 sym/10 pcs) 12 img
Modeling Trick: Impact Coding of Categorical Variables with Many Levels
One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — name...
5484 sym R (6163 sym/12 pcs) 4 img
Error Handling in R
It’s often the case that I want to write an R script that loops over multiple datasets, or different subsets of a large dataset, running the same procedure over them: generating plots, or fitting a model, perhaps. I set the script running and turn to another task, only to come back later and find the loop has crashed partway through, on an unan...
11058 sym 2 img
Revisiting Cleveland’s The Elements of Graphing Data in ggplot2
I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) g...
9462 sym R (4547 sym/12 pcs) 20 img
Bayesian and Frequentist Approaches: Ask the Right Question
It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you have the right question, t...
12592 sym Python (3658 sym/6 pcs) 14 img
Big News! Practical Data Science with R is content complete!
The last appendix has gone to the editors; the book is now content complete. What a relief! We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, if you haven’t yet. The link is here. We look forward to sharing the fin...
950 sym 2 img
The Extra Step: Graphs for Communication versus Exploration
Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical. One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of ...
8740 sym R (5436 sym/9 pcs) 18 img
The Statistics behind “Verification by Multiplicity”
There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope. We normally don’t write about science here at Win-Vector, but we do somet...
2934 sym 2 img
Practical Data Science with R: Release date announced
It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in a...
1012 sym 4 img