Publications by Jacob Simmering
The Problem with Propensity Scores
Are Propensity Scores Useful? Effect estimation for treatments using observational data isn't always straightforward. For example, it is very common that patients who are treated with a certain medication or procedure are healthier than those who are not treated. Those who aren't treated may not be treated due to a higher risk of treatment in thei...
11784 sym R (8187 sym/21 pcs) 16 img
Parallel Simulation of Heckman Selection Model
One of the, if not the, fundamental problems in observational data analysis is the estimation of the value of the unobserved choice. If the \(i^{\text{th}}\) unit chooses the value of \(t\) on the basis of some factors \(\mathbf{x_i}\), which may include the \(E(u_i(t))\) for that unit, comparing the outcome (...
16327 sym R (4750 sym/9 pcs) 10 img
Easy Cross Validation in R with `modelr`
When estimating a model, the quality of the model fit will always be higher in-sample than out-of-sample. A model will always fit the data that it is trained on, warts and all, and may use those warts and statistical noise to make predictions. As a result, a model that performs very well on a data set may perform poorly when used more generally. ...
10702 sym R (4490 sym/7 pcs) 4 img
Using tidytext to make sentiment analysis easy
Last week I discovered the R package tidytext and its very nice e-book detailing usage. Julia Silge and David Robinson have significantly reduced the effort it takes for me to “grok” text mining by making it “tidy.” It certainly helped that a lot of the examples are from Pride and Prejudice and other books by Jane Austen, my most beloved ...
5879 sym R (4249 sym/7 pcs) 8 img
Inter-ocular trauma test
I’ve recently been thinking about the role statistics can play in answering questions. I think it came up on the NSSD podcast a few weeks ago. Basically, problems can be divided into three classes: those that don’t need statistics because the answer is obvious (problems without much confounding and a strong signal-to-noise ratio); those t...
3378 sym 2 img
readr::problems() returns tidy data!
A handy little trick I picked up today when using readr. Some background: I needed a mapping between ZIP Code Tabulation Areas (ZCTAs) and counties (to link to some urban/rural data). The Census Bureau provides a CSV-style table that includes information about each ZCTA (e.g., size, population, area by land/water type) and the FIPS codes for the s...
1893 sym R (1290 sym/2 pcs)
Poor Donald – his tweets keep getting more negative
Last summer, David Robinson did this interesting text analysis of Donald Trump’s tweets and found that the angrier ones came from Android (which Trump is known to use). But he didn’t consider how Trump’s emotional state varies over time and he certainly couldn’t have considered what the impact of the election and recent events would h...
3005 sym R (10273 sym/12 pcs) 8 img