Publications by John Mount

Level fit summaries can be tricky in R

01.10.2012

Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with it. When modeling you often encounter what are commonly called categorical variables, which are called factors in R. Possible values of categorical variable...

9784 sym 2 img 2 tbl

Please stop using Excel-like formats to exchange data

07.12.2012

I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day-to-day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain. I would like to make a plea to my fellow data scientists to stop using Ex...

8332 sym 2 img

Don’t use correlation to track prediction performance

22.02.2013

Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope after reading this you feel at least a small urge to double check your work and presentations to make sure you have not reported correlation where R-squared, likelihood or root me...

3792 sym R (516 sym/3 pcs) 2 img

A bit more on sample size

08.03.2013

In our article What is a large enough random sample? we pointed out that if you wanted to measure a proportion to an accuracy “a” with chance of being wrong of “d” then an idea was to guarantee you had a sample size of at least: This is the central question in designing opinion polls or running A/B tests. This estimate comes from a quick...

4709 sym R (839 sym/4 pcs) 12 img

Worry about correctness and repeatability, not p-values

05.04.2013

In data science work you often run into cryptic sentences like the following: Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear trend). (From “Associati...

16440 sym R (3437 sym/5 pcs) 4 img

Prefer = for assignment in R

23.04.2013

We share our opinion that = should be preferred to the more standard <- for assignment in R. This is from a draft of the appendix of our upcoming book. This has the risk of becoming an R version of JavaScript’s semicolon controversy, but here you have it. R has five common assignment operators: “=”, “<-”, “->”, “<<-” and “-...

2818 sym R (219 sym/2 pcs) 2 img

A pathological glm() problem that doesn’t issue a warning

01.05.2013

I know I have already written a lot about technicalities in logistic regression (see for example: How robust is logistic regression? and Newton-Raphson can compute an average). But I just ran into a simple case where R’s glm() implementation of logistic regression seems to fail without issuing a warning message. Yes, the data is a bit patholog...

3286 sym R (1721 sym/2 pcs) 2 img

Big News! “Practical Data Science with R” MEAP launched!

15.05.2013

Nina Zumel and I (John Mount) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered the Manning Early Access Program (MEAP), which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print. Please subscribe to our ...

1276 sym 4 img

What is “Practical Data Science with R”?

22.06.2013

A bit about our upcoming book “Practical Data Science with R”. Nina and I share our current draft of the front matter from the book, a description that will help you decide if this is the book for you (we hope that it is). Or this could be the book that helps explain what you do to others. What is Data Science? The statistician W...

8644 sym

Practical Data Science with R, deal of the day Aug 1 2013

31.07.2013

Deal of the Day August 1: Half off my book Practical Data Science with R. Use code dotd0801au at www.manning.com/zumel/

683 sym 2 img