Publications by John Mount

R has some sharp corners

15.05.2014

R is definitely our first choice go-to analysis system. In our opinion you really shouldn’t use something else until you have an articulated reason (be it a need for larger data scale, different programming language, better data source integration, or something else). The advantages of R are numerous: Single integrated work environment. Powe...

6628 sym 2 img

Save 45% on Practical Data Science with R (expires May 21, 2014)

16.05.2014

Please share this generous deal from Manning publications: save 45% on Practical Data Science with R through May 21, 2014. Please tweet, forward and share!

730 sym 2 img

How does Practical Data Science with R stand out?

02.06.2014

There are a lot of good books on statistics, machine learning, analytics, and R. So it is valid to ask: how does Practical Data Science with R stand out? Why should a data scientist or an aspiring data scientist buy it? We admit, it isn’t the only book we own. Some relevant books from the Win-Vector LLC company library include: And a few mo...

6365 sym 6 img

R style tip: prefer functions that return data frames

06.06.2014

While following up on Nina Zumel’s excellent Trimming the Fat from glm() Models in R I got to thinking about code style in R. And I realized: you can make your code much prettier by designing more of your functions to return data.frames. That may seem needlessly heavy-weight, but it has a lot of down-stream advantages. The usual mental model...

4558 sym 2 img

R minitip: don’t use data.matrix when you mean model.matrix

10.06.2014

A quick R mini-tip: don’t use data.matrix when you mean model.matrix. If you do so you may lose (without noticing) a lot of your model’s explanatory power (due to poor encoding). For some modeling tasks you end up having to prepare a special expanded data matrix before calling a given machine learning algorithm. For example the randomFores...

5364 sym 2 img 2 tbl

Frequentist inference only seems easy

01.07.2014

Two of the most common methods of statistical inference are frequentism and Bayesianism (see Bayesian and Frequentist Approaches: Ask the Right Question for some good discussion). In both cases we are attempting to perform reliable inference of unknown quantities from related observations. And in both cases inference is made possible by introdu...

30492 sym R (197 sym/1 pcs) 16 img

Automatic bias correction doesn’t fix omitted variable bias

04.07.2014

Page 94 of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd Edition (which we will call BDA3) provides a great example of what happens when common broad frequentist bias criticisms are over-applied to predictions from ordinary linear regression: the predictions appear to fall apart. BDA3 goes on to exhibit what migh...

21589 sym 10 img

Reading the Gauss-Markov theorem

26.08.2014

What is the Gauss-Markov theorem? From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition: A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of ha...

19338 sym R (2931 sym/1 pcs) 6 img

Factors are not first-class citizens in R

23.09.2014

The primary user-facing data types in the R statistical computing environment behave as vectors. That is: one-dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so on) but the most common data types are vectors. In fact vectors are so common in R...

20196 sym 6 img

Excel spreadsheets are hard to get right

18.11.2014

Any practicing data scientist is going to eventually have to work with data stored in a Microsoft Excel spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using Excel-like formats to exchange data. Bu...

6074 sym 16 img