Publications by Arthur Charpentier

Visualising a Classification in High Dimension, part 2

09.04.2015

A few weeks ago, I published a post on Visualising a Classification in High Dimension, based on principal component analysis to get a projection onto the first two components. Following that post, I was wondering what could be done for a classification on categorical covariates. A natural idea would be to consider a cor...


I Fought the (distribution) Law (and the Law did not win)

27.04.2015

A few days ago, I was asked whether we should spend a lot of time choosing the distribution used, in GLMs, for (actuarial) ratemaking. On that topic, I usually claim that the family is not the most important parameter in the regression model. Consider the following dataset > db <- data.frame(x=c(1,2,3,4,5),y=c(1,2,4,2,6)) > plot(db,xlim=c(0,6),yli...
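The toy data in the excerpt can be used to illustrate the claim that the family matters less than one might expect. A minimal sketch (the Poisson/Gamma comparison is my own illustration, not necessarily the post's exact code):

```r
# Toy dataset from the excerpt above
db <- data.frame(x = c(1, 2, 3, 4, 5), y = c(1, 2, 4, 2, 6))

# Same linear predictor and same log link, two different families.
# The fitted means are typically very close: the family mostly drives
# the variance assumption, not the predicted mean.
fit_pois  <- glm(y ~ x, data = db, family = poisson(link = "log"))
fit_gamma <- glm(y ~ x, data = db, family = Gamma(link = "log"))

round(cbind(poisson = fitted(fit_pois), gamma = fitted(fit_gamma)), 3)
```

Comparing the two columns of fitted values side by side makes the point directly: changing the family barely moves the predictions on this data.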


Working with “large” datasets, with dplyr and data.table

04.05.2015

A few months ago, I was doing some training on data science for actuaries, and I started to get interesting, puzzling questions. For instance, Fleur was working on telematic data, and she has been challenging my (rudimentary) knowledge of R. As claimed by Donald Knuth, “we should forget about small efficiencies, say about 97% of the time: prema...


Copulas and Financial Time Series

12.05.2015

I was recently asked to write a survey on copulas for financial time series. The paper is, so far, unfortunately only available in French, on https://hal.archives-ouvertes.fr/. It describes various models, including graphs and statistical outputs obtained from real data. To illustrate, I’ve been using weekly log-returns of...


Data Science: from Small to Big Data

29.05.2015

This Tuesday, I will be in Leuven (Belgium) at the ACP meeting to give a talk on Data Science: from Small to Big Data. The talk will take place in the Faculty Club, from 6 till 8 pm. Slides can be found online (with animated pictures). As usual, comments are welcome...


Who interacts on Twitter during a conference (#JDSLille)

07.06.2015

Disclaimer: this is a joint post with Avner Bar-Hen, a.k.a. @a_bh, Benjamin Guedj, a.k.a. @bguedj, and Nathalie Villa, a.k.a. @Natty_V2. Organised annually since 1970 by the French Society of Statistics (SFdS), the Journées de Statistique (JdS) are the most important scientific event of the French statistical community. More than 400 researchers, ...


p-hacking, or cheating on a p-value

11.06.2015

Yesterday evening, I discovered some interesting slides on False-Positives, p-Hacking, Statistical Power, and Evidential Value, via @UCBITSS’s post on Twitter. More precisely, there was a slide on how to cheat (because that’s basically what it is) to get a ‘good’ model, by targeting the p-value. As mentioned by @david_colquhoun ...
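The mechanics of p-hacking are easy to reproduce by simulation; a minimal sketch (my own illustration, not taken from the slides): run twenty tests on pure noise and report only the "best" p-value.

```r
set.seed(1)
# Twenty t-tests on pure noise: each one has a 5% false-positive
# rate at the 0.05 level, but reporting only the smallest p-value
# inflates that rate to about 1 - 0.95^20, i.e. roughly 64%.
pvals <- replicate(20, t.test(rnorm(30))$p.value)
min(pvals)
```

That minimum is what a p-hacked analysis reports, without mentioning the nineteen discarded tests.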


‘Variable Importance Plot’ and Variable Selection

17.06.2015

Classification trees are nice. They provide an interesting alternative to logistic regression. I started to include them in my courses maybe 7 or 8 years ago. The question is nice (how to get an optimal partition), the algorithmic procedure is nice (the trick of splitting according to one variable, and only one, at each node, and then to ...
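A variable-importance measure falls out of a fitted tree directly; a minimal sketch with rpart on the iris data (my example, not the post's):

```r
library(rpart)  # ships with standard R installations

# Fit a classification tree; the variable.importance component is
# exactly the quantity a 'variable importance plot' displays
fit <- rpart(Species ~ ., data = iris)
fit$variable.importance
```

Sorting or bar-plotting that named vector gives the usual importance plot, which is then a natural input to variable selection.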


An Attempt to Understand Boosting Algorithm(s)

26.06.2015

On Tuesday, at the annual meeting of the French Economic Association, I was having lunch with Alfred, and while we were chatting about modeling issues (econometric models versus machine learning prediction), he asked me what boosting was. Since I could not be very specific, we looked at the Wikipedia page. Boosting is a machine learning ensembl...
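The idea behind boosting, fitting weak learners sequentially on the residuals of the current fit, can be sketched in a few lines. This minimal L2 gradient-boosting loop on stumps is my own illustration of the principle, not the post's code:

```r
library(rpart)
set.seed(1)

# Simulated regression data
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.1)

# L2 boosting: repeatedly fit a depth-1 tree (a 'stump') to the
# current residuals, then add a shrunken copy of it to the prediction
nu <- 0.1            # learning rate (shrinkage)
M  <- 100            # number of boosting rounds
pred <- rep(mean(y), n)
for (m in seq_len(M)) {
  res   <- y - pred  # residuals = gradient of the squared loss
  stump <- rpart(r ~ x, data = data.frame(r = res, x = x),
                 control = rpart.control(maxdepth = 1))
  pred  <- pred + nu * predict(stump)
}
mean((y - pred)^2)   # should be well below var(y)
```

Each weak stump barely fits the data on its own; it is the slow accumulation of many shrunken fits that produces a good predictor.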


Variable Selection using Cross-Validation (and Other Techniques)

01.07.2015

A natural technique for selecting variables in the context of generalized linear models is a stepwise procedure. It is natural, but controversial, as discussed by Frank Harrell in a great post, clearly worth reading. Frank mentioned about 10 points against a stepwise procedure. It yields R-squared values that are badly biased to be high...
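Cross-validation gives the yardstick the title points to. A minimal sketch on hypothetical simulated data (my own illustration, not the post's code): AIC-based stepwise selection via step(), with cv.glm() from the boot package to compare cross-validated prediction errors.

```r
library(boot)  # for cv.glm(); ships with standard R installations
set.seed(1)

# Hypothetical data: only x1 and x2 actually matter, x3..x5 are noise
n <- 200
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
df <- data.frame(y = 1 + 2 * X[, 1] - X[, 2] + rnorm(n), X)

full <- glm(y ~ ., data = df)
# AIC-based stepwise selection (the controversial procedure)
selected <- step(full, trace = 0)

# 10-fold cross-validated prediction error for each model;
# delta[1] is the raw CV estimate of the error
cv_full     <- cv.glm(df, full, K = 10)$delta[1]
cv_selected <- cv.glm(df, selected, K = 10)$delta[1]
c(full = cv_full, selected = cv_selected)
```

The point of the comparison is that the cross-validated error, unlike in-sample R-squared, is not mechanically flattered by the selection step.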
