Publications by Arthur Charpentier

On the “correlation” between a continuous and a categorical variable

04.04.2020

Let us get back to the Titanic dataset: loc_fichier = "http://freakonometrics.free.fr/titanic.RData"; download.file(loc_fichier, "titanic.RData"); load("titanic.RData"); base = base[!is.na(base$Age),] We consider two variables, the age \(x\) (the continuous one) and the survivor indicator \(y\) (the qualitative one): X = base$Age; Y = base$Survive...

2223 sym R (2234 sym/10 pcs) 6 img
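One common way to quantify the link between a continuous and a categorical variable is the correlation ratio, the share of the variance of \(x\) explained by the groups of \(y\). The sketch below is only an illustration of that idea, not necessarily the approach taken in the post. It uses simulated data as a stand-in for the Titanic file, and the group means, group labels and sample size are all hypothetical.

```r
# Simulated stand-in for the Titanic data (hypothetical values, not the real file)
set.seed(1)
n <- 500
Y <- factor(rbinom(n, size = 1, prob = 0.4), labels = c("died", "survived"))
X <- rnorm(n, mean = ifelse(Y == "survived", 28, 31), sd = 14)

# Correlation ratio: between-group sum of squares over total sum of squares
group_sizes <- as.vector(table(Y))
group_means <- as.vector(tapply(X, Y, mean))
ss_between <- sum(group_sizes * (group_means - mean(X))^2)
ss_total <- sum((X - mean(X))^2)
eta2 <- ss_between / ss_total

# With a binary Y, eta2 equals the squared Pearson correlation
# between X and the 0/1 coding of Y (point-biserial correlation)
r_pb <- cor(X, as.numeric(Y == "survived"))
```

With more than two categories the point-biserial shortcut no longer applies, but `eta2` is computed in the same way.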

Testing for Covid-19 in the U.S.

28.04.2020

For almost a month, on a daily basis, we have been working with colleagues (Romuald, Chi and Mathieu) on modeling the dynamics of the recent pandemic. I learn a lot of things discussing with them, but we keep struggling with the tests. Paul, in Montréal, helped me a little bit, but I think we will still have to do more to get a better understanding. To but...

3846 sym R (1538 sym/6 pcs) 8 img

Regression discontinuity model for TV series

12.07.2020

In September, we are usually happy to see our favorite TV series back on air… Or not? Because, admit it, even if we are happy to see those characters back, most of the time we are disappointed. So why not look at the data to confirm this feeling? Nazareno Andrade shared some nice code to get IMDB ratings in a nice CSV file (you can either use the ...

2735 sym R (3034 sym/15 pcs) 26 img

Sharing pictures from holidays in the Canadian Rockies (with R)

09.08.2020

My kids have a very popular blog (at least among their grandmothers) where they frequently post pictures from everyday life (since they live 5000 km from them), as well as pictures taken on holidays. This afternoon, I tried to use the popupImage function from the leaflet package to post pictures on a map (to explain where we spent our holi...

1610 sym R (716 sym/6 pcs)

Trees and forests

30.11.2020

For my ACT6100 weekly quiz, I usually generate some datasets, and then ask students to compare various predictive algorithms. Last week, it was about classification trees and random forests. Students were surprised to get such different results (they had to estimate the probability of having a specific label at the barycenter of the covariates). U...

1708 sym R (1123 sym/7 pcs) 6 img

Insurance Pricing Game

18.12.2020

Would you like to put your data science skills to the test? Imperial College London, Université du Québec à Montréal (UQAM), actuarial institutes in Singapore, the UK (including the IFoA) and Australia, ASTIN, and the Casualty Actuarial Society are co-organising a global data science competition. Would you like to accurately predict the cost of ...

1306 sym 2 img

Lilliefors, Kolmogorov-Smirnov and cross-validation

05.01.2021

In statistics, the Kolmogorov–Smirnov test is a popular procedure to test whether a sample \(\{x_1,\cdots,x_n\}\) is drawn from a distribution \(F\), or usually \(F_{\theta_0}\), where \(F_{\theta}\) is some parametric distribution. For instance, we can test \(H_0:X_i\sim\mathcal{N}(0,1)\) (where \(\theta_0=(\mu_0,\sigma_0^2)=(0,1)\)) using that test...

4287 sym R (1734 sym/12 pcs) 10 img
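As an illustration of the Kolmogorov–Smirnov entry above, here is a minimal sketch of testing \(H_0:X_i\sim\mathcal{N}(0,1)\) with `ks.test` from base R. The sample is simulated, and the seed and sample size are arbitrary choices. The statistic is also recomputed by hand as the largest gap between the empirical distribution function and \(F_{\theta_0}\).

```r
set.seed(42)
n <- 100
x <- rnorm(n)  # simulated sample (assumption: drawn under H0)

# Kolmogorov-Smirnov test of H0: X_i ~ N(0,1), parameters fixed in advance
ks <- ks.test(x, "pnorm", mean = 0, sd = 1)

# Same statistic by hand: largest distance between the empirical cdf
# (evaluated just before and at each order statistic) and the N(0,1) cdf
Fx <- pnorm(sort(x), mean = 0, sd = 1)
D <- max(c((1:n) / n - Fx, Fx - (0:(n - 1)) / n))
```

Note that this p-value assumes \(\theta_0\) is fixed beforehand. If the mean and variance are estimated from the same sample, the usual Kolmogorov–Smirnov distribution no longer applies, which is the situation the Lilliefors correction addresses.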

3rd Insurance Data Science Conference

25.01.2021

Registrations and call for abstracts, for the 3rd Insurance Data Science Conference, organised on-line 16 – 18 June 2021 (PM in Europe, AM in America), are now open. See https://insurancedatascience.org/ for more details…

632 sym 2 img

Some general thoughts on Partial Dependence Plots with correlated covariates

12.02.2021

The partial dependence plot is a nice tool to analyse the impact of some explanatory variables when using nonlinear models, such as a random forest or gradient boosting. The idea (in dimension 2) is as follows: given a model \(m(x_1,x_2)\) for \(\mathbb{E}[Y|X_1=x_1,X_2=x_2]\), the partial dependence plot for variable \(x_1\) in model \(m\) is the function \(...

3021 sym R (1306 sym/10 pcs) 12 img
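For the partial dependence entry above, the following sketch assumes the usual empirical definition: for each value \(x_1\) on a grid, set the first covariate to \(x_1\) for every observation, predict with \(m\), and average over the observed \(x_{2,i}\). The data are simulated with correlated covariates. A linear model with an interaction term stands in for a random forest or boosting model so that the example runs in base R. The helper `partial_dependence` and all parameter values are hypothetical.

```r
set.seed(123)
n <- 1000
rho <- 0.8  # assumed correlation between the two covariates
x1 <- rnorm(n)
x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
y <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(n)
df <- data.frame(y = y, x1 = x1, x2 = x2)

# Stand-in model m(x1, x2); any model with a predict() method would do
m <- lm(y ~ x1 * x2, data = df)

# Empirical partial dependence: average prediction when `variable`
# is forced to the value v for all observations
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(v) {
    data_v <- data
    data_v[[variable]] <- v
    mean(predict(model, newdata = data_v))
  })
}

grid <- seq(-2, 2, by = 0.5)
pd <- partial_dependence(m, df, "x1", grid)
```

Because the average is taken over all observed \(x_{2,i}\) whatever the value of \(x_1\), strongly correlated covariates mean that some of the averaged points \((x_1, x_{2,i})\) lie far from the data.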

From multinomial regression to binary classification on some Siamese data

14.03.2021

There are two kinds of people in the world: people who think there are two kinds of people in the world and people who don’t (borrowed from Menand (2018)). Because things are always simpler when we face only a binary choice, aren’t they? But consider here the case where multiple options are possible, and let us see if we cannot get back to simpl...

7930 sym R (5447 sym/2 pcs) 8 img