Publications by Arthur Charpentier
On the “correlation” between a continuous and a categorical variable
Let us get back to the Titanic dataset,
loc_fichier = "http://freakonometrics.free.fr/titanic.RData"
download.file(loc_fichier, "titanic.RData")
load("titanic.RData")
base = base[!is.na(base$Age),]
We consider two variables, the age \(x\) (the continuous one) and the survivor indicator \(y\) (the qualitative one)
X = base$Age
Y = base$Survive...
2223 sym R (2234 sym/10 pcs) 6 img
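As a rough illustration of what a "correlation" between a continuous and a binary variable can mean, here is a minimal sketch on simulated data. It is not necessarily the approach taken in the post: the simulated `X` and `Y` are hypothetical stand-ins for age and the survival indicator, and the two measures shown (point-biserial correlation and the share of variance explained by the groups) are just one common choice.

```r
set.seed(1)
n <- 200

# Simulated stand-ins (assumption: NOT the Titanic data)
X <- rnorm(n, mean = 30, sd = 10)             # continuous variable, e.g. an age
Y <- rbinom(n, 1, plogis(-0.03 * (X - 30)))   # binary 0/1 indicator

# Point-biserial correlation: Pearson correlation with the 0/1 coding
r_pb <- cor(X, Y)

# Share of the variance of X explained by the groups (one-way ANOVA R^2)
fit  <- lm(X ~ factor(Y))
eta2 <- summary(fit)$r.squared

# With only two groups, eta2 equals the squared point-biserial correlation
c(r_pb = r_pb, r_pb_squared = r_pb^2, eta2 = eta2)
```

With more than two categories the 0/1 coding is no longer available, but the variance-explained measure still applies.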
Testing for Covid-19 in the U.S.
For almost a month, on a daily basis, we have been working with colleagues (Romuald, Chi and Mathieu) on modeling the dynamics of the recent pandemic. I learn a lot of things discussing with them, but we keep struggling with the tests. Paul, in Montréal, helped me a little bit, but I think we will still have to do more to get a better understanding. To but...
3846 sym R (1538 sym/6 pcs) 8 img
Regression discontinuity model for TV series
In September, we are usually happy to see our favorite TV series back on air… Or not? Because, admit it, even if we are happy to see those characters back, most of the time we are disappointed. So why not look at the data, to confirm this feeling? Nazareno Andrade shared some nice code to get IMDB ratings in a nice csv file (you can either use the ...
2735 sym R (3034 sym/15 pcs) 26 img
Sharing pictures from holidays in the Canadian Rockies (with R)
My kids have a very popular blog (at least among their grandmothers) where they frequently post pictures from everyday life (since they live 5000 km from them), as well as pictures taken on holidays. This afternoon, I tried to use the popupImage function from the leaflet package to post pictures, on a map (to explain where we spent our holi...
1610 sym R (716 sym/6 pcs)
Trees and forests
For my ACT6100 weekly quiz, I usually generate some datasets, and then ask students to compare various predictive algorithms. Last week, it was about classification trees and random forests. And students were surprised to see such differences (they had to estimate the probability of having a specific label, for the barycenter of the covariates). U...
1708 sym R (1123 sym/7 pcs) 6 img
Insurance Pricing Game
Would you like to put your data science skills to the test? Imperial College London, Université du Québec à Montréal (UQAM), actuarial institutes in Singapore, the UK (including the IFoA) and Australia, ASTIN, and the Casualty Actuarial Society are co-organising a global data science competition. Would you like to accurately predict the cost of ...
1306 sym 2 img
Lilliefors, Kolmogorov-Smirnov and cross-validation
In statistics, the Kolmogorov–Smirnov test is a popular procedure to test whether a sample \(\{x_1,\cdots,x_n\}\) is drawn from a distribution \(F\), or usually \(F_{\theta_0}\), where \(F_{\theta}\) is some parametric distribution. For instance, we can test \(H_0:X_i\sim\mathcal{N}(0,1)\) (where \(\theta_0=(\mu_0,\sigma_0^2)=(0,1)\)) using that test...
4287 sym R (1734 sym/12 pcs) 10 img
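The difference between testing a fully specified \(F_{\theta_0}\) and testing a distribution whose parameters are estimated from the same sample can be sketched with base R's `ks.test`. This is only an illustration on simulated data, not the code of the post; the sample size and seed are arbitrary choices.

```r
set.seed(1)
x <- rnorm(100)   # simulated sample (assumption, for illustration only)

# H0 fully specified: X_i ~ N(0, 1)
ks_fixed <- ks.test(x, "pnorm", mean = 0, sd = 1)

# Parameters plugged in from the same sample: the statistic is computed the
# same way, but the standard Kolmogorov-Smirnov p-value is no longer valid.
# This is the situation that Lilliefors' correction is meant to handle.
ks_plugin <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

c(fixed = unname(ks_fixed$statistic), plugin = unname(ks_plugin$statistic))
```

The second p-value should not be read at face value, since the null distribution of the statistic changes once the parameters are fitted to the sample being tested.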
3rd Insurance Data Science Conference
Registrations and call for abstracts, for the 3rd Insurance Data Science Conference, organised on-line 16 – 18 June 2021 (PM in Europe, AM in America), are now open. See https://insurancedatascience.org/ for more details…
632 sym 2 img
Some general thoughts on Partial Dependence Plots with correlated covariates
The partial dependence plot is a nice tool to analyse the impact of some explanatory variables when using nonlinear models, such as a random forest, or some gradient boosting. The idea (in dimension 2) is as follows: consider a model \(m(x_1,x_2)\) for \(\mathbb{E}[Y|X_1=x_1,X_2=x_2]\). The partial dependence plot for variable \(x_1\) in model \(m\) is the function \(...
3021 sym R (1306 sym/10 pcs) 12 img
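A minimal sketch of the usual empirical partial dependence computation, assuming the standard definition (average the model's predictions over the observed values of the other covariate, with \(x_1\) held fixed). The simulated data, the helper `pdp_x1` and the `lm` fit are hypothetical and only stand in for the nonlinear models mentioned in the excerpt.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + sqrt(1 - 0.8^2) * rnorm(n)   # covariate correlated with x1
y  <- x1 + x2^2 + rnorm(n, sd = 0.5)
df <- data.frame(y = y, x1 = x1, x2 = x2)

# Stand-in model (assumption): any model with a predict() method would do
fit <- lm(y ~ x1 + I(x2^2), data = df)

# Hypothetical helper: empirical partial dependence of the model on x1
pdp_x1 <- function(model, data, grid) {
  sapply(grid, function(v) {
    d <- data
    d$x1 <- v                            # force x1 = v for every observation
    mean(predict(model, newdata = d))    # average over the observed x2 values
  })
}

grid <- seq(-2, 2, by = 0.5)
pd   <- pdp_x1(fit, df, grid)
```

Note that the averaging uses the marginal distribution of \(x_2\) and ignores its correlation with \(x_1\), so some of the evaluated points \((x_1, x_{2,i})\) may lie far from the observed data.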
From multinomial regression to binary classification on some Siamese data
There are two kinds of people in the world: people who think there are two kinds of people in the world and people who don’t (borrowed from Menand (2018)). Because things are always simpler when we face only a binary choice, aren’t they? But consider here the case where multiple options are possible, and let us see if we cannot get back to simpl...
7930 sym R (5447 sym/2 pcs) 8 img