Publications by Arthur Charpentier
How Could Classification Trees Be So Fast on Categorical Variables?
I think that over the past months, I have been saying incorrect things about classification with categorical covariates, because I never took the time to look at it carefully. Consider a simulated dataset, with a logistic regression, > n=1e3 > set.seed(1) > X1=runif(n) > q=quantile(X1,(0:26)/26) > q[1]=0 > X2=cut(X1,q,labels=LETTERS[1:26]) > p=e...
2276 sym R (2023 sym/18 pcs) 32 img
Regression with Splines: Should we care about Non-Significant Components?
Following this morning's course, I got a very interesting question from a student of mine. The question was about having non-significant components in a spline regression. Should we consider a model with a small number of knots and all components significant, or one with a (much) larger number of knots, many of them non-significant?...
2830 sym R (2713 sym/11 pcs) 16 img
Confidence Regions for Parameters in the Simplex
Consider here the case where, in some parametric inference problem, the parameter is a point in the simplex. For instance, consider some regression on compositional data, > library(compositions) > data(DiagnosticProb) > Y=DiagnosticProb[,"type"]-1 > X=DiagnosticProb[,c("A","B","C")] > model = glm(Y~ilr(X),family=binomial) > b = ilrInv(coef(m...
1006 sym R (1944 sym/5 pcs) 8 img
Simple Distributions for Mixtures?
The idea of GLMs is that, given some covariates, the response has a conditional distribution in the exponential family (Gaussian, Poisson, Gamma, etc). But that does not mean that the response itself has a similar (unconditional) distribution… so there is no reason to test for a Gamma model for the response before running a Gamma regression, for instance. But are there cases where it might work? That the non-...
3167 sym R (542 sym/3 pcs) 64 img
Clusters of (French) Regions
For tomorrow's data science course, I just wanted to post some functions to illustrate cluster analysis. Consider the dataset of the French 2012 elections > elections2012=read.table( "http://freakonometrics.free.fr/elections_2012_T1.csv",sep=";",dec=",",header=TRUE) > voix=which(substr(names( + elections2012),1,11)=="X..Voix.Exp") > election...
948 sym R (1702 sym/7 pcs) 12 img
Clusters of Texts
Another popular application of classification techniques is text mining (see e.g. an old post on French presidents' speeches). Consider the following example, inspired by Norbert Ryciak’s post, with 12 Wikipedia pages, on various topics, > library(tm) > library(stringi) > library(proxy) > titles = c("Boosting_(machine_learning)", + ...
2032 sym R (2807 sym/9 pcs) 4 img
Clustering French Cities (based on Temperatures)
In order to illustrate hierarchical clustering techniques and k-means, I borrowed François Husson’s dataset, with monthly average temperatures in several French cities. > temp=read.table( + "http://freakonometrics.free.fr/FR_temp.txt", + header=TRUE,dec=",") We have 15 cities, with monthly observations > X=temp[,1:12] > boxplot(X) Since the...
1660 sym R (1502 sym/14 pcs) 14 img
Reverse Engineering with Correlated Features
In econometric modeling, I usually have a problem with correlated features. A few weeks ago, I was discussing feature selection when features are correlated. This week, I was wondering about reverse engineering when features might be correlated (not to say very correlated). The way I see reverse engineering is the following: someone has some da...
6218 sym R (4077 sym/12 pcs) 34 img
Spatial and Temporal Viz of Gas Price, in France
A great thing in France is that we can play with a great database of gas prices, in all gas stations, almost every day. The file is rather big, so let’s make sure we have enough memory to run our code, > rm(list=ls()) To extract the data, first, we should extract the xml file, and then convert it into a more common R object (say a list) > year=2...
2452 sym R (4524 sym/12 pcs) 12 img
Mortality by Weekday and Age
A few days ago, I mentioned on Twitter a nice graph: "Mortality by Weekday and Age https://t.co/LyzQ7nJABZ very interesting difference, young vs. old pic.twitter.com/EfrX0C1GBS" (Arthur Charpentier, @freakonometrics, February 27, 2016). My colleague Jean-Philippe was extremely sceptical, so I tried to reproduce that graph. The good thing is ...
1677 sym R (8595 sym/6 pcs) 4 img