Publications by Arthur Charpentier

How Could Classification Trees Be So Fast on Categorical Variables?

08.12.2015

I think that over the past months, I have been saying incorrect things about classification with categorical covariates, because I never took the time to look at it carefully. Consider some simulated dataset, with a logistic regression,

> n=1e3
> set.seed(1)
> X1=runif(n)
> q=quantile(X1,(0:26)/26)
> q[1]=0
> X2=cut(X1,q,labels=LETTERS[1:26])
> p=e...

2276 sym R (2023 sym/18 pcs) 32 img

Regression with Splines: Should we care about Non-Significant Components?

04.01.2016

Following the course of this morning, I got a very interesting question from a student of mine. The question was about having non-significant components in a spline regression. Should we consider a model with a small number of knots and all components significant, or one with a (much) larger number of knots, many of them non-significant?...

2830 sym R (2713 sym/11 pcs) 16 img
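
The question raised in the excerpt above can be explored on simulated data. The following is only a minimal sketch, not the code of the post: the sine signal, the noise level and the two basis sizes (`df = 4` and `df = 15`) are assumptions chosen for illustration.

```r
library(splines)

set.seed(1)
n <- 200
x <- runif(n, 0, 10)                 # hypothetical covariate
y <- sin(x) + rnorm(n, sd = 0.3)     # assumed smooth signal plus noise

# Small B-spline basis: few knots, few components
fit_small <- lm(y ~ bs(x, df = 4))
# Large B-spline basis: many knots, many components
fit_large <- lm(y ~ bs(x, df = 15))

# Individual t-tests for each spline component of the large model
summary(fit_large)$coefficients

# Global comparison of the two fits (the two bases are not nested,
# so an information criterion is used rather than an F-test)
AIC(fit_small, fit_large)
```

Individual components of the large model may well look non-significant even when the model as a whole fits better, which is why a global criterion is compared here as well.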

Confidence Regions for Parameters in the Simplex

18.01.2016

Consider here the case where, in some parametric inference problem, the parameter is a point in the simplex. For instance, consider some regression on compositional data,

> library(compositions)
> data(DiagnosticProb)
> Y=DiagnosticProb[,"type"]-1
> X=DiagnosticProb[,c("A","B","C")]
> model = glm(Y~ilr(X),family=binomial)
> b = ilrInv(coef(m...

1006 sym R (1944 sym/5 pcs) 8 img

Simple Distributions for Mixtures?

03.02.2016

The idea of GLMs is that given some covariates X, Y|X has a distribution in the exponential family (Gaussian, Poisson, Gamma, etc). But that does not mean that Y has a similar distribution… so there is no reason to test for a Gamma model for Y before running a Gamma regression, for instance. But are there cases where it might work? That the non-...

3167 sym R (542 sym/3 pcs) 64 img
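
To make the distinction in the excerpt above concrete, here is a small simulated sketch, under assumed values: the covariate is uniform, the conditional mean is `exp(1 + 2 * x)` and the shape parameter is 5. None of these come from the post. Given the covariate, the response is Gamma, while the pooled sample is a mixture of Gamma distributions with different means.

```r
set.seed(1)
n <- 1000
x <- runif(n)                        # hypothetical covariate
mu <- exp(1 + 2 * x)                 # assumed conditional mean (log link)
shape <- 5                           # assumed common shape parameter

# Y | X = x is Gamma with mean mu and coefficient of variation 1/sqrt(shape)
y <- rgamma(n, shape = shape, rate = shape / mu)

# The Gamma regression targets the conditional distribution
fit <- glm(y ~ x, family = Gamma(link = "log"))
coef(fit)

# The pooled (marginal) sample mixes Gamma distributions with different means,
# so it is more dispersed than any single conditional distribution
cv_conditional <- 1 / sqrt(shape)
cv_marginal <- sd(y) / mean(y)
c(conditional = cv_conditional, marginal = cv_marginal)
```

The regression should recover coefficients close to the assumed ones even though nothing was checked about the marginal distribution of the response.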

Clusters of (French) Regions

09.02.2016

For the data science course of tomorrow, I just wanted to post some functions to illustrate cluster analysis. Consider the dataset of the French 2012 elections

> elections2012=read.table("http://freakonometrics.free.fr/elections_2012_T1.csv",sep=";",dec=",",header=TRUE)
> voix=which(substr(names(
+ elections2012),1,11)=="X..Voix.Exp")
> election...

948 sym R (1702 sym/7 pcs) 12 img

Clusters of Texts

10.02.2016

Another popular application of classification techniques is text mining (see e.g. an old post on French president speeches). Consider the following example, inspired by Nobert Ryciak’s post, with 12 wikipedia pages, on various topics,

> library(tm)
> library(stringi)
> library(proxy)
> titles = c("Boosting_(machine_learning)",
+ ...

2032 sym R (2807 sym/9 pcs) 4 img

Clustering French Cities (based on Temperatures)

11.02.2016

In order to illustrate hierarchical clustering techniques and k-means, I borrowed François Husson's dataset, with monthly average temperatures in several French cities.

> temp=read.table(
+ "http://freakonometrics.free.fr/FR_temp.txt",
+ header=TRUE,dec=",")

We have 15 cities, with monthly observations

> X=temp[,1:12]
> boxplot(X)

Since the...

1660 sym R (1502 sym/14 pcs) 14 img
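
The two techniques named in the excerpt above can be sketched as follows. This is a hedged illustration on a simulated table, not the post's analysis: the 15 x 12 matrix below is a hypothetical stand-in for the temperature data, and the choice of three groups and of Ward linkage are assumptions.

```r
set.seed(1)

# Hypothetical stand-in for the temperature table: 15 "cities", 12 monthly values
seasonal <- 12 + 8 * sin((1:12 - 4) * pi / 6)
X <- t(sapply(1:15, function(i) seasonal + rnorm(1, 0, 3) + rnorm(12, 0, 0.5)))
rownames(X) <- paste0("city", 1:15)
colnames(X) <- month.abb

# Hierarchical clustering on Euclidean distances, Ward linkage (assumed choice)
hc <- hclust(dist(X), method = "ward.D2")
groups_hc <- cutree(hc, k = 3)       # cut the dendrogram into 3 groups

# k-means with the same number of groups, several random starts
km <- kmeans(X, centers = 3, nstart = 20)

# Cross-tabulate the two partitions to see how far they agree
table(hierarchical = groups_hc, kmeans = km$cluster)
```

Calling `plot(hc)` would draw the dendrogram; the cross-table is a quick way to compare both partitions without a plot.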

Reverse Engineering with Correlated Features

11.02.2016

In econometric modeling, I usually have a problem with correlated features. A few weeks ago, I was discussing feature selection when features are correlated. This week, I was wondering about reverse engineering when features might be correlated (not to say very correlated). The way I see reverse engineering is the following: someone has some da...

6218 sym R (4077 sym/12 pcs) 34 img

Spatial and Temporal Viz of Gas Price, in France

25.02.2016

A great thing in France is that we can play with a great database of gas prices, in all gas stations, almost every day. The file is rather big, so let’s make sure we have enough memory to run our code,

> rm(list=ls())

To extract the data, first, we should extract the xml file, and then convert it into a more common R object (say a list)

> year=2...

2452 sym R (4524 sym/12 pcs) 12 img

Mortality by Weekday and Age

27.02.2016

A few days ago, I mentioned a nice graph on Twitter:

"Mortality by Weekday and Age https://t.co/LyzQ7nJABZ very interesting difference, young vs. old pic.twitter.com/EfrX0C1GBS" (Arthur Charpentier, @freakonometrics, 27 February 2016)

My colleague Jean-Philippe was extremely sceptical, so I tried to reproduce that graph. The good thing is ...

1677 sym R (8595 sym/6 pcs) 4 img