Publications by Arthur Charpentier

Visualising a Classification in High Dimension

06.03.2015

So far, when discussing classification, we’ve been playing with my toy dataset (actually, I should not claim it’s mine; it is inspired by the one used in the introduction of Boosting, by Robert Schapire and Yoav Freund). But in real life, there are more observations, and more explanatory variables. With more than two explanatory variables, it star...
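The post works with a medical dataset with several features. As a minimal sketch of the idea (not the post's actual code, and using the built-in `iris` data as a stand-in for the post's dataset), one can fit a classifier on more than two explanatory variables and visualise the predicted probabilities on a 2-D projection:

```r
# Sketch: visualise a classifier with >2 features by projecting
# predictions onto the first two principal components.
df <- iris
df$Y <- as.numeric(df$Species == "virginica")        # binary target
fit <- glm(Y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = df, family = binomial)
pca  <- prcomp(df[, 1:4], scale. = TRUE)             # 2-D projection
df$p <- predict(fit, type = "response")              # predicted probabilities
plot(pca$x[, 1], pca$x[, 2], col = grey(1 - df$p), pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Darker points are observations predicted with high probability; the projection is only for display, the classifier itself uses all four features.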


Some Intuition About the Theory of Statistical Learning

07.03.2015

While I was working on the Theory of Statistical Learning, and the concept of consistency, I found the following popular graph (e.g. from those slides, here in French). The curve below is the error on the training sample, as a function of the size of the training sample. Above, it is the error on a validation sample. Our learning process is co...
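The graph the excerpt describes can be reproduced on simulated data. Below is a sketch (an assumed setup, not the post's code): training and validation error of a fixed model, as functions of the training-set size.

```r
# Sketch: training vs validation error as a function of training-set size.
set.seed(1)
n <- 1000
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = .3)
sizes <- seq(20, 500, by = 20)
err <- t(sapply(sizes, function(m) {
  train <- 1:m; valid <- 501:1000            # disjoint validation set
  fit  <- lm(y ~ poly(x, 5), data = data.frame(x = x[train], y = y[train]))
  e_tr <- mean((y[train] - predict(fit))^2)  # in-sample error
  e_va <- mean((y[valid] -
                predict(fit, newdata = data.frame(x = x[valid])))^2)
  c(train = e_tr, valid = e_va)
}))
matplot(sizes, err, type = "l", lty = 1,
        xlab = "training size", ylab = "MSE")
```

As the training size grows, the two curves typically converge towards the irreducible error, which is the consistency phenomenon the post discusses.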


Some More Results on the Theory of Statistical Learning

08.03.2015

Yesterday, I did mention a popular graph discussed when studying theoretical foundations of statistical learning. But there is usually another one, which is the following. Let us get back to the underlying formulas. On the training sample, we have some empirical risk, defined for some loss function. Why is it complicated? From the law of lar...
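In standard notation (assumed here, since the formulas appear as images in the original post), the empirical risk and its population counterpart are:

```latex
\widehat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(h(x_i),\,y_i\bigr),
\qquad
R(h) = \mathbb{E}\bigl[\ell\bigl(h(X),\,Y\bigr)\bigr]
```

By the law of large numbers, $\widehat{R}_n(h)\to R(h)$ for each fixed $h$; the subtlety the post alludes to is that the guarantee must hold uniformly over the whole class of candidate models $h$.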


Growing some Trees

18.03.2015

Consider here the dataset used in a previous post, about visualising a classification (with more than 2 features), > MYOCARDE=read.table( + "http://freakonometrics.free.fr/saporta.csv", + header=TRUE,sep=";") The default classification tree is > arbre = rpart(factor(PRONO)~.,data=MYOCARDE) > rpart.plot(arbre,type=4,extra=6) We can change the opt...
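The post fits a default classification tree on the MYOCARDE data read from freakonometrics.free.fr. A sketch along the same lines, on a built-in dataset so it runs offline (the `rpart.plot` call is commented out since that package may not be installed):

```r
# Sketch: default classification tree, as in the post, but on mtcars.
library(rpart)
mt <- mtcars
mt$am <- factor(mt$am)                 # transmission type as the class label
arbre <- rpart(am ~ ., data = mt)      # default rpart settings
print(arbre)
# rpart.plot::rpart.plot(arbre, type = 4, extra = 6)  # if rpart.plot is installed
```

The `type` and `extra` options of `rpart.plot` only change how the tree is drawn, not the tree itself.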


Forecast, Automatic Routines vs. Experience

18.03.2015

This morning, in our Time Series course, we’ve been playing with some data I got from google.ca/trends/. Actually, we’ve been playing with an old version, downloaded 18 months ago (discussed in a previous post, in French). > urls = "http://freakonometrics.free.fr/report-headphones-2015.csv" > report=read.table( + urls,skip=4,header=TRUE,sep="...
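In the spirit of the post's "automatic routine" forecasts, here is a sketch on the built-in monthly `AirPassengers` series instead of the Google Trends CSV (an assumption; the post's own data and models may differ):

```r
# Sketch: automatic exponential-smoothing forecast on a monthly series.
hw <- HoltWinters(log(AirPassengers))   # trend + seasonality, on the log scale
fc <- predict(hw, n.ahead = 12)         # 12-month-ahead forecast
plot(hw, fc)                            # fitted values and forecast
```

The contrast the title points at is between such fully automatic routines and forecasts adjusted with domain experience.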


Regression Models, It’s Not Only About Interpretation

22.03.2015

Yesterday, I did upload a post where I tried to show that “standard” regression models were not performing badly. At least if you include splines (multivariate splines) to take into account joint effects, and nonlinearities. So far, I do not discuss the possible high number of features (but with bootstrap procedures, it is possible to assess so...
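A minimal sketch of the point about splines (using `bs()` from the base `splines` package and the built-in `cars` data, not the post's dataset): a linear model enriched with a spline basis captures a nonlinearity the plain model misses.

```r
# Sketch: "standard" regression vs the same regression with a spline basis.
library(splines)
fit_lin <- lm(dist ~ speed, data = cars)
fit_spl <- lm(dist ~ bs(speed, df = 4), data = cars)  # cubic B-spline basis
AIC(fit_lin, fit_spl)   # compare the two fits
```

Multivariate splines (e.g. tensor products of such bases) extend the same idea to joint effects of several features.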


Splitting a Node in a Tree

23.03.2015

If we grow a tree with standard functions in R, on the same dataset used to introduce classification trees in a previous post, > MYOCARDE=read.table( + "http://freakonometrics.free.fr/saporta.csv", + head=TRUE,sep=";") > library(rpart) > cart<-rpart(PRONO~.,data=MYOCARDE) we get > library(rpart.plot) > library(rattle) > prp(cart,type=2,extra=1)...
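What happens inside a node can be sketched by hand: scan candidate split points on one variable and keep the one minimising the weighted Gini impurity. This is an illustration of the criterion, not `rpart`'s actual code; the built-in `cars` data stands in for MYOCARDE.

```r
# Sketch: pick the split point minimising weighted Gini impurity.
gini <- function(y) { p <- mean(y); 2 * p * (1 - p) }   # y is 0/1
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  imp <- sapply(cuts, function(cut) {
    L <- y[x <= cut]; R <- y[x > cut]
    if (!length(L) || !length(R)) return(Inf)           # degenerate split
    (length(L) * gini(L) + length(R) * gini(R)) / length(y)
  })
  cuts[which.min(imp)]
}
# Example: split speed to separate long vs short stopping distances
s <- best_split(cars$speed, as.numeric(cars$dist > median(cars$dist)))
s
```

`rpart` does the same scan over every variable at every node, then recurses on the two children.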


Interactive Maps for John Snow’s Cholera Data

28.03.2015

This week, in Istanbul, for the second training on data science, we’ve been discussing classification and regression models, but also visualisation. Including maps. And we did have a brief introduction to the leaflet package, devtools::install_github("rstudio/leaflet") require(leaflet) To see what can be done with that package, we will use on...
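A minimal leaflet sketch (assumes the package is installed; the coordinates below are an approximate point near Broad Street chosen for illustration, not taken from the post's data):

```r
# Sketch: a minimal interactive map with leaflet.
if (requireNamespace("leaflet", quietly = TRUE)) {
  library(leaflet)
  m <- leaflet() %>%
    addTiles() %>%                                   # OpenStreetMap tiles
    addCircleMarkers(lng = -0.1367, lat = 51.5132,   # approximate, illustrative point
                     radius = 8, color = "red")
  m   # renders the interactive map in RStudio or a browser
}
```

The post goes on to overlay the actual cholera-death locations on such a map.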


Another Interactive Map for the Cholera Dataset

31.03.2015

Following my previous post, François (aka @FrancoisKeck) posted a comment mentioning another package I could use to get an interactive map, the rleafmap package. And the heatmap was easy to include here. This time, we do not use openstreetmap. The first part is still the same, to get the data, > require(rleafmap) > library(sp) > library(rgdal) ...


Classification with Categorical Variables (the fuzzy side)

09.04.2015

The Gaussian and the (log) Poisson regressions share a very interesting property, i.e. the average predicted value is the empirical mean of our sample. > mean(predict(lm(dist~speed,data=cars))) [1] 42.98 > mean(cars$dist) [1] 42.98 One can prove that it is also the prediction for the average individual in our sample > predict(lm(dist~speed,data=...
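The property quoted in the excerpt can be checked directly in a couple of lines (this uses the same built-in `cars` data as the post): with an intercept, OLS predictions average to the sample mean of the response, and the regression line passes through the point of means.

```r
# Check: mean of OLS predictions equals the empirical mean of the response.
fit <- lm(dist ~ speed, data = cars)
all.equal(mean(predict(fit)), mean(cars$dist))   # TRUE
# Same value predicted for the "average individual":
predict(fit, newdata = data.frame(speed = mean(cars$speed)))
```

The post's point is that this convenient averaging property holds for Gaussian and log-Poisson regressions, but breaks down for other models, which matters when aggregating predictions over categories.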
