Publications by Arthur Charpentier
Visualising a Classification in High Dimension
So far, when discussing classification, we’ve been playing with my toy dataset (actually, I should not claim it’s mine, it is inspired by the one used in the introduction of Boosting, by Robert Schapire and Yoav Freund). But in real life, there are more observations, and more explanatory variables. With more than two explanatory variables, it star...
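Since the excerpt is truncated, here is a minimal sketch of one way to look at a classifier when there are more than two features: fit the model on all covariates, then display the predictions on the first two principal components. The MYOCARDE dataset (the URL is the one used in the later posts of this list) and the PCA projection are my own choices for illustration, not necessarily the technique of the full post.

MYOCARDE = read.table(
  "http://freakonometrics.free.fr/saporta.csv",
  header = TRUE, sep = ";")
## fit a logistic regression on all features, then plot the predictions
## on the first two principal components (projection is an assumption)
fit  = glm(factor(PRONO) ~ ., data = MYOCARDE, family = binomial)
phat = predict(fit, type = "response")  # probability of the 2nd factor level
pca  = prcomp(MYOCARDE[, names(MYOCARDE) != "PRONO"], scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], pch = 19,
     col = ifelse(phat > .5, "blue", "red"),
     xlab = "first principal component",
     ylab = "second principal component")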
Some Intuition About the Theory of Statistical Learning
While I was working on the Theory of Statistical Learning, and the concept of consistency, I found the following popular graph (e.g. from those slides, here in French). The curve below is the error on the training sample, as a function of the size of the training sample. Above, it is the error on a validation sample. Our learning process is co...
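To make that picture concrete, here is a small simulation sketch of the two curves; the data-generating process and the logistic-regression learner are my own choices for illustration, not necessarily those of the post.

## learning curves: misclassification error on the training sample and on
## a fixed validation sample, as the training size n grows
set.seed(1)
gen = function(n){
  x = matrix(rnorm(n * 2), n, 2)
  y = rbinom(n, 1, 1 / (1 + exp(-(x[, 1] + x[, 2]))))
  data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
}
valid = gen(2000)
ns = seq(20, 1000, by = 20)
err = t(sapply(ns, function(n){
  train = gen(n)
  fit = glm(y ~ x1 + x2, data = train, family = binomial)
  c(train = mean((predict(fit, train, type = "response") > .5) != train$y),
    valid = mean((predict(fit, valid, type = "response") > .5) != valid$y))
}))
matplot(ns, err, type = "l", lty = 1, col = c("blue", "red"),
        xlab = "size of the training sample", ylab = "error")
legend("topright", c("training", "validation"), lty = 1,
       col = c("blue", "red"))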
Some More Results on the Theory of Statistical Learning
Yesterday, I did mention a popular graph discussed when studying the theoretical foundations of statistical learning. But there is usually another one, which is the following. Let us get back to the underlying formulas. On the training sample, we have some empirical risk, defined as $\widehat{R}_n(h)=\frac{1}{n}\sum_{i=1}^n \ell\big(y_i,h(\boldsymbol{x}_i)\big)$ for some loss function $\ell$. Why is it complicated? From the law of lar...
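The excerpt cuts off mid-argument; as a hedged reconstruction of the standard reasoning it is presumably heading towards (Vapnik's uniform law of large numbers), in LaTeX:

\[
\widehat{R}_n(h)=\frac{1}{n}\sum_{i=1}^n \ell\big(y_i,h(\boldsymbol{x}_i)\big)
\xrightarrow[n\to\infty]{\text{a.s.}} R(h)=\mathbb{E}\big[\ell\big(Y,h(\boldsymbol{X})\big)\big]
\]
holds for each fixed $h$ by the law of large numbers, but the model $\widehat{h}_n$ is chosen after seeing the data, so consistency of empirical risk minimisation requires the uniform statement
\[
\sup_{h\in\mathcal{H}}\big|\widehat{R}_n(h)-R(h)\big|\ \overset{\mathbb{P}}{\longrightarrow}\ 0 .
\]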
Growing some Trees
Consider here the dataset used in a previous post, about visualising a classification (with more than 2 features),

> MYOCARDE=read.table(
+   "http://freakonometrics.free.fr/saporta.csv",
+   header=TRUE,sep=";")

The default classification tree is

> library(rpart)
> library(rpart.plot)
> arbre = rpart(factor(PRONO)~.,data=MYOCARDE)
> rpart.plot(arbre,type=4,extra=6)

We can change the opt...
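The excerpt breaks off at "change the opt...": a hedged sketch of the kind of rpart options one typically changes (the specific values of cp and minsplit below are my own, for illustration):

library(rpart)
library(rpart.plot)
MYOCARDE = read.table(
  "http://freakonometrics.free.fr/saporta.csv",
  header = TRUE, sep = ";")
## grow a larger tree by relaxing the complexity and split-size constraints
arbre2 = rpart(factor(PRONO) ~ ., data = MYOCARDE,
               control = rpart.control(cp = 0.001, minsplit = 2))
rpart.plot(arbre2, type = 4, extra = 6)
## then prune back using the cross-validated complexity table
printcp(arbre2)
arbre3 = prune(arbre2,
               cp = arbre2$cptable[which.min(arbre2$cptable[, "xerror"]), "CP"])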
Forecast, Automatic Routines vs. Experience
This morning, in our Time Series course, we’ve been playing with some data I got from google.ca/trends/. Actually, we’ve been playing with an old version, downloaded 18 months ago (discussed in a previous post, in French).

> urls = "http://freakonometrics.free.fr/report-headphones-2015.csv"
> report=read.table(
+   urls,skip=4,header=TRUE,sep="...
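Since the code above is cut off before any model is fitted, here is a hedged sketch of the "automatic routine" side of the comparison, using a built-in seasonal series as a stand-in for the headphones data (the hand-picked SARIMA orders are my own illustration of what "experience" might suggest):

library(forecast)
X = log(AirPassengers)                   # stand-in seasonal series
fit_auto = auto.arima(X)                 # the automatic routine
fit_hand = Arima(X, order = c(0, 1, 1),  # an "airline" model, the kind
                 seasonal = c(0, 1, 1))  # one might pick from experience
plot(forecast(fit_auto, h = 24))
lines(forecast(fit_hand, h = 24)$mean, col = "red")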
Regression Models, It’s Not Only About Interpretation
Yesterday, I uploaded a post where I tried to show that “standard” regression models were not performing badly. At least if you include splines (multivariate splines) to take into account joint effects, and nonlinearities. So far, I have not discussed the possibly high number of features (but with bootstrap procedures, it is possible to assess so...
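As a hedged sketch of what including (multivariate) splines for joint effects can look like, here with mgcv on the same MYOCARDE data; the choice of covariate pair and of a tensor-product interaction is mine, not necessarily the post's:

library(mgcv)
MYOCARDE = read.table(
  "http://freakonometrics.free.fr/saporta.csv",
  header = TRUE, sep = ";")
## logistic additive model with a bivariate (tensor-product) spline to
## capture the joint effect of two covariates; the pair is arbitrary
fit = gam(factor(PRONO) ~ s(INSYS) + s(REPUL) + ti(INSYS, REPUL),
          family = binomial, data = MYOCARDE)
summary(fit)
vis.gam(fit, view = c("INSYS", "REPUL"),
        type = "response", plot.type = "contour")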
Splitting a Node in a Tree
If we grow a tree with standard functions in R, on the same dataset used to introduce classification trees in a previous post,

> MYOCARDE=read.table(
+   "http://freakonometrics.free.fr/saporta.csv",
+   header=TRUE,sep=";")
> library(rpart)
> cart<-rpart(PRONO~.,data=MYOCARDE)

we get

> library(rpart.plot)
> library(rattle)
> prp(cart,type=2,extra=1)...
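The post goes on to reproduce the first split by hand; here is a hedged sketch of that standard computation, scanning candidate thresholds for one covariate and maximising the reduction in Gini impurity (my own code, written to mimic what rpart does by default):

MYOCARDE = read.table(
  "http://freakonometrics.free.fr/saporta.csv",
  header = TRUE, sep = ";")
gini = function(y) {            # Gini impurity of a vector of labels
  p = table(y) / length(y)
  1 - sum(p^2)
}
split_gain = function(x, y) {   # best threshold for one covariate
  s = sort(unique(x))
  cuts = (head(s, -1) + tail(s, -1)) / 2
  gain = sapply(cuts, function(c) {
    l = y[x <= c]; r = y[x > c]
    gini(y) - (length(l) * gini(l) + length(r) * gini(r)) / length(y)
  })
  c(threshold = cuts[which.max(gain)], gain = max(gain))
}
split_gain(MYOCARDE$INSYS, MYOCARDE$PRONO)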
Interactive Maps for John Snow’s Cholera Data
This week, in Istanbul, for the second training on data science, we’ve been discussing classification and regression models, but also visualisation, including maps. And we did have a brief introduction to the leaflet package,

devtools::install_github("rstudio/leaflet")
require(leaflet)

To see what can be done with that package, we will use on...
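Since the excerpt stops before the map itself, a minimal leaflet sketch with point markers; the deaths data frame and its lon/lat columns are hypothetical stand-ins for the John Snow coordinates read in the full post:

require(leaflet)
## hypothetical data frame of death locations, with columns lon and lat
deaths = data.frame(lon = c(-0.1367, -0.1362),
                    lat = c(51.5133, 51.5135))
m = leaflet(deaths)
m = addTiles(m)                        # OpenStreetMap background tiles
m = addCircleMarkers(m, lng = ~lon, lat = ~lat, radius = 4)
m                                      # renders the interactive map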
Another Interactive Map for the Cholera Dataset
Following my previous post, François (aka @FrancoisKeck) posted a comment mentioning another package I could use to get an interactive map, the rleafmap package. And the heatmap was easy to include here. This time, we do not use OpenStreetMap. The first part is still the same, to get the data,

> require(rleafmap)
> library(sp)
> library(rgdal)
...
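For completeness, a sketch of the rleafmap workflow as I understand it from the package's documentation (basemap, spLayer, writeMap); treat the style name and arguments as assumptions, and the two points as hypothetical stand-ins:

require(rleafmap)
library(sp)
## hypothetical points for the death locations (lon/lat, WGS84)
pts = SpatialPoints(cbind(c(-0.1367, -0.1362), c(51.5133, 51.5135)),
                    proj4string = CRS("+proj=longlat +datum=WGS84"))
bm = basemap("stamen.toner")  # background tiles (assumed style name)
dl = spLayer(pts)             # the points as an interactive layer
writeMap(bm, dl)              # renders the map in the browser/viewer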
Classification with Categorical Variables (the fuzzy side)
The Gaussian and the (log) Poisson regressions share a very interesting property: the average predicted value is the empirical mean of our sample.

> mean(predict(lm(dist~speed,data=cars)))
[1] 42.98
> mean(cars$dist)
[1] 42.98

One can prove that it is also the prediction for the average individual in our sample,

> predict(lm(dist~speed,data=...
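The same property can be checked for the log-Poisson case, which the excerpt asserts but does not show; the simulated count data below are my own illustration (the equality follows from the score equation for the intercept):

## with a log-link Poisson GLM, the average fitted value also matches
## the empirical mean of the response (simulated data, for illustration)
set.seed(1)
df = data.frame(x = rnorm(200))
df$y = rpois(200, exp(1 + df$x / 2))
fit = glm(y ~ x, data = df, family = poisson(link = "log"))
mean(predict(fit, type = "response"))
mean(df$y)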