Publications by Arthur Charpentier
Automatic Detection of the Language of a Tweet
Two days ago, in my post on automatically extracting my own tweets and generating an HTML list, I mentioned that it would be great to have a function that could distinguish tweets in English from tweets in French (I usually tweet in one of those two languages). And once again, @3wen came to my rescue! In my previous post, I ...
2761 sym R (3295 sym/8 pcs)
Modeling Incomes and Inequalities
Last week, in our Inequality course, we’ve been looking at data. We started with some simulated data, only a few of them > library("ineq") > load(url("http://freakonometrics.free.fr/income_5.RData")) > (income=sort(income)) [1] 19233 23707 53297 61667 218662 How could we say that there is inequality in this sample? If we look at the wealth ...
3666 sym R (6225 sym/25 pcs) 22 img
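As a minimal sketch of how inequality in the five-income sample above could be quantified, the Gini index and the cumulative income shares (the points of the Lorenz curve) can be computed in base R. The helper `gini_index` is a hypothetical name introduced here for illustration, not a function from the post or from the `ineq` package.

```r
# The five simulated incomes shown in the excerpt, already sorted
income <- c(19233, 23707, 53297, 61667, 218662)

# Hypothetical helper (not from the post): Gini index computed as the
# mean absolute difference over all pairs, divided by twice the mean
gini_index <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

# Cumulative share of total income held by the poorest 1, 2, ..., 5 people
lorenz_shares <- cumsum(income) / sum(income)

gini_index(income)        # about 0.464
round(lorenz_shares, 3)
```

A value of 0 would mean that everyone has the same income; the closer the index is to 1, the more the total is concentrated in a few hands.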
Inequalities and Quantile Regression
In the course on inequality measures, we’ve seen how to compute various (standard) inequality indices, based on some sample of incomes (that can be binned, in various categories). On Thursday, we discussed the fact that incomes can be related to different variables (e.g. experience), and that comparing income inequalities between countries can be...
1783 sym R (552 sym/6 pcs) 8 img
k-means clustering and Voronoi sets
In the context of k-means, we want to partition the space of our observations into k classes: each observation belongs to the cluster with the nearest mean. Here “nearest” is in the sense of some norm, usually the Euclidean norm. Consider the case where we have 2 classes, the means being respectively the 2 black dots. If we partition based ...
2260 sym R (831 sym/5 pcs) 42 img
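The nearest-mean rule described in the k-means excerpt above can be sketched in a few lines of base R. The observations and the two means below are hypothetical values chosen for illustration; they are not taken from the post. The regions of the plane assigned to each mean in this way are the Voronoi cells of the means.

```r
# Hypothetical observations (one per row) and two hypothetical cluster means
obs   <- rbind(c(0.1, 0.2), c(0.2, 0.1), c(0.8, 0.9), c(0.9, 0.8), c(0.4, 0.45))
means <- rbind(c(0.2, 0.2), c(0.8, 0.8))

# Squared Euclidean distance from every observation to every mean:
# one row per observation, one column per mean
sq_dist <- sapply(seq_len(nrow(means)), function(j) {
  rowSums(sweep(obs, 2, means[j, ])^2)
})

# Each observation is assigned to the cluster whose mean is nearest
cluster <- max.col(-sq_dist, ties.method = "first")
cluster
```

This is only the assignment step; a full k-means run would alternate it with recomputing each mean from the observations assigned to it.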
Visualizing Clusters
Consider the following dataset, with (only) ten points x=c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) y=c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3) plot(x,y,pch=19,cex=2) We want to get – say – two clusters. Or more specifically, two sets of observations, each of them sharing some similarities. Since the number of observations is rather small, it is actua...
1941 sym R (1730 sym/12 pcs) 30 img
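One possible way to split the ten points above into two groups is hierarchical clustering, sketched here with base R. The choice of complete linkage and of Euclidean distance is an assumption made for this illustration, and may differ from what the post goes on to do.

```r
# The ten points from the excerpt
x <- c(.4, .55, .65, .9, .1, .35, .5, .15, .2, .85)
y <- c(.85, .95, .8, .87, .5, .55, .5, .2, .1, .3)

# Assumption: Euclidean distances and complete linkage
hc <- hclust(dist(cbind(x, y)), method = "complete")

# Cut the dendrogram so that exactly two groups remain
groups <- cutree(hc, k = 2)

# Show the points, colored by group (red and blue)
plot(x, y, pch = 19, cex = 2, col = c("red", "blue")[groups])
```

Calling `plot(hc)` instead would display the dendrogram, which shows the order in which points and groups were merged.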
John Snow, and OpenStreetMap
While I was preparing a training session on data visualization, I wanted to get a nice visual for John Snow’s cholera dataset. This dataset can actually be found in a great package of famous historical datasets. library(HistData) data(Snow.deaths) data(Snow.streets) One can easily visualize the deaths, on a simplified map, with the streets (here simp...
1517 sym R (955 sym/8 pcs) 12 img
John Snow, and Google Maps
In my previous post, I discussed how to use OpenStreetMap (and standard plotting functions of R) to visualize John Snow’s dataset. But it is also possible to use Google Maps (and ggplot2-style graphs). library(ggmap) get_london <- get_map(c(-.137,51.513), zoom=17) london <- ggmap(get_london) Again, the tricky part comes from the fact that t...
977 sym R (948 sym/5 pcs) 4 img
Supervised Classification, Logistic and Multinomial
We will start, in our Data Science course, to discuss classification techniques (in the context of supervised models). Consider the following case, with 10 points, and two classes (red and blue) > clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1)) > clr2 <- c(rgb(1,0,0,.2),rgb(0,0,1,.2)) > x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) > y <- c(.85,.95,.8,.87,.5...
2090 sym R (2981 sym/12 pcs) 26 img
Supervised Classification, discriminant analysis
Another popular technique for classification (or at least, one that used to be popular) is (linear) discriminant analysis, introduced by Ronald Fisher in 1936. Consider the same dataset as in our previous post > clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1)) > x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) > y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3) > z <- c(...
2990 sym R (5545 sym/19 pcs) 56 img
Supervised Classification, beyond the logistic
In our data-science class, after discussing limitations of logistic regression, e.g. the fact that the decision boundary was a straight line, we’ve mentioned possible natural extensions. Let us consider our (now) standard dataset clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1)) clr2 <- c(rgb(1,0,0,.2),rgb(0,0,1,.2)) x <- c(.4,.55,.65,.9,.1,.35,...
1520 sym R (2237 sym/9 pcs) 20 img