Publications by Arthur Charpentier

Combining automatically factor levels with trees

03.10.2019

Last year, in a post, I discussed how to merge levels of factor variables, using combinatorial techniques (it was for my STT5100 course, and trees are not in the syllabus), with an extension on trees at the end of the post. Consider the following (simulated) dataset: n=200 set.seed(1) x1=runif(n) x2=runif(n) y=1+2*x1-x2+rnorm(n,0,.2) LB=sample...

1628 sym R (1290 sym/6 pcs) 4 img

On the conjugate function

13.01.2020

In the MAT7381 course (graduate course on regression models), we will talk about optimization, and a classical tool is the so-called conjugate. Given a function \(f:\mathbb{R}^p\to\mathbb{R}\), its conjugate is the function \(f^{\star}:\mathbb{R}^p\to\mathbb{R}\) such that \(f^{\star}(\boldsymbol{y})=\max_{\boldsymbol{x}}\lbrace\boldsymbol{x}^\top\bol...
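As a small illustration of the definition, here is a sketch that evaluates a conjugate numerically in dimension one. It assumes the usual Legendre-Fenchel form \(f^{\star}(y)=\max_x\lbrace xy-f(x)\rbrace\) (the excerpt above is cut off at that point), and the helper name `conjugate_1d` and the search interval are choices made for this example only. For \(f(x)=x^2/2\) the conjugate is known in closed form, \(f^{\star}(y)=y^2/2\), which gives something to compare against.

```r
# Hypothetical helper: numerical conjugate of a univariate function f at y,
# assuming f*(y) = max_x { x*y - f(x) } and that the maximum lies in `interval`
conjugate_1d = function(f, y, interval = c(-10, 10)) {
  optimize(function(x) x * y - f(x), interval = interval, maximum = TRUE)$objective
}

f = function(x) x^2 / 2          # its conjugate is y^2 / 2
y_grid = c(-2, -1, 0, 1, 2)
f_star = sapply(y_grid, function(y) conjugate_1d(f, y))

# numerical values next to the closed form
cbind(y = y_grid, numerical = f_star, closed_form = y_grid^2 / 2)
```

The one-dimensional search is only meant to make the definition concrete; in dimension \(p\) one would replace `optimize` with a multivariate optimizer.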

4027 sym 10 img

On Cochran Theorem (and Orthogonal Projections)

15.01.2020

Cochran Theorem – from The distribution of quadratic forms in a normal system, with applications to the analysis of covariance published in 1934 – is probably the most important one in a regression course. It is an application of a nice result on quadratic forms of Gaussian vectors. More precisely, we can prove that if \(\boldsymbol{Y}\sim\mathc...

6457 sym 6 img

Quantile Regression (home made, part 2)

17.02.2020

A few months ago, I posted a note with some home-made code for quantile regression… there was something odd in the output, but it was because there was a (small) mathematical problem in my equation. So since I should teach those tomorrow, let me fix them. Median. Consider a sample \(\{y_1,\cdots,y_n\}\). To compute the median, solve \(\min_\mu \...
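To make the median step concrete, here is a minimal sketch. It assumes the standard characterization of the median as the minimizer of \(\sum_i |y_i-\mu|\) (the excerpt is truncated before the objective is written out), and it uses a generic one-dimensional optimizer rather than whatever solver the post itself relies on. The simulated sample and the name `abs_loss` are illustrative.

```r
set.seed(1)
y = rnorm(101)                       # odd sample size, so the median is unique

# sum of absolute deviations around a candidate value mu
abs_loss = function(mu) sum(abs(y - mu))

# minimize over the range of the sample and compare with the empirical median
mu_hat = optimize(abs_loss, interval = range(y))$minimum
c(optimize = mu_hat, median = median(y))
```

The two values agree up to the tolerance of `optimize`; the same idea with an asymmetric loss would give other quantiles.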

3771 sym R (1883 sym/10 pcs) 2 img

Lasso Regression (home made)

17.02.2020

To compute Lasso regression, that is, to minimize \(\frac{1}{2}\|\mathbf{y}-\mathbf{X}\mathbf{\beta}\|_{\ell_2}^2+\lambda\|\mathbf{\beta}\|_{\ell_1}\), define the soft-thresholding function \(S(z,\gamma)=\text{sign}(z)\cdot(|z|-\gamma)_+=\begin{cases}z-\gamma&\text{ if }\gamma<|z|\text{ and }z>0\\z+\gamma&\text{ if }\gamma<|z|\text{ and }z<0\\0&\text{ if }\gamma\geq|z|\end{cases}\) soft_thresholding = function(x,a){ sign(x) * p...
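The role of the soft-thresholding function can be checked on a toy case. The sketch below is not the post's code (which is cut off above): `soft_threshold_sketch` is a name chosen here, written directly from \(S(z,\gamma)=\text{sign}(z)\cdot(|z|-\gamma)_+\). It then uses the fact that, with a single predictor scaled so that \(\mathbf{x}^\top\mathbf{x}=1\), the minimizer of \(\frac{1}{2}\|\mathbf{y}-\mathbf{x}\beta\|^2+\lambda|\beta|\) is \(S(\mathbf{x}^\top\mathbf{y},\lambda)\), and compares that with a brute-force numerical minimization. The simulated data and \(\lambda=0.5\) are arbitrary.

```r
# S(z, gamma) = sign(z) * (|z| - gamma)_+, vectorized in z
soft_threshold_sketch = function(z, gamma) {
  sign(z) * pmax(abs(z) - gamma, 0)
}

# values below gamma in absolute value are set to zero, others shrink toward zero
soft_threshold_sketch(c(-3, -0.5, 0, 0.5, 3), gamma = 1)

# toy check with one predictor normalized so that sum(x^2) = 1
set.seed(1)
x = rnorm(50)
x = x / sqrt(sum(x^2))
y = 2 * x + rnorm(50, 0, .1)
lambda = 0.5

lasso_objective = function(b) 0.5 * sum((y - x * b)^2) + lambda * abs(b)
b_numeric = optimize(lasso_objective, interval = c(-10, 10))$minimum
b_closed  = soft_threshold_sketch(sum(x * y), lambda)
c(numeric = b_numeric, closed_form = b_closed)
```

With several predictors, the same function is typically applied coordinate by coordinate inside a coordinate descent loop.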

1997 sym R (1543 sym/6 pcs) 2 img

Testing for a causal effect (with 2 time series)

19.02.2020

A few days ago, I came back to a sentence I found (in a French newspaper), where someone was claiming that “… an old variable explains 85% of the change in a new variable. So we can talk about causality” and I tried to explain that it was just stupid: if we consider the regression of the temperature on day \(t+1\) against the number of cyc...

4809 sym R (2700 sym/13 pcs) 4 img

Function basis and regression

01.03.2020

In the first part of the course on linear models, we’ve seen how to construct a linear model when the vector of covariates \(\boldsymbol{x}\) is given, so that \(\mathbb{E}(Y|\boldsymbol{X}=\boldsymbol{x})\) is either simply \(\boldsymbol{x}^\top\boldsymbol{\beta}\) (for standard linear models) or a functional of \(\boldsymbol{x}^\top\boldsymbo...

7077 sym R (4878 sym/12 pcs) 16 img

Modeling pandemics (1)

19.03.2020

The most popular model for epidemics is the so-called SIR model – or Kermack-McKendrick. Consider a population of size \(N\), and assume that \(S\) is the number of susceptible, \(I\) the number of infectious, and \(R\) the number of recovered (or immune) individuals, \(\displaystyle {\begin{aligned}&{\frac {dS}{dt}}=-{\frac {\beta IS}{N}...
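For readers who want to see the dynamics, here is a rough sketch of a simulation. Only the first equation is visible in the excerpt; the other two are assumed here to take the textbook form \(dI/dt=\beta IS/N-\gamma I\) and \(dR/dt=\gamma I\). The function name `sir_euler`, the explicit Euler scheme, and the parameter values (\(\beta=0.5\), \(\gamma=0.1\), \(N=1000\), one initial case) are all illustrative choices, not taken from the post.

```r
# Explicit Euler discretization of the (assumed) textbook SIR system
sir_euler = function(beta, gamma, N, I0, days, dt = 0.01) {
  steps = round(days / dt)
  S = numeric(steps + 1); I = numeric(steps + 1); R = numeric(steps + 1)
  S[1] = N - I0; I[1] = I0; R[1] = 0
  for (k in 1:steps) {
    new_infections = beta * I[k] * S[k] / N * dt   # flow from S to I
    new_recoveries = gamma * I[k] * dt             # flow from I to R
    S[k + 1] = S[k] - new_infections
    I[k + 1] = I[k] + new_infections - new_recoveries
    R[k + 1] = R[k] + new_recoveries
  }
  data.frame(t = (0:steps) * dt, S = S, I = I, R = R)
}

out = sir_euler(beta = 0.5, gamma = 0.1, N = 1000, I0 = 1, days = 100)

range(out$S + out$I + out$R)     # total population stays at N up to rounding
out$t[which.max(out$I)]          # time of the epidemic peak
```

A dedicated ODE solver would be more accurate than an Euler scheme; the loop is used here only because it keeps every step of the three flows visible.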

2727 sym R (1514 sym/9 pcs) 6 img

Modeling pandemics (2)

20.03.2020

When introducing the SIR model, in our initial post, we got an ordinary differential equation, but we did not really discuss stability and periodicity. It has to do with the Jacobian matrix of the system. But first of all, we had three equations for three functions, but actually \(\displaystyle{{\frac{dS}{dt}}+{\frac {dI}{dt}}+{\frac {dR}{dt}}=0}\...

2381 sym R (1013 sym/7 pcs) 6 img

Modeling Pandemics (3)

20.03.2020

In Statistical Inference in a Stochastic Epidemic SEIR Model with Control Intervention, a more complex model than the one we saw yesterday was considered (and is called the SEIR model). Consider a population of size \(N\), and assume that \(S\) is the number of susceptible, \(E\) the number of exposed, \(I\) the number of infectious, and \(...

3536 sym R (1186 sym/4 pcs) 4 img