Publications by Florian Privé
Whether to use a data frame in R?
In this post, I try to show you in which situations using a data frame is appropriate, and in which it’s not. Learn more with the Advanced R book. What is a data frame? A data frame is just a list of vectors of the same length, each vector being a column. This may convince you: str(iris) ## 'data.frame': 150 obs. of 5 variables: ## $ Sep...
2238 sym R (2695 sym/5 pcs)
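The claim in the excerpt above, that a data frame is a list of equal-length vectors, can be checked directly. This is a minimal sketch in base R, assuming only the built-in `iris` dataset the post itself uses; the variable names `n_cols` and `col_lengths` are illustrative and not taken from the post.

```r
# A data frame is stored as a list, with one element per column
is_a_list <- is.list(iris)
storage_type <- typeof(iris)

# Length of the list = number of columns
n_cols <- length(iris)

# Each element (column) is a vector, and all have the same length (the rows)
col_lengths <- lengths(iris)

# List-style extraction returns the same vector as the $ accessor
same_column <- identical(iris[["Species"]], iris$Species)

print(is_a_list)
print(storage_type)
print(n_cols)
print(col_lengths)
print(same_column)
```

Because the columns are separate vectors, each one can have its own type, which is the main difference from a matrix.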
Fast R functions to get first principal components
In this post, I compare different approaches to get first principal components of large matrices in R. Comparison library(bigstatsr) library(tidyverse) Data # Create two matrices, one with some structure, one without n <- 20e3 seq_m <- c(1e3, 3e3, 10e3) sizes <- seq_along(seq_m) X <- E <- list() for (i in sizes) { m <- seq_m[i] U <- matrix(...
1898 sym R (4570 sym/8 pcs) 2 img
Predicting height based on DNA mutations
In this post, I show some results of predicting height based on DNA mutations. This analysis aims to reproduce the analysis of this paper using my own analysis tools in R. I use a new dataset of 500,000 adults from the UK, genotyped at hundreds of thousands of DNA positions. This dataset is called the UK Biobank, and also provides some ...
2895 sym R (298 sym/1 pcs) 6 img
Choosing hyper-parameters in penalized regression
In this post, I’m evaluating some ways of choosing hyper-parameters (\(\alpha\) and \(\lambda\)) in penalized linear regression. The same principles can be applied to other types of penalized regressions (e.g. logistic). Model In penalized linear regression, we find regression coefficients \(\hat{\beta}_0\) and \(\hat{\beta}\) that minimize th...
5509 sym R (3004 sym/3 pcs) 4 img
Using clustering to find points in an image
In this post, I present my new package {img2coord}. This package can be used to retrieve coordinates from a scatter plot (as an image). devtools::install_github("privefl/img2coord") Have you ever made a plot, saved it as a png and moved on? When you come back to it, it is sometimes difficult to read the values from this plot, especially if there ...
3207 sym R (6597 sym/20 pcs) 30 img
Detecting outlier samples in PCA
In this post, I present something I am currently investigating (feedback welcome!) and that I am implementing in my new package {bigutilsr}. This package can be used to detect outlier samples in Principal Component Analysis (PCA). remotes::install_github("privefl/bigutilsr") library(bigutilsr) I present three different statistics of outlierness ...
7919 sym R (3930 sym/29 pcs) 40 img