Publications by Econometrics and Free Software
Building formulae
This Stack Overflow question made me think about how to build formulae. For example, you might want to programmatically build linear model formulae and then fit these models to data. Suppose the following (output suppressed): data(mtcars) lm(mpg ~ hp, data = mtcars) lm(mpg ~ I(hp^2), data = mtcars) lm(mpg ~ I(hp^3), data = mtcars) lm(m...
1381 sym R (2074 sym/5 pcs)
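As a rough illustration of the idea in the excerpt above, here is a minimal base R sketch that builds the formulae as strings and fits one model per formula. The names `degrees`, `rhs`, `formulae` and `models` are made up for this example, and stopping at degree 3 is an assumption; the original post may well take a different route.

```r
data(mtcars)

# Polynomial degrees of hp to try (1 to 3 is an illustrative choice)
degrees <- 1:3

# Right-hand sides: "hp", "I(hp^2)", "I(hp^3)"
rhs <- ifelse(degrees == 1, "hp", paste0("I(hp^", degrees, ")"))

# Turn each string into a formula object
formulae <- lapply(paste("mpg ~", rhs), as.formula)

# Fit one linear model per formula on the same data
models <- lapply(formulae, function(f) lm(f, data = mtcars))

# Look at the estimated coefficients of each model
lapply(models, coef)
```

Building the formulae as strings keeps the example short; the trade-off is that typos in variable names only surface once `lm()` is called.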
It’s lists all the way down
Today, I had the opportunity to help someone over at the R for Data Science Slack group (read more about this group here) and I thought that the question asked could make for an interesting blog post, so here it is! Disclaimer: the way I’m doing things here is totally not optimal, but I want to illustrate how to map functions over nested lists....
4747 sym R (16106 sym/10 pcs)
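To give a flavour of what mapping a function over a nested list looks like, here is a small sketch in base R rather than {purrr}. The list `nested` and its contents are invented for this example and are not the data from the post.

```r
# A hypothetical nested list: each group holds several numeric vectors
nested <- list(
  a = list(x = 1:3, y = 4:6),
  b = list(x = 7:9, y = 10:12)
)

# Map over the outer list, then over each inner list
means <- lapply(nested, function(inner) lapply(inner, mean))

# The result keeps the same nested shape as the input
str(means)
```

The key point is that the inner `lapply()` is itself the function being mapped by the outer one, which is why it really is lists all the way down.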
It’s lists all the way down, part 2: We need to go deeper
Shortly after my previous blog post, I saw this tweet on my timeline: The purrr resolution for 2018 – learn at least one purrr function per week – is officially launched with encouragement and inspiration from @statwonk and @hadleywickham. We start with modify_depth: https://t.co/dCMnSHP7Pl. Please join to learn and share. #rstats— Isabe...
5972 sym R (13920 sym/9 pcs) 2 img
Mapping a list of functions to a list of datasets with a list of columns as arguments
This week I had the opportunity to teach R at my workplace, again. This course was the “advanced R” course, and unlike the one I taught at the end of last year, I had one more day (so 3 days in total) where I could show my colleagues the joys of the tidyverse and R. To finish the section on programming with R, which was the very last section ...
2941 sym R (1631 sym/7 pcs)
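One possible way to sketch the idea in this title is base R's `Map()`, which walks several lists in parallel. The choice of functions, datasets and columns below is purely illustrative and is an assumption of this example, not what was shown in the course.

```r
# Three parallel lists: a function, a dataset and a column name per position
funs <- list(mean, median, max)
datasets <- list(mtcars, iris, airquality)
columns <- list("mpg", "Sepal.Length", "Temp")

# Apply the i-th function to the i-th column of the i-th dataset
results <- Map(function(f, d, col) f(d[[col]]), funs, datasets, columns)

results
```

`Map()` requires the three lists to line up position by position, so it is worth checking their lengths before running it on real data.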
Predicting job search by training a random forest on an unbalanced dataset
In this blog post, I am going to train a random forest on census data from the US to predict the probability that someone is looking for a job. To this end, I downloaded the US 1990 census data from the UCI Machine Learning Repository. Having a background in economics, I am always quite interested in such datasets. I downloaded the raw data which i...
8147 sym R (9050 sym/23 pcs) 10 img
Importing 30GB of data in R with sparklyr
Disclaimer: the first part of this blog post draws heavily from Working with CSVs on the Command Line, which is a beautiful resource that lists very nice tips and tricks to work with CSV files before having to load them into R, or any other statistical software. I highly recommend it! Also, if you find this interesting, also read Data Science at ...
5946 sym R (2941 sym/14 pcs) 2 img
Keep trying that API call with purrr::possibly()
Sometimes you need to call an API to get some result from a web service, but sometimes this call might fail. You might get an error 500 for example, or maybe you’re making too many calls too fast. Regarding this last point, I really encourage you to read Ethics in Web Scraping. In this blog post I will show you how you can keep trying to make t...
2450 sym R (863 sym/6 pcs)
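The retry idea can be sketched without any packages. The code below uses base R's `tryCatch()` as a stand-in for `purrr::possibly()`, so it is not the code from the post; `possibly_base()`, `retry()` and `flaky_call()` are hypothetical helpers written for this example, and the failing "API" is simulated.

```r
# Wrap a function so that an error returns a default value instead of stopping,
# similar in spirit to purrr::possibly() (this base R version is a stand-in)
possibly_base <- function(f, otherwise = NULL) {
  function(...) tryCatch(f(...), error = function(e) otherwise)
}

# Call f repeatedly until it returns something other than NULL, or give up
retry <- function(f, ..., max_tries = 5, wait = 0) {
  safe_f <- possibly_base(f, otherwise = NULL)
  for (i in seq_len(max_tries)) {
    result <- safe_f(...)
    if (!is.null(result)) return(result)
    Sys.sleep(wait)  # pause between attempts to avoid hammering the service
  }
  NULL  # every attempt failed
}

# Simulated flaky call: fails on the first two attempts, then succeeds
attempts <- 0
flaky_call <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("error 500")
  "ok"
}

response <- retry(flaky_call)
```

With a real web service, a non-zero `wait` is the polite choice, in line with the point about not making too many calls too fast.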
Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash
This is going to be the type of blog post that would perhaps be better as a gist, but it is easier for me to use my blog as my own personal collection of gists. Plus, someone else might find this useful, so here it is! In this blog post I am going to show a little trick to randomly sample rows from a text file using bash, and then train a model ...
5098 sym R (6210 sym/11 pcs)
Get basic summary statistics for all the variables in a data frame
I have added a new function to my {brotools} package, called describe(), which takes a data frame as an argument, and returns another data frame with descriptive statistics. It is very much inspired by the {skimr} package but also by assist::describe() (click on the packages to be redirected to the respective Github repos) but I wanted to write m...
1389 sym R (1890 sym/3 pcs)
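To make the "data frame in, data frame of statistics out" idea concrete, here is a minimal base R sketch. `describe_numeric()` is a hypothetical function written for this example; it is not the `describe()` from {brotools}, whose output and arguments are not shown in the excerpt.

```r
# Return one row of summary statistics per numeric column of df
describe_numeric <- function(df) {
  # Keep only the numeric columns
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]

  data.frame(
    variable = names(numeric_cols),
    mean = vapply(numeric_cols, mean, numeric(1), na.rm = TRUE),
    sd = vapply(numeric_cols, sd, numeric(1), na.rm = TRUE),
    n_missing = vapply(numeric_cols, function(x) sum(is.na(x)), integer(1)),
    row.names = NULL
  )
}

iris_summary <- describe_numeric(iris)
iris_summary
```

Returning a data frame rather than printing a report means the result can be filtered, joined or plotted like any other data.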
Imputing missing values in parallel using {furrr}
Today I saw this tweet on my timeline: For those of us that just can't wait until RStudio officially supports parallel purrr in #rstats, boy have I got something for you. Introducing `furrr`, parallel purrr through the use of futures. Go ahead, break things, you know you want to: https://t.co/l9z1UC2Tew — Davis Vaughan (@dvaughan32) April 13, 201...
5397 sym R (15355 sym/16 pcs)