Publications by Econometrics and Free Software

Predicting job search by training a random forest on an unbalanced dataset

10.02.2018

In this blog post, I am going to train a random forest on census data from the US to predict the probability that someone is looking for a job. To this end, I downloaded the US 1990 census data from the UCI Machine Learning Repository. Having a background in economics, I am always quite interested in such datasets. I downloaded the raw data which i...

8147 sym R (9050 sym/23 pcs) 10 img

Importing 30GB of data in R with sparklyr

15.02.2018

Disclaimer: the first part of this blog post draws heavily from Working with CSVs on the Command Line, which is a beautiful resource that lists very nice tips and tricks for working with CSV files before having to load them into R, or any other statistical software. I highly recommend it! If you find this interesting, also read Data Science at ...

5946 sym R (2941 sym/14 pcs) 2 img

Keep trying that api call with purrr::possibly()

02.03.2018

Sometimes you need to call an API to get some result from a web service, but this call might fail. You might get an error 500 for example, or maybe you’re making too many calls too fast. Regarding this last point, I really encourage you to read Ethics in Web Scraping. In this blog post I will show you how you can keep trying to make t...

2450 sym R (863 sym/6 pcs)
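
The excerpt above is cut off before the post's own code, so what follows is only a minimal sketch of the general idea, not the author's implementation. It assumes the {purrr} package is installed; `flaky_call()` and `keep_trying()` are hypothetical names made up for illustration, and the simulated failure stands in for a real web request.

```r
library(purrr)

# Hypothetical stand-in for a real API call: it fails on the first
# two attempts and succeeds on the third.
attempts <- 0
flaky_call <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("simulated error 500")
  "ok"
}

# possibly() wraps a function so that it returns `otherwise`
# instead of raising an error.
safe_call <- possibly(flaky_call, otherwise = NULL)

# Hypothetical helper: call f() until it returns something non-NULL,
# at most max_tries times, pausing between attempts so the service
# is not hit too fast.
keep_trying <- function(f, max_tries = 5, wait = 0.1) {
  for (i in seq_len(max_tries)) {
    result <- f()
    if (!is.null(result)) return(result)
    Sys.sleep(wait)
  }
  NULL
}

result <- keep_trying(safe_call)
```

Capping the number of attempts and sleeping between them is a deliberate choice here: an unbounded loop with no pause would keep hammering a service that is already struggling.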

Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash

02.03.2018

This is going to be the type of blog post that would perhaps be better as a gist, but it is easier for me to use my blog as my own personal collection of gists. Plus, someone else might find this useful, so here it is! In this blog post I am going to show a little trick to randomly sample rows from a text file using bash, and then train a model ...

5098 sym R (6210 sym/11 pcs)

Get basic summary statistics for all the variables in a data frame

09.04.2018

I have added a new function to my {brotools} package, called describe(), which takes a data frame as an argument, and returns another data frame with descriptive statistics. It is very much inspired by the {skimr} package but also by assist::describe() (click on the packages to be redirected to the respective Github repos) but I wanted to write m...

1389 sym R (1890 sym/3 pcs)
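
To make the "data frame in, data frame of statistics out" idea concrete, here is a rough base R sketch. It is not the {brotools} implementation: `describe_sketch()` is a hypothetical name, the choice of statistics is an assumption, and only numeric columns are handled.

```r
# Hypothetical helper, not brotools::describe(): returns one row of
# summary statistics per numeric column of the input data frame.
describe_sketch <- function(df) {
  # Keep only the numeric columns
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    variable  = names(numeric_cols),
    mean      = vapply(numeric_cols, mean, numeric(1), na.rm = TRUE),
    sd        = vapply(numeric_cols, sd, numeric(1), na.rm = TRUE),
    min       = vapply(numeric_cols, min, numeric(1), na.rm = TRUE),
    max       = vapply(numeric_cols, max, numeric(1), na.rm = TRUE),
    n_missing = vapply(numeric_cols, function(x) sum(is.na(x)), integer(1)),
    row.names = NULL
  )
}

# mtcars ships with base R, so this runs without extra packages
stats <- describe_sketch(mtcars)
```

Returning a data frame rather than printing a report means the result can be filtered, joined or plotted like any other data.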

Imputing missing values in parallel using {furrr}

13.04.2018

Today I saw this tweet on my timeline: For those of us that just can't wait until RStudio officially supports parallel purrr in #rstats, boy have I got something for you. Introducing `furrr`, parallel purrr through the use of futures. Go ahead, break things, you know you want to: https://t.co/l9z1UC2Tew - Davis Vaughan (@dvaughan32) April 13, 201...

5397 sym R (15355 sym/16 pcs)

{pmice}, an experimental package for missing data imputation in parallel using {mice} and {furrr}

14.04.2018

Yesterday I wrote this blog post which showed how one could use {furrr} and {mice} to impute missing data in parallel, thus speeding up the process tremendously. To make using this snippet of code easier, I quickly cobbled together an experimental package called {pmice} that you can install from Github: devtools::install_github("b-rodrigues/pmic...

955 sym R (45 sym/1 pcs)

Getting data from pdfs using the pdftools package

09.06.2018

It is often the case that data is trapped inside pdfs, but thankfully there are ways to extract it from the pdfs. A very nice package for this task is pdftools (Github link) and this blog post will describe some basic functionality from that package. First, let’s find some pdfs that contain interesting data. For this post, I’m using the diabe...

4574 sym R (14721 sym/14 pcs) 6 img

Forecasting my weight with R

23.06.2018

I’ve been measuring my weight almost daily for almost 2 years now; I actually started earlier, but not as consistently. The goal of this blog post is to get reacquainted with time series; I haven’t had the opportunity to work with time series for a long time now and I have seen that quite a few packages that deal with time series have been ...

6508 sym R (9948 sym/30 pcs) 8 img

Missing data imputation and instrumental variables regression: the tidy approach

30.06.2018

In this blog post I will discuss missing data imputation and instrumental variables regression. This is based on a short presentation I will give at my job. You can find the data used here on this website: http://eclr.humanities.manchester.ac.uk/index.php/IV_in_R The data used is from Wooldridge’s book, Econometrics: A Modern Approach. You c...

7156 sym R (8573 sym/24 pcs) 8 img