Publications by Econometrics and Free Software

Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 1

02.03.2019

Can I ever get enough of historical newspaper data? It seems I can’t. I have already written four blog posts (1, 2, 3 and 4), but there’s still a lot to explore. This blog post uses a new batch of data announced on Twitter: For all who love to analyse text, the BnL released half a million of processed newspaper articles. Historical news from 1841-1878...

7519 sym R (8654 sym/13 pcs) 4 img

Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 2

04.03.2019

In part 1 of this series I set up Vowpal Wabbit to classify newspaper content. Now, let’s use the model to make predictions and see whether and how we can improve it. Then, let’s train the model on the whole data. Step 1: prepare the data. The first step consists in importing the test data and preparing it. The test data need not be large ...

3448 sym R (8052 sym/13 pcs) 4 img

Pivoting data frames just got easier thanks to `pivot_wide()` and `pivot_long()`

19.03.2019

There’s a lot going on in the development version of {tidyr}. New functions for pivoting data frames, pivot_wide() and pivot_long(), are coming, and will replace the current functions, spread() and gather(). spread() and gather() will remain in the package though: You may have heard a rumour that gather/spread are going away. This is simply not ...

4380 sym R (14127 sym/16 pcs) 4 img

Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick}

30.03.2019

In this blog post I’m going to show you how to extract text from scanned pdf files, or pdf files where no text recognition was performed. (For pdfs where text recognition was performed, you can read my other blog post). The pdf I’m going to use can be downloaded from here. It’s a poem titled D’Léierchen (Dem Léiweckerche säi Lidd...

4692 sym R (4364 sym/15 pcs) 12 img

Historical newspaper scraping with {tesseract} and R

06.04.2019

I have been playing around with historical newspaper data for some months now. The “obvious” type of analysis to do is NLP, but there is also a lot of numerical data inside historical newspapers. For instance, you can find tables that show the market prices of the day in L’Indépendance Luxembourgeoise: I wanted to see how easy ...

7506 sym R (14266 sym/17 pcs) 28 img
