Publications by Andrew Treadway
Does “Sell in May, Go Away” really work?
If you follow the stock market, you’ve probably heard the expression “Sell in May, Go Away.” This expression refers to the belief that the stock market rises between the end of October and the end of April, so one should sell at the beginning of May to avoid losses. The general recommendation according to the theory is to...
6566 sym R (1900 sym/12 pcs) 10 img
How to hide a password in R with the keyring package
This post will introduce using the keyring package to hide a password. Short background The keyring package is a library designed to let you access your operating system’s credential store. In essence, it lets you store and retrieve passwords in your operating system, which allows you to avoid having a password in plaintext in an R script. Sto...
2785 sym R (299 sym/2 pcs) 2 img
Testing the Collatz Conjecture with R
Background The Collatz Conjecture is a famous unsolved problem in number theory. If you’re not familiar with it, the conjecture is very simple to understand, yet no one has been able to prove mathematically that it is true (though it has been shown to hold for an enormous number of cases). The conjecture states the following...
4071 sym R (1379 sym/6 pcs) 12 img
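As a rough illustration of what testing the conjecture on a range of numbers can look like, here is a minimal sketch in Python (not the post's R code). It assumes the standard statement of the Collatz rule: halve an even number, map an odd number n to 3n + 1, and repeat until reaching 1. The name `collatz_steps` and the `max_steps` cap are hypothetical choices made for this example.

```python
def collatz_steps(n, max_steps=10_000):
    """Count the steps n takes to reach 1 under the Collatz rule.

    Returns None if 1 is not reached within max_steps (an arbitrary,
    hypothetical cap so the loop always terminates).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None
        # Even numbers are halved; odd numbers map to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


# Check every starting value from 1 to 10,000.
all_reach_one = all(collatz_steps(n) is not None for n in range(1, 10_001))
```

A check like this only confirms the conjecture for the values tested; it says nothing about numbers outside that range.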
BeautifulSoup vs. Rvest
This post will compare Python’s BeautifulSoup package to R’s rvest package for web scraping. We’ll also talk about additional functionality in rvest (that doesn’t exist in BeautifulSoup) in comparison to a couple of other Python packages (including pandas and RoboBrowser). Getting started BeautifulSoup and rvest both involve creating an...
5367 sym R (2204 sym/21 pcs) 2 img
Really large numbers in R
This post will discuss ways of handling huge numbers in R using the gmp package. The gmp package The gmp package provides us a way of dealing with really large numbers in R. For example, let’s suppose we want to multiply 10^250 by itself. Mathematically we know the result should be 10^500. But if we try this calculation in base R we get Inf fo...
3479 sym R (876 sym/12 pcs) 16 img
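The overflow described in this excerpt is a property of double-precision floating point in general, not of R specifically. The sketch below reproduces it in Python and is only an analogy: it uses Python's built-in arbitrary-precision integers as a stand-in for the exact arithmetic that gmp provides in R, and it does not show the gmp API. The variable names are illustrative.

```python
import math

# Double-precision floats top out near 1.8e308, so this product overflows.
float_product = 1e250 * 1e250

# Arbitrary-precision integers keep every digit of the exact result.
exact_product = 10**250 * 10**250

print(math.isinf(float_product))   # the float result is infinite
print(exact_product == 10**500)    # the integer result is exact
```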
How to get an AUC confidence interval
Background AUC is an important metric in machine learning for classification. It is often used as a measure of a model’s performance. In effect, AUC is a value between 0 and 1 that measures how well a model rank-orders its predictions. For a detailed explanation of AUC, see this link. Since AUC is widely used, being able to get a...
2525 sym R (1374 sym/7 pcs) 2 img
mapply and Map in R
An older post on this blog talked about several alternative base apply functions. This post will talk about how to apply a function across multiple vectors or lists with Map and mapply in R. These functions are generalizations of sapply and lapply, which allow you to more easily loop over multiple vectors or lists simultaneously. Map Suppose we...
2659 sym R (345 sym/4 pcs) 6 img
How to import Python classes into R
Background This post is going to talk about how to import Python classes into R, which can be done using a really awesome package in R called reticulate. reticulate allows you to call Python code from R, including sourcing Python scripts, using Python packages, and porting functions and classes. To install reticulate, we can run: install.packag...
2880 sym R (1788 sym/7 pcs) 6 img
Evaluate your R model with MLmetrics
This post will explore using R’s MLmetrics to evaluate machine learning models. MLmetrics provides several functions to calculate common metrics for ML models, including AUC, precision, recall, accuracy, etc. Building an example model Firstly, we need to build a model to use as an example. For this post, we’ll be using a dataset on pulsar s...
3848 sym R (1118 sym/8 pcs) 4 img
How is information gain calculated?
This post will explore the mathematics behind information gain. We’ll start with the basic intuition behind information gain, then explain why it is calculated the way it is. What is information gain? Information gain is a measure frequently used in decision trees to determine which variable to split the input dataset on at each step i...
4106 sym R (411 sym/1 pcs) 2 img 1 tbl
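For readers who want a concrete picture of the quantity, here is a minimal Python sketch. It assumes the common entropy-based definition (entropy of the parent node minus the size-weighted entropy of the child nodes), which the excerpt above does not spell out. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted


labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # each child is pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children mirror the parent

gain_perfect = information_gain(labels, perfect_split)
gain_useless = information_gain(labels, useless_split)
```

Under this definition, a split that separates the classes completely recovers all of the parent's entropy, while a split that leaves the class mix unchanged gains nothing.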