Publications by Allan Engelhardt
Area Plots with Intensity Coloring
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it intrigued me enough to replicate it using base R graphics. The key technique is to draw a gradient line, which R does not support natively, so we have to roll our own code for that. Unfortunately, lines(..., type=...
4479 sym R (1679 sym/1 pcs) 26 img
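One way to approximate a gradient line in base R is to draw each piece of the line as its own segment with its own colour. The following is only a minimal sketch under that assumption, not the code from the post: `gradient_line` is a hypothetical helper, and the palette and the choice to colour by the mean height of each segment are illustrative.

```r
# Hypothetical helper (not from the original post): draw a line whose colour
# varies along its length by plotting each piece as a separate segment.
gradient_line <- function(x, y, palette = c("steelblue", "firebrick"), lwd = 2) {
  stopifnot(length(x) == length(y), length(x) >= 2)
  n <- length(x) - 1                        # number of segments
  # Colour each segment by the mean y value of its two end points.
  mid <- (head(y, -1) + tail(y, -1)) / 2
  rng <- range(mid)
  scaled <- if (diff(rng) > 0) (mid - rng[1]) / diff(rng) else rep(0, n)
  ramp <- colorRamp(palette)                # maps [0, 1] to RGB values in 0..255
  cols <- rgb(ramp(scaled), maxColorValue = 255)
  # segments() is vectorised, so one call draws all pieces.
  segments(head(x, -1), head(y, -1), tail(x, -1), tail(y, -1),
           col = cols, lwd = lwd)
  invisible(cols)
}

# Usage: an empty plot first, then the gradient line on top of it.
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x)
plot(x, y, type = "n")
cols <- gradient_line(x, y)
```

Using many short segments keeps the colour steps small enough to look continuous; a coarser `x` grid would show visible colour bands.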
Big data for R
Revolution Analytics recently announced their “big data” solution for R. This is great news and a lovely piece of work by the team at Revolution. However, if you want to replicate their analysis in standard R, then you can absolutely do so and we show you how. Data preparation: First you need to prepare the rather large data set that t...
3170 sym R (4659 sym/2 pcs) 24 img
Feature selection: All-relevant selection with the Boruta package
Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimal-optimal feature selection which identifies a small (ideally minimal) set...
6275 sym R (1663 sym/6 pcs) 10 img
Feature selection: All-relevant selection with the Boruta package
Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimal-optimal feature selection which identifies a small (ideally minimal) se...
8635 sym R (1722 sym/6 pcs) 32 img
Feature selection: Using the caret package
Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package, while in this post we consider the same (artificial, toy) examples using the caret package. Ma...
2216 sym R (4675 sym/9 pcs) 2 img
Feature selection: Using the caret package
Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package, while in this post we consider the same (artificial, toy) examples using the caret package. ...
5945 sym R (4645 sym/8 pcs) 24 img
Benchmarking feature selection with Boruta and caret
Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large da...
4664 sym R (4252 sym/5 pcs) 20 img 3 tbl
Getting started with the Heritage Health Prize competition
The US$3 million Heritage Health Prize competition is on, so we take a look at how to get started using the R statistical computing and analysis platform. We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data. Prerequisites: Obviously you need to have R insta...
1409 sym R (4538 sym/4 pcs) 14 img
Spreadsheet errors
For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130 MB spreadsheets (I had a dozen of them for one client). Excel is installed pretty much everywhere, so it is sometimes the only way to get started on getting commercial value out of the data in the organisation. But I don’t like i...
3292 sym R (769 sym/1 pcs) 18 img 1 tbl
Friday quote: the handmaiden and the whore
Because it is Friday and because we collect quotes. If mathematics is the handmaiden of science, statistics is the whore: all that scientists are looking for is a quick fix without the encumbrance of a meaningful relationship. Statisticians are second-class mathematicians, third-rate scientists and fourth-rate thinkers. They are the hyenas, jac...
1449 sym 16 img