Publications by Allan Engelhardt

Area Plots with Intensity Coloring

13.07.2010

I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line, which R does not support natively, so we have to roll our own code for that. Unfortunately, lines(..., type=...

4479 sym R (1679 sym/1 pcs) 26 img

Big data for R

05.08.2010

Revolution Analytics recently announced their “big data” solution for R. This is great news and a lovely piece of work by the team at Revolution. However, if you want to replicate their analysis in standard R, then you can absolutely do so, and we show you how. Data preparation: First you need to prepare the rather large data set that t...

3170 sym R (4659 sym/2 pcs) 24 img

Feature selection: All-relevant selection with the Boruta package

15.11.2010

Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimal-optimal feature selection which identifies a small (ideally minimal) set...

6275 sym R (1663 sym/6 pcs) 10 img

Feature selection: All-relevant selection with the Boruta package

15.11.2010

Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimal-optimal feature selection which identifies a small (ideally minimal) se...

8635 sym R (1722 sym/6 pcs) 32 img

Feature selection: Using the caret package

16.11.2010

Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package, while in this post we consider the same (artificial, toy) examples using the caret package. Ma...

2216 sym R (4675 sym/9 pcs) 2 img

Feature selection: Using the caret package

16.11.2010

Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package, while in this post we consider the same (artificial, toy) examples using the caret package. ...

5945 sym R (4645 sym/8 pcs) 24 img

Benchmarking feature selection with Boruta and caret

25.11.2010

Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large da...

4664 sym R (4252 sym/5 pcs) 20 img 3 tbl

Getting started with the Heritage Health Prize competition

08.04.2011

The US$ 3 million Heritage Health Prize competition is on, so we take a look at how to get started using the R statistical computing and analysis platform. We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data. Prerequisites: Obviously you need to have R insta...

1409 sym R (4538 sym/4 pcs) 14 img

Spreadsheet errors

20.04.2011

For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130 MB spreadsheets (I had a dozen of them for one client). Excel is installed pretty much everywhere, so it is sometimes the only way to start getting commercial value out of the data in the organisation. But I don’t like i...

3292 sym R (769 sym/1 pcs) 18 img 1 tbl

Friday quote: the handmaiden and the whore

19.08.2011

Because it is Friday and because we collect quotes. If mathematics is the handmaiden of science, statistics is the whore: all that scientists are looking for is a quick fix without the encumbrance of a meaningful relationship. Statisticians are second-class mathematicians, third-rate scientists and fourth-rate thinkers. They are the hyenas, jac...

1449 sym 16 img