Publications by chris2016
Bias in high dimensional optimism corrected bootstrap procedure
I have been working in high dimensional analysis to predict drug response in rheumatoid arthritis patients, and I was concerned to find that the procedure called optimism corrected bootstrapping over-fits as p (the number of features) increases. Optimism corrected bootstrapping is a way of trying to estimate the overfitting error of a dataset by resampling...
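The bias the post describes can be reproduced in a few lines of base R. This is a minimal sketch of the optimism corrected bootstrap on pure-noise data, not the post's own implementation; the model (linear regression), metric (R-squared), and all variable names are illustrative.

```r
## Minimal sketch of the optimism corrected bootstrap on pure-noise data.
set.seed(1)
n <- 100; p <- 50                        # small n, many noise features
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                            # labels unrelated to X

## Apparent performance: fit and evaluate on the same data.
fit      <- lm(y ~ ., data = data.frame(X, y = y))
apparent <- summary(fit)$r.squared

## Optimism: train on a bootstrap sample, compare its apparent fit
## with its fit on the original data, then average the difference.
B <- 20
optimism <- mean(replicate(B, {
  idx       <- sample(n, replace = TRUE)
  bfit      <- lm(y ~ ., data = data.frame(X[idx, ], y = y[idx]))
  boot_app  <- summary(bfit)$r.squared
  pred      <- predict(bfit, newdata = data.frame(X))
  boot_orig <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)
  boot_app - boot_orig
}))

corrected <- apparent - optimism
## With many noise features the corrected estimate tends to stay
## well above zero even though y is pure noise.
```

Because y is random, any honestly corrected performance estimate should be near zero; with p this large, the corrected estimate typically is not, which is the bias under discussion.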
Forecasting the price of bitcoin with the CRAN forecast package
There is interest in bitcoin at the moment because it is displaying signs of steady year-to-year growth, with brief boosts followed by rapid declines. It is considered a risky investment by investors, yet it has the potential for high returns over a fairly short duration (1-2 years). John McAfee, inventor of McAfee antivirus software, has reasoned that...
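The basic workflow with the forecast package can be sketched as below. The series here is simulated as a stand-in (the post uses real bitcoin prices), and the horizon of 30 steps is arbitrary.

```r
## Sketch of forecasting a price-like series with the forecast package.
library(forecast)
set.seed(42)
price <- ts(cumsum(rnorm(200, mean = 0.5)))  # simulated stand-in series
fit   <- auto.arima(price)                   # automatic ARIMA model selection
fc    <- forecast(fit, h = 30)               # 30-step-ahead forecast
plot(fc)                                     # point forecasts with intervals
```

`auto.arima()` searches over ARIMA orders by information criterion, so no manual model identification is needed for a quick baseline.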
How to perform consensus clustering without overfitting and reject the null hypothesis
The Monti et al. (2003) consensus clustering algorithm is one of the most widely used class discovery techniques in the genome sciences and is commonly used to cluster transcriptomic, epigenetic, proteomic, and a range of other types of data. It can automatically decide the number of classes (K) by resampling the data and, for each K (e.g. 2-10), ...
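The core of the Monti et al. resampling idea fits in a short loop. This is a toy sketch with k-means on null data, purely to show how the consensus matrix is built; the resample fraction (80%), B, and K are illustrative, and the post evaluates fuller implementations.

```r
## Minimal sketch of Monti-style consensus clustering for one K.
set.seed(7)
X <- matrix(rnorm(60 * 5), 60, 5)           # 60 samples, no real structure
n <- nrow(X); K <- 3; B <- 50
co   <- matrix(0, n, n)                     # times i,j clustered together
both <- matrix(0, n, n)                     # times i,j resampled together
for (b in 1:B) {
  idx  <- sample(n, size = round(0.8 * n))  # 80% resample of the samples
  cl   <- kmeans(X[idx, ], centers = K, nstart = 5)$cluster
  same <- outer(cl, cl, "==") * 1           # 1 if pair co-clustered
  both[idx, idx] <- both[idx, idx] + 1
  co[idx, idx]   <- co[idx, idx] + same
}
consensus <- co / pmax(both, 1)             # consensus matrix for this K
```

On genuinely clustered data, consensus values concentrate near 0 and 1; on null data like this they scatter in between, which is the behaviour a null-hypothesis test for K can exploit.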
Simulating NXN dimensional Gaussian clusters in R
Gaussian clusters are found in a range of fields, and simulating them is important because we often want to test a given class discovery tool's performance under conditions where the ground truth is known (e.g. K=6). However, a flexible Gaussian cluster simulator for simulating Gaussian clusters with defined variance, spacing, and size does not exi...
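A bare-bones version of the idea can be done with `rnorm` alone. This toy sketch draws K spherical Gaussian clusters with a chosen spread and spacing; the parameter names are invented here, and a full simulator (as the post discusses) would control variance, spacing, and size per cluster.

```r
## Toy sketch: simulate K spherical Gaussian clusters in p dimensions.
set.seed(3)
K <- 6; per <- 50; p <- 2                 # ground truth: K = 6 clusters
spacing <- 8; sdev <- 1                   # centre spread vs within-cluster sd
centers <- matrix(rnorm(K * p, sd = spacing), K, p)
X <- do.call(rbind, lapply(1:K, function(k) {
  sweep(matrix(rnorm(per * p, sd = sdev), per, p), 2, centers[k, ], "+")
}))
labels <- rep(1:K, each = per)            # known ground-truth assignments
plot(X, col = labels, pch = 19)           # 2D view of the clusters
```

Because `labels` is known, any clustering tool's estimate of K and its partition can be scored directly against the truth.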
Optimism corrected bootstrapping: a problematic method
There are lots of ways to assess how predictive a model is while correcting for overfitting. In caret, the main methods I use are leave-one-out cross validation, for when we have relatively few samples, and k-fold cross validation when we have more. There is also another method called ‘optimism corrected bootstrapping’, which attempts to save s...
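The two caret resampling schemes mentioned above are selected via `trainControl`. A small sketch, with dataset and model chosen only for illustration:

```r
## Sketch of leave-one-out vs k-fold cross validation in caret.
library(caret)
data(iris)
loocv <- trainControl(method = "LOOCV")            # leave-one-out CV
kfold <- trainControl(method = "cv", number = 10)  # 10-fold CV
fit <- train(Species ~ ., data = iris,
             method = "rpart", trControl = kfold)  # swap in loocv if n is small
fit$results$Accuracy                               # cross-validated accuracy
```

Both schemes keep the held-out samples strictly out of training for each fit, which is exactly the separation the optimism corrected bootstrap fails to guarantee.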
Part 2: Optimism corrected bootstrapping is definitely biased, further evidence
Some people are very fond of the technique known as ‘optimism corrected bootstrapping’; however, this method is biased, and this becomes apparent as we increase the number of noise features to high numbers (as shown very clearly in my previous blog post). This needs exposing; I don’t have the time to do a publication on this nor the interest s...
Part 3: Two more implementations of optimism corrected bootstrapping show shocking bias
Welcome to part III of debunking the optimism corrected bootstrap in high dimensions (a quite high number of features) in the Christmas holidays. Previously we saw, with a reproducible code implementation, that this method is very biased when we have many features (50-100 or more). I suggest avoiding this method until at some point it has been reassess...
Part 4: Why does bias occur in optimism corrected bootstrapping?
In the previous parts of the series we demonstrated a positive results bias in optimism corrected bootstrapping by simply adding random features to our labels. This problem is due to an ‘information leak’ in the algorithm, meaning the training and test datasets are not kept separate when estimating the optimism. Due to this, the optimism, und...
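The size of the information leak is easy to quantify: a bootstrap sample contains, on average, about 63.2% of the original observations, so evaluating the bootstrap-trained model on the original data is far from an independent test. A quick simulation of that overlap:

```r
## Fraction of original observations appearing in a bootstrap sample.
set.seed(9)
n <- 1000
overlap <- mean(replicate(200, {
  idx <- sample(n, replace = TRUE)   # one bootstrap training sample
  length(unique(idx)) / n            # fraction of originals it contains
}))
overlap                              # close to 1 - exp(-1), about 0.632
```

Every one of those shared observations was seen during training, which is why the estimated optimism is systematically too small when p is large.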
Part 5: Code corrections to optimism corrected bootstrapping series
The truth is out there, R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model’s predictive power) with increasing p (completely random features). There were 2 implementations of the method giv...
Using clusterlab to benchmark clustering algorithms
Clusterlab is a CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html) for the routine testing of clustering algorithms. It can simulate positive controls (datasets with >1 cluster) and negative controls (datasets with 1 cluster). Why test clustering algorithms? Because they often fail in identifying the true K in practice, publis...