Publications by chris2016
Easy quick PCA analysis in R
Principal component analysis (PCA) is very useful for basic quality control (e.g. looking for batch effects) and for assessing how the data is distributed (e.g. finding outliers). A straightforward way is to write your own wrapper function around prcomp and ggplot2; another is to use the one that comes with M3C (https://bioconductor.or...
4116 sym R (760 sym/5 pcs) 10 img
Quick and easy t-SNE analysis in R
t-SNE is a useful dimensionality reduction method that allows you to visualise data embedded in a lower number of dimensions, e.g. 2, in order to see patterns and trends in the data. It can deal with more complex patterns of Gaussian clusters in multidimensional space than PCA can. Although it is not suited to finding outliers because of how the sam...
3952 sym R (451 sym/4 pcs) 8 img
Running UMAP for data visualisation in R
UMAP is a non-linear dimensionality reduction algorithm in the same family as t-SNE. In the first phase of UMAP, a weighted k-nearest-neighbour graph is computed; in the second, a low-dimensional layout of this graph is calculated. The embedded data points can then be visualised in the new space and compared with other variables of interest. It can...
3712 sym R (589 sym/5 pcs) 10 img
Fast adaptive spectral clustering in R (brain cancer RNA-seq)
Spectral clustering refers to a family of algorithms that cluster eigenvectors derived from the matrix that represents the input data’s graph. An important step in this method is applying a kernel function to the input data to generate an N×N similarity matrix or graph (where N is the number of input observations). Subsequent st...
6064 sym R (1803 sym/3 pcs) 6 img
How to easily make a ROC curve in R
A typical task in evaluating the results of machine learning models is making a ROC curve; this plot tells the analyst how well a model can discriminate one class from a second. We developed MLeval (https://cran.r-project.org/web/packages/MLeval/index.html), an evaluation package for R, to make ROC curves, PR curves, PR gain curves, and calib...
2948 sym R (746 sym/2 pcs) 6 img
How to make a precision recall curve in R
Precision recall (PR) curves are useful for machine learning model evaluation when there is an extreme imbalance in the data and the analyst is particularly interested in one class. A good example is credit card fraud, where the instances of fraud are extremely few compared with non-fraud. Here are some facts about PR curves. PR curves are sensit...
2641 sym R (585 sym/1 pcs) 4 img
Consensus clustering in R
The logic behind the Monti consensus clustering algorithm is that, in the face of resampling, the ideal clusters should be stable; thus any pair of samples should either always or never cluster together. We can use this principle to infer the optimal number of clusters (K). This works by examining cluster stability from K=2 to K=10 during resamplin...
6431 sym R (139 sym/1 pcs) 6 img
Part 6: How not to validate your model with optimism corrected bootstrapping
When evaluating a machine learning model, if the same data is used to train and test the model, this results in overfitting. The model then appears to perform much better in predictive ability than it would if it were applied to completely new data, because it uses random noise within the data to learn from and make predictions. However, new da...
5604 sym R (3824 sym/2 pcs) 4 img