Publications by Stephen Turner

Prune GWAS data in R

29.03.2011

Hansong Wang, our biostats professor here at the Hawaii Cancer Center, generously gave me some R code that goes through a SNP annotation file (i.e., a map file) and selects SNPs that are at least a specified distance apart. You might want to do this if you’re picking a subset of SNPs for PCA, for instance. PLINK has an LD prun...

1450 sym 2 img
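
The distance-based selection described in this excerpt can be sketched as follows. This is a minimal illustration, not Hansong Wang's code: the helper name `prune_by_distance`, the column names `CHR`, `SNP`, and `BP`, and the toy SNPs are all assumptions made for the example.

```r
# Hypothetical helper: walk through the SNPs in position order and keep a SNP
# only if it lies at least `min_dist` base pairs past the last SNP kept on the
# same chromosome. Assumes columns CHR (chromosome) and BP (position).
prune_by_distance <- function(map, min_dist) {
  map <- map[order(map$CHR, map$BP), ]   # sort by chromosome, then position
  keep <- logical(nrow(map))
  last_chr <- NA
  last_bp <- NA
  for (i in seq_len(nrow(map))) {
    new_chr <- is.na(last_chr) || map$CHR[i] != last_chr
    if (new_chr || map$BP[i] - last_bp >= min_dist) {
      keep[i] <- TRUE
      last_chr <- map$CHR[i]
      last_bp <- map$BP[i]
    }
  }
  map[keep, ]
}

# Toy map file: made-up SNP names and positions, for illustration only
map <- data.frame(
  CHR = c(1, 1, 1, 1, 2, 2),
  SNP = c("rsA", "rsB", "rsC", "rsD", "rsE", "rsF"),
  BP  = c(1000, 1500, 12000, 30000, 500, 9000),
  stringsAsFactors = FALSE
)
pruned <- prune_by_distance(map, min_dist = 10000)
print(pruned$SNP)
```

The first SNP on each chromosome is always kept, so the result depends on where the scan starts. Note that this spaces SNPs by physical distance only and does not look at LD.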

Monday Links: 23andMe, RStudio, PacBio+Galaxy, Data Science One-Liners, Post-Linkage RFA, SSH

11.04.2011

Lately I haven’t written as many full-length posts as usual, but here’s a quick roundup of a few links I’ve shared on Twitter (@genetics_blog) over the last week: First, 23andMe is having a big DNA Day Sale ($108) for the kit + 1 year of their personal genome subscription service (https://www.23andme.com/). Previously mentioned R IDE RStudio r...

3096 sym 2 img

Using R + Bioconductor to Get Flanking Sequence Given Genomic Coordinates

12.04.2011

I’m working on a project using next-gen sequencing to fine-map a genetic association in a gene region. Now that I’ve sequenced the region in a small sample, I’m picking SNPs to genotype in a larger sample. When designing the genotyping assay the lab will need flanking sequence. This is easy to get for SNPs in dbSNP, but what ab...

817 sym
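
To make "flanking sequence given genomic coordinates" concrete, here is a toy sketch in base R. It is not the Bioconductor approach the post describes: the reference is a made-up 20-base string standing in for a genome, and `get_flank` is a hypothetical helper written for this example.

```r
# Hypothetical helper: return the bases on either side of a 1-based position
# `pos` in the reference string `ref`, up to `width` bases per side.
get_flank <- function(ref, pos, width) {
  n <- nchar(ref)
  stopifnot(pos >= 1, pos <= n)
  left  <- substr(ref, max(1, pos - width), pos - 1)   # "" if pos is 1
  right <- substr(ref, pos + 1, min(n, pos + width))   # "" if pos is n
  list(left = left, allele = substr(ref, pos, pos), right = right)
}

ref <- "ACGTACGTTAGCCGATAGCT"   # made-up sequence, for illustration only
flank <- get_flank(ref, pos = 10, width = 5)

# Print in the common assay-design layout: left flank, [allele], right flank
cat(flank$left, "[", flank$allele, "]", flank$right, "\n", sep = "")
```

Flanks are truncated at the ends of the reference rather than padded, so a variant near the edge returns fewer than `width` bases on that side.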

Using LaTeX for Math Formulas on the Web

20.04.2011

I love the idea of using R+LaTeX+Sweave for reproducible research. This is even easier now that R has a jazzy new IDE that supports Sweave syntax highlighting and automatic PDF generation. I know I’m going to take some flak for saying this, but let’s be honest here… If you’re working in the biomedical sciences, chances are, y...

812 sym

Annotated Manhattan plots and QQ plots for GWAS using R, Revisited

25.04.2011

Last year I showed you how to create Manhattan plots, and later how to highlight regions of interest, using ggplot2 in R. The code was slow, required a lot of memory, and was difficult to maintain and modify. I finally found time to rewrite the code using base graphics rather than ggplot2. The code is now much faster, and if you’re...

815 sym
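
For readers who have not seen one built by hand, a Manhattan plot in base graphics can be sketched roughly like this. This is not the code from the post: `add_cumulative_pos` is a hypothetical helper, the column names `CHR`, `BP`, and `P` are assumptions, and the p-values are simulated rather than real results.

```r
# Hypothetical helper: add a cumulative x position so that chromosomes sit
# side by side on a single axis. Assumes columns CHR and BP.
add_cumulative_pos <- function(d) {
  d <- d[order(d$CHR, d$BP), ]
  chr_max <- tapply(d$BP, d$CHR, max)            # last position per chromosome
  offset <- c(0, cumsum(as.numeric(chr_max)))[seq_along(chr_max)]
  names(offset) <- names(chr_max)
  d$pos <- d$BP + unname(offset[as.character(d$CHR)])
  d
}

set.seed(1)
# Simulated results for three chromosomes (not real data)
gwas <- data.frame(
  CHR = rep(1:3, each = 100),
  BP  = rep(seq(1e4, 1e6, length.out = 100), times = 3),
  P   = runif(300)
)
gwas <- add_cumulative_pos(gwas)

# Alternate two shades of gray by chromosome; label each chromosome at its midpoint
plot(gwas$pos, -log10(gwas$P),
     col = c("gray30", "gray60")[gwas$CHR %% 2 + 1],
     pch = 20, xaxt = "n",
     xlab = "Chromosome", ylab = expression(-log[10](italic(p))))
ticks <- tapply(gwas$pos, gwas$CHR, function(x) mean(range(x)))
axis(1, at = ticks, labels = names(ticks))
```

Base `plot()` draws all points in one vectorized call, which is part of why a base-graphics version can be lighter than building the same figure through ggplot2 layers.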

PLINK/SEQ for Analyzing Large-Scale Genome Sequencing Data

04.05.2011

PLINK/SEQ is an open-source C/C++ library for analyzing large-scale genome sequencing data. The library can be accessed via the pseq command line tool, or through an R interface. The project is developed independently of PLINK, but its syntax will be familiar to PLINK users. PLINK/SEQ boasts an impressive feature set for a project...

817 sym

Accessing Databases From R

09.05.2011

Jeffrey Breen put together a useful slideshow on accessing databases from R. I use RODBC every single day to access my own local MySQL server from R. I’ve had trouble with RMySQL, so I’ve always used RODBC instead after setting up my localhost MySQL server as a Windows data source. Once you get accustomed to accessing your data di...

817 sym

More Command-Line Text Munging Utilities

19.05.2011

In a previous post I linked to gcol as a quick and intuitive alternative to awk. I just stumbled across yet another set of handy text file manipulation utilities from the creators of the BEAGLE software for GWAS data imputation and analysis. In addition to several command line utilities for converting and formatting BEAGLE files, there are severa...

1740 sym 2 img

Steal This Blog!

22.06.2011

I wanted to contribute any content and code I post here to the R Programming Wikibook so I made a slight change to the Creative Commons license on this blog. All the written content is now cc-by-sa and all the code here is still open source BSD. So feel free to wholesale copy, modify, share, or redistribute anything you find here, just include a ...

895 sym 4 img

Scatterplot matrices in R

25.07.2011

I just discovered a handy function in R to produce a scatterplot matrix of selected variables in a dataset. The base graphics function is pairs(). Producing these plots can be helpful in exploring your data, especially using the second method below. Try it out on the built-in iris dataset. (The data set gives the measurements in cm of the ...

1965 sym R (1094 sym/3 pcs) 8 img
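
A minimal sketch of `pairs()` on the built-in iris data is shown here. The excerpt is truncated before the post's own code, so the two calls below are illustrative and not necessarily the author's two methods; the color choices are arbitrary.

```r
# Scatterplot matrix of the four numeric measurements in iris
pairs(iris[, 1:4], main = "Iris measurements (cm)")

# Same matrix, with points colored by species (colors chosen arbitrarily)
species_cols <- c("red", "green3", "blue")[as.integer(iris$Species)]
pairs(iris[, 1:4], col = species_cols, pch = 19,
      main = "Iris measurements by species")
```

Coloring by a grouping factor often makes cluster structure visible that the plain matrix hides.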