Publications by Abhijit

A word of warning about grep, which and the like

13.07.2011

I’ve often selected columns or rows of a data frame using grep or which, based on some property. That is inherently sound, but the trouble comes when you wish to remove rows or columns based on that grep or which call, e.g., dat <- dat[,-grep('\\.1', names(dat))] which would remove columns with a .1 in the name. This is fine the first time ar...

1135 sym R (97 sym/2 pcs) 4 img
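
The excerpt above is cut off before it explains the problem, so the following is only a minimal sketch of one well-known pitfall with this pattern, not necessarily the one the post goes on to describe. It assumes a toy data frame (the column names `a`, `b`, `b.1` are made up) and shows what happens when the same negative-index `grep` call is run a second time, after the matching columns are already gone.

```r
# Toy data frame; the column names are invented for illustration.
dat <- data.frame(a = 1:3, b = 4:6, b.1 = 7:9)

# First pass: grep() finds the ".1" column, so negative indexing drops it.
idx <- grep("\\.1", names(dat))
dat <- dat[, -idx, drop = FALSE]

# Second pass: nothing matches any more, so grep() returns integer(0).
idx <- grep("\\.1", names(dat))

# -integer(0) is still integer(0), so this selects no columns at all
# instead of leaving the data frame unchanged.
broken <- dat[, -idx, drop = FALSE]

# A safer variant: a logical mask from grepl() is well defined
# even when there are no matches.
keep <- !grepl("\\.1", names(dat))
safe <- dat[, keep, drop = FALSE]
```

Here `broken` ends up with zero columns, while `safe` still has both `a` and `b`. The logical mask is one way to make the removal repeatable; checking `length(idx) > 0` before indexing would be another.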

RStudio 0.94.92 visited

30.07.2011

I just updated my RStudio version to the latest, v.0.94.92 (will this asymptotically approach 1, or actually get to 1?). It was nice to see the number of improvements the development team has implemented, based, I’m sure, on community feedback. The team has, in my experience, been extraordinarily responsive to user feedback, and I’m sure this p...

2481 sym 4 img

An enhanced Kaplan-Meier plot, updated

01.09.2011

I’ve updated the R code for the enhanced K-M plot to include additions and improvements by Gil Thomas and Mark Cowley. Thanks fellows for the feedback and updates. http://statbandit.wordpress.com/2011/03/08/an-enhanced-kaplan-meier-plot/

638 sym 4 img

Pocketbook costs of software

23.02.2012

I have always been provided SAS as part of my job, so I never really realized how much it cost. I’ve bought Stata before, and of course R. I recently found out how much a reasonable bundle of SAS modules along with base SAS costs per year per seat, at least under the GSA. I tried finding out how much IBM SPSS is for a comparable bundle, but t...

1344 sym 6 img

Kaplan-Meier plots using ggplots2 (updated)

01.04.2014

About 3 years ago I published some code on this blog to draw a Kaplan-Meier plot using ggplot2. Since then, ggplot2 has been updated (from 0.8.9 to 0.9.3.1) and has changed syntactically. Since that post, I have also become comfortable with Git and GitHub. I have updated the code, edited it for a small error, and published it in a Gist. This gist...

1268 sym 4 img

The need for documenting functions

22.05.2014

My current work usually requires me to work on a project until we can submit a research paper, and then move on to a new project. However, 3-6 months down the road, when the reviews for the paper return, it is quite common to have to do some new analyses or re-analyses of the data. At that time, I have to re-visit my code! One of the common probl...

2676 sym R (749 sym/1 pcs) 4 img

Practical Data Science Cookbook

10.11.2014

My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innova...

1752 sym 6 img

“LaF”-ing about fixed width formats

10.11.2014

If you have ever worked with US government data or other large datasets, it is likely you have faced fixed-width format data. This format has no delimiters in it; the data look like strings of characters. A separate format file defines which columns of data represent which variables. It seems as if the format is from the punch-card era, but it is...

4818 sym R (2047 sym/7 pcs) 4 img
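
To make the fixed-width layout described above concrete, here is a small sketch. It uses base R's `read.fwf` rather than the LaF package the post is about, and the file contents, column widths, and column names are all invented for illustration. The `widths` and `col_names` vectors stand in for the separate format file.

```r
# Write a tiny fixed-width file; the layout is hypothetical.
# Characters 1-3: id, characters 4-5: age, characters 6-10: state (space padded).
path <- tempfile(fileext = ".txt")
writeLines(c("00134MD   ", "00258VA   ", "00341DC   "), path)

# Stand-in for the format file: one width and one name per variable.
widths <- c(3, 2, 5)
col_names <- c("id", "age", "state")

# read.fwf() cuts each line at the given widths and hands the pieces
# to read.table(), so read.table() arguments such as colClasses apply.
fw <- read.fwf(path, widths = widths, col.names = col_names,
               colClasses = c("character", "integer", "character"),
               strip.white = TRUE)
```

Reading `id` as character keeps its leading zeros. For very large files `read.fwf` can be slow, which is a common reason to reach for a dedicated package.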

Creating new data with max values for each subject

01.12.2014

We have a data set dat with multiple observations per subject. We want to create a subset of this data such that each subject (with ID giving the unique identifier for the subject) contributes the observation where the variable X takes its maximum value for that subject. An R solution: using the excellent R package dplyr, we can do this using w...

1406 sym R (471 sym/4 pcs) 4 img
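
The excerpt above is truncated before it shows the author's code, so what follows is just one possible dplyr sketch of the task as stated, not the post's own solution. The toy data and the extra `visit` column are assumptions; only `dat`, `ID`, and `X` come from the description.

```r
library(dplyr)

# Hypothetical toy data: several observations per subject.
dat <- data.frame(
  ID    = c(1, 1, 2, 2, 2, 3),
  X     = c(5, 9, 2, 7, 4, 1),
  visit = c(1, 2, 1, 2, 3, 1)
)

# For each subject, keep the row where X is largest.
# which.max() returns the first position of the maximum,
# so a subject with tied maxima still contributes one row.
max_rows <- dat %>%
  group_by(ID) %>%
  slice(which.max(X)) %>%
  ungroup()
```

If every tied row should be kept instead, `filter(X == max(X))` in place of the `slice()` call would do that.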

Annotated Facets with ggplot2

20.10.2016

I was recently asked to do a panel of grouped boxplots of a continuous variable, with each panel representing a categorical grouping variable. This seems easy enough with ggplot2 and the facet_wrap function, but then my collaborator wanted p-values on the graphs! This post is my approach to the problem. First of all, one caveat. I’m a huge fa...

2797 sym R (1780 sym/10 pcs) 12 img