Publications by mark

A multidimensional “which” function

16.09.2011

Update: Henrik Bengtsson commented that which(x, arr.ind=TRUE) gives the same result, rendering the post below academic (thanks for the comment!). So, for academic interest, I’ll leave it up. In my defense, I implemented this kind of functionality in C some time ago, so I did not even think about RTFM. (So there’s a lesson in this blog after all ...
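As a small illustration of the point in the update, the sketch below calls base R's `which` with `arr.ind = TRUE` on a matrix and on a three-dimensional array. The objects `x`, `arr`, `idx` and `idx3` are made up for this example and do not come from the original post.

```r
# A 2 x 3 logical matrix, filled column by column:
# TRUE at position (1, 1) and at position (2, 3)
x <- matrix(c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE), nrow = 2)

# Plain which() returns linear (column-major) indices
lin <- which(x)

# With arr.ind = TRUE, each TRUE element gets one row of array indices
idx <- which(x, arr.ind = TRUE)
print(idx)

# The same call works for higher-dimensional arrays
arr <- array(FALSE, dim = c(2, 2, 2))
arr[1, 2, 2] <- TRUE
idx3 <- which(arr, arr.ind = TRUE)
print(idx3)
```

The index matrix can be used directly for subsetting, so `x[idx]` returns the selected elements.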

2899 sym R (934 sym/4 pcs) 6 img 4 tbl

What do your rules look like? editrules 1.8-x answers with the help of igraph

26.10.2011

We (Edwin de Jonge and I) have recently updated our editrules package. The most important new features include (beta) support for categorical data. However, in this post I’m going to show some visualizations we included, made possible by Gabor Csardi’s awesome igraph package. Make sure you run install.packages(c('igr...

3658 sym R (2138 sym/20 pcs) 8 img 10 tbl

Deductive imputation with the deducorrect package

26.11.2011

Missing data hinders statistical analyses. Estimating missing values (imputation) prior to analysis is one way to deal with that. In some cases, however, the missing values need not be estimated at all, since they can be derived with certainty from other data that is present. The latest version of our package deducorrect can do this for numerical as we...

1836 sym R (606 sym/2 pcs) 20 img 1 tbl

Representation of numerical NA’s in R and the 1954 enigma

08.07.2012

I’ve always wondered how exactly the missing value (NA) in R is represented under the hood. Last weekend I was working on a little project that gave me enough excuse to spend some time on finding this out. So, I descended into the catacombs of R and came back with some treasure. In short: A missing integer is represented by the largest negativ...
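One way to peek at these representations from within R itself is to dump the raw bytes of the NA constants. This is only a sketch, not the method used in the post: it relies on base R's `writeBin` and `readBin`, and it assumes that the bit pattern of `NA_real_` is written out unchanged on the platform at hand. The variable names are invented for the example.

```r
# Bytes of the integer NA, least significant byte first
int_bytes <- writeBin(NA_integer_, raw(), endian = "little")
print(int_bytes)

# Bytes of the numeric (double) NA, least significant byte first
dbl_bytes <- writeBin(NA_real_, raw(), endian = "little")
print(dbl_bytes)

# Reinterpret the 8 bytes of the double as two 32-bit integers;
# with little-endian order the first one is the low word
words <- readBin(dbl_bytes, what = "integer", n = 2L, size = 4L,
                 endian = "little")
low_word <- words[1]
print(low_word)
```

If the assumption holds, the low word of the numeric NA is the number from the post's title, which is what distinguishes NA from an ordinary NaN.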

3762 sym R (607 sym/4 pcs) 24 img 2 tbl

The stringdist package

26.02.2013

String metrics have important applications in web search, spelling correction and computational biology amongst others. Many different metrics exist, but the most well-known are based on counting the number of basic edit operations it takes to turn one string into another. String distance functions seem to have been partly missing and partly sca...
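To make "counting basic edit operations" concrete, here is a minimal sketch using `adist`, the generalized Levenshtein distance that ships with base R (in the utils package). It is a stand-in chosen so the example runs without extra packages; it is not the stringdist API itself.

```r
# Levenshtein distance: the minimal number of insertions, deletions
# and substitutions needed to turn one string into the other
d <- adist("kitten", "sitting")
kitten_sitting <- d[1, 1]
print(kitten_sitting)

# Identical strings are at distance zero
same <- adist("abc", "abc")[1, 1]

# adist() is vectorized and returns a matrix of pairwise distances
m <- adist(c("kitten", "mitten"), c("sitting", "fitting", "kitten"))
print(m)
```

Turning "kitten" into "sitting" takes two substitutions and one insertion, so the distance is 3.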

2479 sym

Approximate string matching in R

09.08.2013

I have released a new version of the stringdist package. Besides some new string distance algorithms, it now contains two convenient matching functions: amatch, equivalent to R’s match function but allowing for approximate matching, and ain, similar to R’s %in% operator ...
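A short sketch of how the two functions can be called, assuming the stringdist package is installed. The lookup table and the `maxDist` values are made up for illustration; the default distance method is used.

```r
library(stringdist)

lookup <- c("uhura", "leela")

# amatch() returns the position of the closest entry in the lookup
# table, provided it lies within maxDist edit operations
pos <- amatch("leia", lookup, maxDist = 2)
print(pos)

# With a stricter bound there is no acceptable match, so NA is returned
no_match <- amatch("leia", lookup, maxDist = 1)
print(no_match)

# ain() answers the yes/no question, analogous to %in%
found <- ain("leia", lookup, maxDist = 2)
print(found)
```

"leia" and "leela" are two edit operations apart, which is why the match appears only once `maxDist` is at least 2.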

1376 sym R (501 sym/2 pcs) 1 tbl

A bit of benchmarking with string distances

07.09.2013

After my last post about the stringdist package, Zachary Mayer pointed out to me that the implementations of the Levenshtein and Jaro-Winkler distances in the RecordLinkage package are about two to three times faster. His benchmark compares randomly generated character strings of 5-25 characters, which probably covers many use cases invol...
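The general shape of such a benchmark can be sketched as follows. This is not the benchmark from the post: it times base R's `adist` as a placeholder distance function, and the helper `random_string`, the sample size of 200 and the seed are all assumptions made for this example.

```r
set.seed(1)

# Hypothetical helper: one random lowercase string of 5 to 25 characters
random_string <- function(min_len = 5, max_len = 25) {
  n <- sample(min_len:max_len, 1)
  paste(sample(letters, n, replace = TRUE), collapse = "")
}

a <- replicate(200, random_string())
b <- replicate(200, random_string())

# Time the computation of all 200 x 200 pairwise distances;
# swap adist() for the function you actually want to compare
timing <- system.time(d <- adist(a, b))
print(timing[["elapsed"]])
```

Running the same strings through each candidate function and comparing the elapsed times gives a rough relative speed; repeated runs are needed for anything more precise.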

4778 sym R (3168 sym/12 pcs) 6 tbl

Review of “Building interactive graphs with ggplot2 and shiny”

04.08.2014

Recently, Packt published a video course with the above title, and I’ve just spent a pleasant morning reviewing it at Packt’s request. Pleasant, because I think the course gives an excellent introduction to both ggplot2 and shiny. The course is authored by Christophe Ladroue. Course material: Both video and sound are of good quality, and Chris...

4402 sym

sort.data.frame

15.08.2014

I came across this post on SO, where several solutions to sorting data.frames are presented. The problem must have been solved a million times, but here’s a solution I like to use. It benefits from the fact that sort is an S3 generic. sort.data.frame <- function(x, decreasing=FALSE, by=1, ... ){ f <- function(...) order(...,decreasing=decreasing) i ...
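Since the excerpt cuts off mid-function, here is one possible complete version of the idea. Everything after the definition of `f` is my own reconstruction under the assumption that `by` selects the sorting columns; it is not necessarily the author's code, and the data frame `df` is invented for the example.

```r
# S3 method for sort(): order the rows of a data.frame by the columns in 'by'
sort.data.frame <- function(x, decreasing = FALSE, by = 1, ...) {
  # order() with the requested direction fixed
  f <- function(...) order(..., decreasing = decreasing)
  # Pass the selected columns as separate, unnamed arguments to order()
  i <- do.call(f, unname(as.list(x[by])))
  x[i, , drop = FALSE]
}

df <- data.frame(a = c(3, 1, 2), b = c("x", "y", "z"))

asc  <- sort(df, by = "a")
desc <- sort(df, by = "a", decreasing = TRUE)
print(asc)
print(desc)
```

Because `sort` dispatches on the class of its first argument, defining the method is enough for `sort(df)` to work; `by` accepts column names or positions, and several columns break ties in the order given.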

741 sym R (297 sym/2 pcs)

stringdist 0.8: now with soundex

22.08.2014

An update to the stringdist package was released earlier this month. Thanks to a contribution by Jan van der Laan, the package now includes a method to compute soundex codes as defined here. Briefly, soundex encoding aims to translate words that sound similar (when pronounced in English) to the same code. Soundex codes can be computed with the new...
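A minimal sketch of soundex in practice, assuming a reasonably recent stringdist installation. It uses the package's `phonetic` function and the `"soundex"` method of `stringdist`; the example names are chosen for illustration only.

```r
library(stringdist)

# Names that sound alike in English should map to the same code
codes <- phonetic(c("Robert", "Rupert", "Rubin"))
print(codes)

# The soundex "distance" is 0 when the codes are equal and 1 otherwise
d_same <- stringdist("Robert", "Rupert", method = "soundex")
d_diff <- stringdist("Robert", "Rubin", method = "soundex")
print(c(d_same, d_diff))
```

"Robert" and "Rupert" share a code, while "Rubin" differs, so soundex is mainly useful as a coarse filter for candidate matches rather than as a fine-grained distance.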

1543 sym R (624 sym/6 pcs)