Publications by mark

Easy to use option settings management with the ‘settings’ package

05.11.2014

Last week I released a new package called settings. It grew out of my frustration built up during several small projects where I’m generating heavily parameterized d3/js output. What I wanted was support to define a whole bunch of option settings with default values; be able to set them globally or locally within a function or object without...

2398 sym R (670 sym/5 pcs)

stringdist 0.9: exercise all your cores

26.01.2015

The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First of all, encoding conversion got a lot faster since this is now done from C rather than from R. Secondly, stringdist now employs multithreading based on OpenMP. This means that calculations are now parallelized on ...

2972 sym R (109 sym/2 pcs) 6 img

Stringdist 0.9.2: dist objects, string similarities and some deprecated arguments

24.06.2015

On 24-06-2015 stringdist 0.9.2 was accepted on CRAN. A summary of new features can be found in the NEWS file; here I discuss the changes with some examples. Computing ‘dist’ objects with ‘stringdistmatrix’: The R dist object is used as input for many clustering algorithms such as stats::hclust. It stores the lower triangle of a matrix...

2992 sym R (1066 sym/5 pcs) 8 img

stringdist 0.9.4 and 0.9.3: distances between integer sequences

27.10.2015

A new release of stringdist has been accepted on CRAN. stringdist offers a number of popular distance functions between sequences of integers or characters that are independent of character encoding. version 0.9.4 bugfix: edge case for zero-size for lower tridiagonal dist matrices (caused UBSAN to fire, but gave correct results). bugfix in jw di...

1613 sym

settings 0.2.3

27.10.2015

An updated version of the settings package has been accepted on CRAN. The settings package provides an alternative way to manage option settings in R. It aims to support layered options management, where global options are the defaults that can easily be overruled locally (e.g. when calling a function, or as options that are part of an object). New featur...

855 sym

Easy data validation with the validate package

25.03.2016

The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example. library(magrittr) library(validate) iris %>% check_that( Sepal.Width > 0.5 * Sepal.Length , mean(Sepal.Width) > 0 , if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10 ) %>% summary() # rule items passes fails nN...

1577 sym R (1159 sym/3 pcs) 2 img

validate version 1.5 is out

24.06.2016

A new version of the validate package for data validation was just accepted on CRAN and will be available on all mirrors in a few days. The most important addition is that you can now reference the data set as a whole, using the “dot” syntax like so: iris %>% check_that( nrow(.)>100 , "Sepal.Width" %in% names(.)) %>% summary() rule ...

2096 sym R (503 sym/2 pcs)

stringdist 0.9.4.2 released

11.09.2016

stringdist 0.9.4.2 was accepted on CRAN at the end of last week. This release just fixes a few bugs affecting the stringdistmatrix function, when called with a single argument. From the NEWS file: bugfix in stringdistmatrix(a): value of p, for jw-distance was ignored (thanks to Max Fritsche) bugfix in stringdistmatrix(a): Would segfault on q-gra...

1306 sym 2 img

Announcing the simputation package: make imputation simple

13.09.2016

I am happy to announce that my simputation package has appeared on CRAN this weekend. This package aims to simplify missing value imputation. In particular it offers standardized interfaces that make it easy to define both imputation method and imputation model; for multiple variables at once; while grouping data by categorical variables; all fi...

2728 sym R (3198 sym/7 pcs)

Track changes in data with the lumberjack %>>%

23.06.2017

So you are using this pipeline to have data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R > data(retailers, package="validate") > head(retailers, 3) size incl.prob staff turnover other.rev total.rev staff.costs total.cost...

2126 sym R (1613 sym/4 pcs) 4 img