Publications by statcompute

Playing Map() and Reduce() in R – Subsetting

08.09.2018

In the previous post (https://statcompute.wordpress.com/2018/09/03/playing-map-and-reduce-in-r-by-group-calculation), I showed how to employ MapReduce when calculating by-group statistics. The same divide-and-conquer strategy is applicable to other use cases, one of which is the subsetting operation. In the example below, let...
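As a minimal sketch of the divide-and-conquer idea applied to subsetting (not the code from the post itself), the snippet below splits a data frame into chunks, filters each chunk with Map(), and stacks the pieces with Reduce(). The built-in mtcars data, the split by cyl, and the mpg > 20 filter are all assumptions chosen for illustration.

```r
# Illustrative data only: the built-in mtcars set stands in for the post's data.
df <- mtcars

# Divide: split the rows into chunks (here by number of cylinders).
chunks <- split(df, df$cyl)

# Map: apply the same row filter to every chunk independently.
keep_rows <- function(d) d[d$mpg > 20, , drop = FALSE]
parts <- Map(keep_rows, chunks)

# Reduce: stack the filtered chunks back into a single data frame.
out <- Reduce(rbind, parts)

# The result holds the same rows as a direct subset, possibly in another order.
direct <- df[df$mpg > 20, , drop = FALSE]
```

Because each chunk is filtered on its own, the Map() step is the part that could be handed to a parallel mapper if the chunks were large.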

2683 sym R (497 sym/3 pcs) 4 img

Modeling Frequency Outcomes with Ordinal Models

10.09.2018

When modeling frequency outcomes, we often need to go beyond the standard Poisson regression because of its strict distributional assumption and consider more flexible alternatives. In general, there are two broad categories of modeling approaches in light of practical concerns about frequency outcomes. The first category of models is mainly int...

4569 sym R (647 sym/3 pcs) 4 img

How to Avoid For Loop in R

15.09.2018

A FOR loop is the most intuitive way to apply an operation to a series by looping through each item one by one. This makes perfect sense logically, but useRs should avoid it given its low efficiency. In R, there are two ways to implement the same functionality as a FOR loop. The first option is the lapply() or sapply() function that applies ...
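To make the first option concrete, here is a small sketch that computes the same result with a FOR loop, lapply(), and sapply(). The input vector x and the helper sq() are hypothetical names used only for this illustration.

```r
x <- 1:5
sq <- function(i) i^2   # hypothetical per-item operation

# FOR loop writing into a pre-allocated result vector.
res_loop <- numeric(length(x))
for (i in seq_along(x)) res_loop[i] <- sq(x[i])

# lapply() applies sq() to each item and returns a list.
res_lapply <- lapply(x, sq)

# sapply() does the same and simplifies the result to a vector.
res_sapply <- sapply(x, sq)
```

lapply() always returns a list, while sapply() tries to simplify, so the choice mostly depends on what the downstream code expects.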

3007 sym R (1707 sym/6 pcs)

Why Vectorize?

16.09.2018

In the post (https://statcompute.wordpress.com/2018/09/15/how-to-avoid-for-loop-in-r), I briefly introduced the idea of vectorization and its potential use cases. One might wonder why we even need the Vectorize() function, given that it is just a wrapper, and whether vectorizing a function yields any material efficiency gain. It is...
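A short sketch of what "just a wrapper" means: Vectorize() returns a function that calls mapply() under the hood, so a scalar-only function can accept a vector. The function clip_at_zero() is a hypothetical example, and no timing claim is made here.

```r
# Hypothetical scalar function: `if` handles only a length-one condition.
clip_at_zero <- function(x) if (x < 0) 0 else x

# Vectorize() wraps the function so it can be called on a whole vector.
v_clip <- Vectorize(clip_at_zero)

x <- c(-2, 0.5, 3)
res_wrapped <- v_clip(x)

# Spelling out the mapply() call gives the same values.
res_mapply <- mapply(clip_at_zero, x)

# A natively vectorized alternative for this particular operation.
res_native <- pmax(x, 0)
```

The wrapper changes the calling convention, not the underlying work: the scalar function is still invoked once per element.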

3574 sym R (1478 sym/1 pcs)

Union Multiple Data.Frames with Different Column Names

22.09.2018

On Friday, while working on a project in which I needed to union multiple data.frames with different column names, I realized that the base::rbind() function doesn’t take data.frames with different column names, so I quickly drafted an rbind2() function on the fly to get the job done, based on the idea of MapReduce that I discussed befo...
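The rbind2() function from the post is not reproduced in this excerpt. As a rough sketch of how such a union could work with Map() and Reduce(), the hypothetical helper rbind_fill_sketch() below pads each data frame with NA columns for the names it lacks and then stacks the aligned pieces. The helper name and the sample data are assumptions.

```r
# Hypothetical helper, not the author's rbind2().
rbind_fill_sketch <- function(...) {
  dfs <- list(...)
  # Union of all column names, in order of first appearance.
  cols <- Reduce(union, Map(names, dfs))
  # Pad one data frame with NA columns for any names it is missing,
  # then put the columns in the common order.
  pad <- function(d) {
    for (m in setdiff(cols, names(d))) d[[m]] <- rep(NA, nrow(d))
    d[, cols, drop = FALSE]
  }
  # Map: align every data frame; Reduce: stack them row-wise.
  Reduce(rbind, Map(pad, dfs))
}

a <- data.frame(id = 1:2, x = c(0.1, 0.2))
b <- data.frame(id = 3L, y = "k", stringsAsFactors = FALSE)
out <- rbind_fill_sketch(a, b)
```

Cells that have no source value come back as NA, which keeps the origin of each row visible after the union.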

1462 sym R (959 sym/2 pcs) 2 img

By-Group Summary with SparkR – Follow-up for A Reader Comment

23.09.2018

A reader of my previous post (https://statcompute.wordpress.com/2018/09/03/playing-map-and-reduce-in-r-by-group-calculation), Mr. Wayne Zhang, made a good comment: “Why not use directly either Spark or H2O to derive such computations without involving detailed map/reduce”. Although Spark is not as flexible as R in the statistical c...

1584 sym R (1640 sym/3 pcs)

Monotonic Binning with Equal-Sized Bads for Scorecard Development

14.10.2018

In previous posts (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package) and (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), I developed two different algorithms for monotonic binning. While the first tends to generate bins with equal densities, the second wo...

1327 sym R (2169 sym/2 pcs)

Convert Data Frame to Dictionary List in R

16.11.2018

In R, there are a couple of ways to convert a column-oriented data frame to a row-oriented dictionary list or the like, e.g. a list of lists. In the code snippet below, I show each approach and how to extract keys and values from the dictionary. As shown in the benchmark, it appears that the generic R data structure is still the most efficient...
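The post's own snippet and benchmark are not included in this excerpt. As a minimal base-R sketch of the conversion, the code below builds a list of lists in two ways and then pulls out keys and values. The sample data frame and both approaches are assumptions, not necessarily the ones benchmarked in the post.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# Approach 1: turn each one-row slice into a named list.
lst1 <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, ]))

# Approach 2: walk the columns in parallel with Map(); each call to list()
# receives one value per column, named after the column.
# (Assumes no column is itself named "f".)
lst2 <- do.call(Map, c(list(f = list), df))

# Keys of one "dictionary" and the values stored under one key.
keys <- names(lst1[[1]])
vals <- sapply(lst1, `[[`, "name")
```

Each element of the result behaves like a small dictionary: names() gives its keys, and `[[` looks up a value by key.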

782 sym R (2879 sym/1 pcs)

Growing List vs Growing Queue

17.11.2018

### GROWING LIST ###
base_lst1 <- function(df) {
  l <- list()
  for (i in seq(nrow(df))) l[[i]] <- as.list(df[i, ])
  return(l)
}

### PRE-ALLOCATING LIST ###
base_lst2 <- function(df) {
  l <- vector(mode = "list", length = nrow(df))
  for (i in seq(nrow(df))) l[[i]] <- as.list(df[i, ])
  return(l)
}

### DEQUER PACKAGE ###
dequer_queue <- func...

431 sym R (3471 sym/1 pcs)

Creating List with Iterator

22.11.2018

In the post (https://statcompute.wordpress.com/2018/11/17/growing-list-vs-growing-queue), I showed how to grow a list or a list-like queue from a dataframe. In that example, the code snippet relied heavily on the FOR loop to do the assignment item by item, which made me think about potential alternatives afterwards. For instanc...
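The excerpt does not say which iterator implementation the post uses, so the following is only one possible reading: a hand-rolled row iterator built from a closure in base R. The constructor make_row_iter() and its has_next() / next_row() methods are hypothetical names introduced for this sketch.

```r
# Hypothetical closure-based iterator over the rows of a data frame.
make_row_iter <- function(df) {
  i <- 0L
  n <- nrow(df)
  list(
    has_next = function() i < n,
    next_row = function() {
      i <<- i + 1L          # advance the cursor kept in the closure
      as.list(df[i, ])      # return the current row as a named list
    }
  )
}

df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
it <- make_row_iter(df)

# Consume the iterator into a pre-allocated list.
out <- vector(mode = "list", length = nrow(df))
k <- 0L
while (it$has_next()) {
  k <- k + 1L
  out[[k]] <- it$next_row()
}
```

The iterator keeps its position internally, so the consuming code only asks for the next row and never indexes the data frame directly.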

2303 sym R (911 sym/5 pcs)