Publications by statcompute
Playing Map() and Reduce() in R – Subsetting
In the previous post (https://statcompute.wordpress.com/2018/09/03/playing-map-and-reduce-in-r-by-group-calculation), I showed how to employ MapReduce when calculating by-group statistics. The same divide-and-conquer strategy is applicable to other use cases, one of which is the subsetting operation. In the example below, let...
2683 sym R (497 sym/3 pcs) 4 img
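As a rough illustration of the divide-and-conquer idea applied to subsetting, the sketch below splits a data frame into pieces, filters each piece with Map(), and stitches the survivors back together with Reduce(). The toy data frame, the column names `grp` and `val`, and the even-value filter are assumptions made up for this example; they are not taken from the post.

```r
# Toy data, assumed for illustration only.
df <- data.frame(grp = rep(c("a", "b", "c"), each = 4),
                 val = 1:12,
                 stringsAsFactors = FALSE)

# Divide: one data frame per group.
pieces <- split(df, df$grp)

# Map: apply the same subsetting rule to every piece.
kept <- Map(function(d) d[d$val %% 2 == 0, ], pieces)

# Reduce: combine the filtered pieces into a single data frame.
result <- Reduce(rbind, kept)
```

On a single machine this gives the same rows as `df[df$val %% 2 == 0, ]`; the split/Map/Reduce form mainly matters when the Map step is handed to a parallel backend.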
Modeling Frequency Outcomes with Ordinal Models
When modeling frequency outcomes, we often need to go beyond the standard Poisson regression because of its strict distributional assumption and consider more flexible alternatives. In general, there are two broad categories of modeling approaches in light of practical concerns about frequency outcomes. The first category of models is mainly int...
4569 sym R (647 sym/3 pcs) 4 img
How to Avoid For Loop in R
A FOR loop is the most intuitive way to apply an operation to a series by looping through each item one by one, which makes perfect sense logically but should be avoided by useRs given its low efficiency. In R, there are two ways to implement the same functionality as a FOR loop. The first option is the lapply() or sapply() function that applies ...
3007 sym R (1707 sym/6 pcs)
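A minimal sketch of the alternatives named in this excerpt, next to the loop they replace. The helper `square_plus_one` and the input `xs` are hypothetical names chosen for this example. Vectorize() is included on the assumption that it is the second option, since the follow-up post says this one introduced vectorization.

```r
# Hypothetical scalar function, used only for illustration.
square_plus_one <- function(x) x^2 + 1

xs <- 1:5

# FOR loop: pre-allocate the result and fill it item by item.
out_loop <- numeric(length(xs))
for (i in seq_along(xs)) out_loop[i] <- square_plus_one(xs[i])

# lapply() returns a list; sapply() simplifies the result to a vector.
out_lapply <- unlist(lapply(xs, square_plus_one))
out_sapply <- sapply(xs, square_plus_one)

# Vectorize() wraps the function so it is applied to each element in turn.
out_vec <- Vectorize(square_plus_one)(xs)
```

All four results hold the same values; the variants differ only in how the iteration is expressed.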
Why Vectorize?
In the post (https://statcompute.wordpress.com/2018/09/15/how-to-avoid-for-loop-in-r), I briefly introduced the idea of vectorization and potential use cases. One might wonder why we even need the Vectorize() function, given that it is just a wrapper, and whether there is any material efficiency gain from vectorizing a function. It is...
3574 sym R (1478 sym/1 pcs)
Union Multiple Data.Frames with Different Column Names
On Friday, while working on a project in which I needed to union multiple data.frames with different column names, I realized that the base::rbind() function doesn’t take data.frames with different column names, so I quickly drafted a rbind2() function on the fly to get the job done, based on the idea of MapReduce that I discussed befo...
1462 sym R (959 sym/2 pcs) 2 img
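The author's rbind2() is not shown in this excerpt, so the following is only one possible sketch of the same idea in base R: collect the union of all column names, pad each data frame with NA for the columns it lacks, then reduce with rbind(). The function name `union_frames` and the sample data frames are hypothetical.

```r
# Hypothetical helper, NOT the author's rbind2(): union data frames whose
# column names differ by padding missing columns with NA before rbind().
union_frames <- function(dfs) {
  # Reduce: union of the column names across all inputs.
  all_cols <- Reduce(union, lapply(dfs, names))
  # Map: add the missing columns to each input and align the column order.
  padded <- Map(function(df) {
    missing <- setdiff(all_cols, names(df))
    if (length(missing) > 0) df[missing] <- NA
    df[all_cols]
  }, dfs)
  # Reduce: stack the aligned data frames.
  Reduce(rbind, padded)
}

# Sample inputs, assumed for illustration only.
d1 <- data.frame(id = 1:2, x = c(0.5, 1.5), stringsAsFactors = FALSE)
d2 <- data.frame(id = 3L, y = "a", stringsAsFactors = FALSE)
res <- union_frames(list(d1, d2))
```

Here `res` has the columns `id`, `x`, and `y`, with NA wherever an input lacked the column.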
By-Group Summary with SparkR – Follow-up for A Reader Comment
A reader of my previous post (https://statcompute.wordpress.com/2018/09/03/playing-map-and-reduce-in-r-by-group-calculation), Mr. Wayne Zhang, made a good comment: “Why not use directly either Spark or H2O to derive such computations without involving detailed map/reduce”. Although Spark is not as flexible as R in the statistical c...
1584 sym R (1640 sym/3 pcs)
Monotonic Binning with Equal-Sized Bads for Scorecard Development
In previous posts (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package) and (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), I’ve developed two different algorithms for monotonic binning. While the first tends to generate bins with equal densities, the second wo...
1327 sym R (2169 sym/2 pcs)
Convert Data Frame to Dictionary List in R
In R, there are a couple of ways to convert a column-oriented data frame to a row-oriented dictionary list or the like, e.g. a list of lists. In the code snippet below, I show each approach and how to extract keys and values from the dictionary. As shown in the benchmark, it appears that the generic R data structure is still the most efficient...
782 sym R (2879 sym/1 pcs)
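The post's own snippet and benchmark are not included in this excerpt. As a hedged sketch of what such a conversion can look like in base R, the example below builds a list of lists two ways and then pulls out keys and values. The sample data frame and the variable names are assumptions for illustration.

```r
# Sample data, assumed for illustration only.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# Approach 1: take each one-row slice and convert it to a named list.
rows1 <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, ]))

# Approach 2: walk the columns in parallel with Map(), building one
# named list per row.
rows2 <- do.call(Map, c(f = list, df))

# Keys and values of the first "dictionary".
keys <- names(rows1[[1]])
values <- unname(rows1[[1]])
```

Both approaches give one named list per row, so `rows1[[2]]$name` and `rows2[[2]]$name` refer to the same cell of the original data frame.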
Growing List vs Growing Queue
### GROWING LIST ###
base_lst1 <- function(df) {
  l <- list()
  for (i in seq(nrow(df))) l[[i]] <- as.list(df[i, ])
  return(l)
}

### PRE-ALLOCATING LIST ###
base_lst2 <- function(df) {
  l <- vector(mode = "list", length = nrow(df))
  for (i in seq(nrow(df))) l[[i]] <- as.list(df[i, ])
  return(l)
}

### DEQUER PACKAGE ###
dequer_queue <- func...
431 sym R (3471 sym/1 pcs)
Creating List with Iterator
In the post (https://statcompute.wordpress.com/2018/11/17/growing-list-vs-growing-queue), I showed how to grow a list or a list-like queue based upon a dataframe. In the example, the code snippet relied heavily on the FOR loop to do the assignment item by item, and afterwards I couldn’t help thinking of potential alternatives. For instanc...
2303 sym R (911 sym/5 pcs)