Publications by statcompute

Fastest Way to Add New Variables to A Large Data.Frame


pkgs <- list("hflights", "doParallel", "foreach", "dplyr", "rbenchmark", "data.table")
lapply(pkgs, require, character.only = T)
data(hflights)
benchmark(replications = 10, order = "user.self", relative = "user.self",
  transform = {
    ### THE GENERIC FUNCTION MODIFYING THE DATA.FRAME, SIMILAR TO DATA.FRAME() ###
    transform(hflights, wday ...


More about Flexible Frequency Models


Modeling frequency is one of the most important aspects of operational risk models. The previous post discussed the importance of flexible modeling approaches for both under-dispersion and over-dispersion. In addition to the quasi-poisson r...
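As a rough illustration of the dispersion issue mentioned in this preview, the sketch below fits a quasi-Poisson GLM in base R to simulated counts and reads off the estimated dispersion. The simulated data, the variable names `x`, `y`, `fit`, and `phi`, and the negative binomial settings are assumptions made for this example and are not taken from the post.

```r
# Simulated, over-dispersed counts (hypothetical data for illustration only).
set.seed(1)
x <- rnorm(500)
y <- rnbinom(500, mu = exp(0.5 + 0.3 * x), size = 1)

# Quasi-Poisson regression: same mean model as Poisson, with a free dispersion parameter.
fit <- glm(y ~ x, family = quasipoisson(link = "log"))

# Estimated dispersion: near 1 for Poisson-like data, above 1 under
# over-dispersion, below 1 under under-dispersion.
phi <- summary(fit)$dispersion
print(phi)
```

A dispersion estimate well away from 1 is the usual signal that a plain Poisson model is too rigid for the data.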


Estimate Regression with (Type-I) Pareto Response


The Type-I Pareto distribution has the probability density function f(y; a, k) = k * (a ^ k) / (y ^ (k + 1)). In this formulation, the scale parameter satisfies 0 < a < y and the shape parameter satisfies k > 1. The positive lower bound of the Type-I Pareto distribution is particularly appealing for modeling the severity measure in that there is usually...
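The density above can be explored with a small base R sketch. It is not the regression model from the post: it only evaluates the density, simulates draws by inverting the CDF, and applies the closed-form maximum likelihood estimate of k when a is known. The function names `dpareto1`, `rpareto1`, and `mle_shape` are hypothetical helpers introduced for this example.

```r
# Density of the Type-I Pareto distribution with scale a and shape k:
# f(y; a, k) = k * a^k / y^(k + 1) for y > a, and 0 otherwise.
dpareto1 <- function(y, a, k) {
  ifelse(y > a, k * a^k / y^(k + 1), 0)
}

# Inverse-CDF sampling: F(y) = 1 - (a / y)^k, so y = a / u^(1 / k) with u ~ U(0, 1).
rpareto1 <- function(n, a, k) {
  a / runif(n)^(1 / k)
}

# Closed-form MLE of the shape when the scale a is known:
# k_hat = n / sum(log(y / a)).
mle_shape <- function(y, a) {
  length(y) / sum(log(y / a))
}

set.seed(2016)
y <- rpareto1(10000, a = 2, k = 3)
k_hat <- mle_shape(y, a = 2)
print(k_hat)
```

With a fixed and known, the log-likelihood is n * log(k) + n * k * log(a) - (k + 1) * sum(log(y)), and setting its derivative in k to zero gives the estimator used in `mle_shape`.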


Monotonic Binning with Smbinning Package


The R package smbinning provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in scorecard development. However, I see several opportunities for improvement: 1. First of all, the underlying algorithm in the smbinning() function utili...
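For readers unfamiliar with WoE, the base R sketch below computes it by hand for quartile bins on simulated data. It does not use or reproduce the smbinning() algorithm; the data, the quartile cut points, and the good-over-bad sign convention are all assumptions chosen for illustration.

```r
# Toy data (hypothetical): x is a numeric predictor, bad is a 0/1 outcome.
set.seed(1)
x <- runif(1000)
bad <- rbinom(1000, 1, plogis(-2 + 2 * x))

# Cut x into 4 bins at its quartiles (an illustrative choice only).
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)), include.lowest = TRUE)

# Count goods (bad == 0) and bads (bad == 1) in each bin.
n_good <- tapply(bad == 0, bins, sum)
n_bad <- tapply(bad == 1, bins, sum)

# WoE per bin: log of the bin's share of goods over its share of bads.
# Some references use the opposite sign (bads over goods).
woe <- log((n_good / sum(n_good)) / (n_bad / sum(n_bad)))

# Information value summarizes the separation across all bins.
iv <- sum((n_good / sum(n_good) - n_bad / sum(n_bad)) * woe)
print(round(woe, 3))
```

A binning is called monotonic when the WoE values move in one direction across the ordered bins.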


R Interface to Spark


SparkR
library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")
### SUMMARY TABLE WITH SQL
createOrReplaceTempView(df1, "tbl1")
summ <- sql("select month, avg(dep_time) as avg_dep, avg(arr_time) a...


Joining Tables in SparkR


library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")
grp1 <- groupBy(filter(df1, "month in (1, 2, 3)"), "month")
sum1 <- withColumnRenamed(agg(grp1, min_dep = min(df1$dep_delay)), "month", "mo...


Finer Monotonic Binning Based on Isotonic Regression


In an earlier post, I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple of potential drawbacks due to the ...
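The general idea named in the title can be sketched with isoreg() from base R's stats package. This is not the author's function: it is a minimal illustration on simulated data in which each distinct level of the isotonic fit is treated as one bin. The variable names and the data-generating process are assumptions.

```r
# Toy data (hypothetical): x sorted ascending, y a 0/1 outcome whose rate rises with x.
set.seed(1)
x <- sort(runif(500))
y <- rbinom(500, 1, plogis(-2 + 3 * x))

# isoreg() fits a non-decreasing step function of x to y.
fit <- isoreg(x, y)

# Each distinct fitted level defines one bin. Because x is already sorted,
# fit$yf lines up with x and y. round() guards against floating-point noise.
level <- round(fit$yf, 10)
bin <- match(level, unique(level))

# Event rate and lower cut point of each bin.
bin_rate <- tapply(y, bin, mean)
bin_lower <- tapply(x, bin, min)
print(length(bin_rate))
```

Because the isotonic fit is non-decreasing by construction, the event rates in `bin_rate` are monotonic across bins; in practice very small bins would likely need to be merged before use.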


Using Tweedie Parameter to Identify Distributions


In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity? In my previous post https://statcompute.wordpress.c...
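For context on the title, the Tweedie family is usually characterized by its mean-variance relationship, summarized below as a general fact about the family rather than a reconstruction of the post. The symbols \phi (dispersion), \mu (mean), and p (variance power) are notation chosen here.

```latex
\operatorname{Var}(Y) = \phi \, \mu^{p}, \qquad
\begin{array}{ll}
p = 0 & \text{Normal} \\
p = 1 & \text{Poisson} \\
1 < p < 2 & \text{compound Poisson-Gamma} \\
p = 2 & \text{Gamma} \\
p = 3 & \text{Inverse Gaussian}
\end{array}
```

Under this view, an estimated p close to 2 would point toward the Gamma distribution for severity, and one close to 3 toward the Inverse Gaussian.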


H2O Benchmark for CSV Import


The importFile() function in H2O is extremely efficient because it reads files in parallel. The benchmark comparison below shows that it is comparable to read.df() in SparkR and significantly faster than the generic read.csv().
library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local", spar...


GLM with H2O in R


Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than that generated by the generic R glm().
> library(h2o)
> h2o.init(max_mem_size = "12g")
> df1 <- h2o.uploadFile("Documents/credit_count.txt", header = TRUE, sep = ",", parse_type = "CSV")
> df2 <- h2o.assign(df1[df1$CA...
