Publications by statcompute

Fastest Way to Add New Variables to A Large Data.Frame

30.10.2016

pkgs <- list("hflights", "doParallel", "foreach", "dplyr", "rbenchmark", "data.table")
lapply(pkgs, require, character.only = T)
data(hflights)
benchmark(replications = 10, order = "user.self", relative = "user.self",
  transform = {
    ### THE GENERIC FUNCTION MODIFYING THE DATA.FRAME, SIMILAR TO DATA.FRAME() ###
    transform(hflights, wday ...
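
The excerpt above is cut off; the sketch below is a minimal version of the same comparison, assuming the hflights data and an illustrative derived column wday (the exact definition used in the post is not shown here).

library(hflights)
library(dplyr)
library(data.table)
library(rbenchmark)

data(hflights)

benchmark(replications = 10, order = "user.self", relative = "user.self",
  transform = {
    ## base R transform(), similar to data.frame()
    transform(hflights, wday = DayOfWeek > 5)
  },
  mutate = {
    ## dplyr mutate()
    mutate(hflights, wday = DayOfWeek > 5)
  },
  data.table = {
    ## data.table reference semantics with := (conversion cost included)
    dt <- as.data.table(hflights)
    dt[, wday := DayOfWeek > 5]
  }
)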


More about Flexible Frequency Models

27.11.2016

Modeling the frequency is one of the most important aspects of operational risk models. In the previous post (https://statcompute.wordpress.com/2016/05/13/more-flexible-approaches-to-model-frequency), the importance of flexible modeling approaches for both under-dispersion and over-dispersion was discussed. In addition to the quasi-Poisson r...
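
The excerpt is truncated at the mention of quasi-Poisson regression; a minimal sketch of that idea on simulated data (the variable names and coefficients below are illustrative only) is:

set.seed(2016)
df <- data.frame(x = rnorm(1000))
df$y <- rpois(1000, lambda = exp(0.5 + 0.3 * df$x))

## quasi-Poisson lets the dispersion parameter be estimated from the data,
## so the same fit accommodates both under-dispersion (< 1) and over-dispersion (> 1)
m <- glm(y ~ x, data = df, family = quasipoisson(link = "log"))
summary(m)$dispersion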


Estimate Regression with (Type-I) Pareto Response

11.12.2016

The Type-I Pareto distribution has the probability density function shown below: f(y; a, k) = k * (a ^ k) / (y ^ (k + 1)). In this formulation, the scale parameter satisfies 0 < a < y and the shape parameter k > 1. The positive lower bound of the Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually...
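
The excerpt does not show the estimation itself; below is a minimal sketch of fitting such a regression by maximum likelihood with optim(), where the shape parameter is modeled as k = exp(b0 + b1 * x) and the lower bound a is treated as known. The simulated data and the parameterization are assumptions for illustration, not the exact setup of the post.

set.seed(2016)
n <- 2000
a <- 1                              # known positive lower bound (scale parameter)
x <- rnorm(n)
k <- exp(1 + 0.5 * x)               # shape parameter as a log-linear function of x
y <- a * runif(n) ^ (-1 / k)        # Type-I Pareto draws by inverse-CDF sampling

## negative log-likelihood based on f(y; a, k) = k * a^k / y^(k + 1)
nll <- function(b) {
  k.hat <- exp(b[1] + b[2] * x)
  -sum(log(k.hat) + k.hat * log(a) - (k.hat + 1) * log(y))
}

fit <- optim(c(0, 0), nll, method = "BFGS")
fit$par                             # should recover roughly c(1, 0.5)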


Monotonic Binning with Smbinning Package

22.01.2017

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in scorecard development. However, in my view there are several opportunities for improvement: 1. First of all, the underlying algorithm in the smbinning() function utili...
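
As a point of reference for the discussion, a minimal sketch of calling smbinning() on simulated data is shown below; the column names and the simulated relationship are assumptions for illustration.

library(smbinning)

set.seed(2017)
df <- data.frame(score = round(runif(5000, 300, 850)))
df$bad <- rbinom(5000, 1, prob = plogis(5 - 0.01 * df$score))   # default rate decreasing in score

result <- smbinning(df = df, y = "bad", x = "score", p = 0.05)
result$ivtable   # binning table with WoE and IV for each bin
result$iv        # total information value of the binned variable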


R Interface to Spark

08.06.2017

SparkR

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")

### SUMMARY TABLE WITH SQL
createOrReplaceTempView(df1, "tbl1")
summ <- sql("select month, avg(dep_time) as avg_dep, avg(arr_time) a...
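
The SQL statement in the excerpt is cut off; a minimal sketch of how such a summary might be completed is below, where the selected columns and the local nycflights13.csv file are assumptions carried over from the excerpt.

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")

df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")

## register the Spark DataFrame as a temporary view and summarize it with SQL
createOrReplaceTempView(df1, "tbl1")
summ <- sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 group by month")
head(summ)

sparkR.session.stop()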


Joining Tables in SparkR

12.06.2017

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")
grp1 <- groupBy(filter(df1, "month in (1, 2, 3)"), "month")
sum1 <- withColumnRenamed(agg(grp1, min_dep = min(df1$dep_delay)), "month", "mo...
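
The excerpt stops in the middle of renaming the grouping key; the sketch below completes one plausible version of the join, with the renamed keys, the second aggregation, and the join type being assumptions for illustration.

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")

df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")

grp1 <- groupBy(filter(df1, "month in (1, 2, 3)"), "month")
sum1 <- withColumnRenamed(agg(grp1, min_dep = min(df1$dep_delay)), "month", "month1")

grp2 <- groupBy(filter(df1, "month in (2, 3, 4)"), "month")
sum2 <- withColumnRenamed(agg(grp2, max_dep = max(df1$dep_delay)), "month", "month2")

## inner join on the two renamed month keys
head(join(sum1, sum2, sum1$month1 == sum2$month2, "inner"))

sparkR.session.stop()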


Finer Monotonic Binning Based on Isotonic Regression

15.06.2017

In an earlier post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple of potential drawbacks due to the ...
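
A minimal sketch of the underlying idea, namely fitting an isotonic regression of the binary outcome on the numeric driver and reading candidate monotonic bins off the fitted step function, is given below; the simulated data and the increasing trend are assumptions for illustration.

set.seed(2017)
x <- sort(round(runif(5000, 300, 850)))
y <- rbinom(5000, 1, prob = plogis(0.01 * x - 8))   # bad rate increasing in x

## isoreg() fits a non-decreasing step function of the bad rate on x;
## for a decreasing trend, fit on -x instead
iso <- isoreg(x, y)

## the knots of the step function are natural candidates for bin boundaries
cbind(knot = x[iso$iKnots], fitted_rate = iso$yf[iso$iKnots])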


Using Tweedie Parameter to Identify Distributions

24.06.2017

In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity? In my previous post https://statcompute.wordpress.c...
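
Beyond the truncated reference, a minimal sketch of the idea is to profile the Tweedie power parameter p: a profile MLE near 2 points to the Gamma distribution, while a value near 3 points to the Inverse Gaussian. The simulated severities and the candidate grid of p below are assumptions for illustration.

library(tweedie)

set.seed(2017)
y <- rgamma(1000, shape = 2, scale = 50)    # pretend severity measures

## profile likelihood of the Tweedie power parameter under a log link
prof <- tweedie.profile(y ~ 1, p.vec = seq(1.5, 3.5, by = 0.1),
                        link.power = 0, do.plot = FALSE)
prof$p.max    # close to 2 here, consistent with a Gamma severity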


H2O Benchmark for CSV Import

25.06.2017

The importFile() function in H2O is extremely efficient due to its parallel reading. The benchmark comparison below shows that it is comparable to read.df() in SparkR and significantly faster than the generic read.csv().

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local", spar...
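
A minimal sketch of such a timing comparison between the parallel h2o.importFile() and the generic read.csv() is shown below; the local nycflights13.csv file and the memory setting are placeholders.

library(h2o)
h2o.init(max_mem_size = "4g")

system.time(df1 <- h2o.importFile("nycflights13.csv"))   # parallel import into H2O
system.time(df2 <- read.csv("nycflights13.csv"))         # single-threaded base R reader

h2o.shutdown(prompt = FALSE)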


GLM with H2O in R

27.06.2017

Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm().

> library(h2o)
> h2o.init(max_mem_size = "12g")
> df1 <- h2o.uploadFile("Documents/credit_count.txt", header = TRUE, sep = ",", parse_type = "CSV")
> df2 <- h2o.assign(df1[df1$CA...
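
A minimal sketch of the fit itself is below; the credit_count.txt file, the response DEFAULT, and the predictor names are assumptions for illustration rather than the exact columns used in the post.

library(h2o)
h2o.init(max_mem_size = "12g")

df1 <- h2o.uploadFile("credit_count.txt", header = TRUE, sep = ",")
df1$DEFAULT <- as.factor(df1$DEFAULT)   # binomial GLM in H2O expects a categorical response

m <- h2o.glm(x = c("AGE", "ACADMOS", "MINORDRG", "OWNRENT"),
             y = "DEFAULT",
             training_frame = df1,
             family = "binomial",
             lambda = 0)   # lambda = 0 turns off regularization, mirroring glm()
summary(m)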
