Publications by statcompute
Fastest Way to Add New Variables to A Large Data.Frame
pkgs <- list("hflights", "doParallel", "foreach", "dplyr", "rbenchmark", "data.table") lapply(pkgs, require, character.only = T) data(hflights) benchmark(replications = 10, order = "user.self", relative = "user.self", transform = { ### THE GENERIC FUNCTION MODIFYING THE DATA.FRAME, SIMILAR TO DATA.FRAME() ### transform(hflights, wday ...
433 sym R (1896 sym/1 pcs) 4 img
More about Flexible Frequency Models
Modeling the frequency is one of the most important aspects in operational risk models. In the previous post (https://statcompute.wordpress.com/2016/05/13/more-flexible-approaches-to-model-frequency), the importance of flexible modeling approaches for both under-dispersion and over-dispersion has been discussed. In addition to the quasi-poisson r...
1453 sym R (1428 sym/4 pcs) 4 img
Estimate Regression with (Type-I) Pareto Response
The Type-I Pareto distribution has a probability function shown as below f(y; a, k) = k * (a ^ k) / (y ^ (k + 1)) In the formulation, the scale parameter 0 < a < y and the shape parameter k > 1 . The positive lower bound of Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually...
2102 sym R (997 sym/2 pcs) 4 img
Monotonic Binning with Smbinning Package
The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view: 1. First of all, the underlying algorithm in the smbinning() function utili...
2505 sym R (5080 sym/5 pcs) 4 img
R Interface to Spark
SparkR library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = "")) sc <- sparkR.session(master = "local") df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true") ### SUMMARY TABLE WITH SQL createOrReplaceTempView(df1, "tbl1") summ <- sql("select month, avg(dep_time) as avg_dep, avg(arr_time) a...
450 sym R (1649 sym/2 pcs) 4 img
Joining Tables in SparkR
library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = "")) sc <- sparkR.session(master = "local") df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true") grp1 <- groupBy(filter(df1, "month in (1, 2, 3)"), "month") sum1 <- withColumnRenamed(agg(grp1, min_dep = min(df1$dep_delay)), "month", "mo...
433 sym R (2027 sym/1 pcs) 4 img
Finer Monotonic Binning Based on Isotonic Regression
In my early post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple potential drawbacks due to the ...
1485 sym R (10039 sym/5 pcs) 4 img
Using Tweedie Parameter to Identify Distributions
In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity? In my previous post https://statcompute.wordpress.c...
2350 sym R (786 sym/1 pcs) 4 img
H2O Benchmark for CSV Import
The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv(). library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = "")) sc <- sparkR.session(master = "local", spar...
659 sym R (1104 sym/1 pcs) 4 img
GLM with H2O in R
Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm(). > library(h2o) > h2o.init(max_mem_size = "12g") > df1 <- h2o.uploadFile("Documents/credit_count.txt", header = TRUE, sep = ",", parse_type = "CSV") > df2 <- h2o.assign(df1[df1$CA...
600 sym R (2281 sym/1 pcs) 4 img