Publications by statcompute
Row Search in Parallel
I’ve always wondered whether the efficiency of row search can be improved if the whole data.frame is split into chunks and the search is then conducted within each chunk in parallel. In the R code below, a comparison is done between the standard row search and the parallel row search with the FOREACH package. The result is very en...
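As a rough illustration of the idea, below is a minimal sketch with foreach and doParallel on simulated data; the column names, chunk count, and core count are assumptions for the example, not the original benchmark.

library(foreach)
library(doParallel)
registerDoParallel(cores = 4)

# simulated data.frame standing in for the original data
set.seed(1)
df <- data.frame(ArrTime = sample(0:2400, 1e6, replace = TRUE),
                 Origin  = sample(c('ABE', 'JFK', 'ORD'), 1e6, replace = TRUE))

# standard row search
t1 <- df[df$ArrTime == 1500 & df$Origin == 'ABE', ]

# split the data.frame into chunks and search each chunk in parallel
chunks <- split(df, cut(seq_len(nrow(df)), 4, labels = FALSE))
t2 <- foreach(d = chunks, .combine = rbind) %dopar% {
  d[d$ArrTime == 1500 & d$Origin == 'ABE', ]
}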
Vector Search vs. Binary Search
# REFERENCE:
# user2014.stat.ucla.edu/files/tutorial_Matt.pdf
pkgs <- c('data.table', 'rbenchmark')
lapply(pkgs, require, character.only = T)
load('2008.Rdata')
dt <- data.table(data)
benchmark(replications = 10, order = "elapsed",
  vector_search = {
    test1 <- dt[ArrTime == 1500 & Origin == 'ABE', ]
  },
  binary_search = {
    setkey(dt,...
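A minimal, self-contained sketch of the same comparison on simulated data (the airline file isn't included here, so the key columns are mocked up):

library(data.table)
library(rbenchmark)

set.seed(1)
dt <- data.table(ArrTime = sample(0:2400, 1e6, replace = TRUE),
                 Origin  = sample(c('ABE', 'JFK', 'ORD'), 1e6, replace = TRUE))

benchmark(replications = 10, order = "elapsed",
  vector_search = {
    test1 <- dt[ArrTime == 1500 & Origin == 'ABE', ]
  },
  binary_search = {
    setkey(dt, ArrTime, Origin)
    test2 <- dt[.(1500L, 'ABE'), ]
  }
)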
By-Group Aggregation in Parallel
Similar to the row search, by-group aggregation is another perfect use case for demonstrating the power of split-and-conquer with parallelism. The example below shows that a homebrew by-group aggregation with the foreach package, albeit inefficiently coded, is still a lot faster than the summarize() function in the Hmisc package. load('2008.Rdat...
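A minimal sketch of the split-and-conquer aggregation with foreach on simulated data; the grouping variable and the statistic are assumptions for illustration.

library(foreach)
library(doParallel)
registerDoParallel(cores = 4)

set.seed(1)
df <- data.frame(Month = sample(1:12, 1e6, replace = TRUE),
                 ArrDelay = rnorm(1e6, 10, 5))

# split by group, aggregate each group in parallel, then stack the results
grps <- split(df, df$Month)
agg <- foreach(g = grps, .combine = rbind) %dopar% {
  data.frame(Month = g$Month[1], AvgDelay = mean(g$ArrDelay))
}
agg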
Fitting Lasso with Julia
Julia Code

using RDatasets, DataFrames, GLMNet
data = dataset("MASS", "Boston");
y = array(data[:, 14]);
x = array(data[:, 1:13]);
cv = glmnetcv(x, y);
cv.path.betas[:, indmin(cv.meanloss)];
result = DataFrame();
result[:Vars] = names(data)[1:13];
result[:Beta] = cv.path.betas[:, indmin(cv.meanloss)];
result
# | Row | Vars | Beta |
# ...
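For readers on the R side, a rough equivalent of the Julia fit with glmnet, where lambda.min plays the role of indmin(cv.meanloss); this is a sketch, not the post's code.

library(MASS)
library(glmnet)

x <- as.matrix(Boston[, 1:13])
y <- Boston$medv

# cross-validated lasso path; pick the lambda with the smallest CV loss
cv <- cv.glmnet(x, y, alpha = 1)
coef(cv, s = "lambda.min")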
Estimating a Beta Regression with The Variable Dispersion in R
pkgs <- c('sas7bdat', 'betareg', 'lmtest')
lapply(pkgs, require, character.only = T)
df1 <- read.sas7bdat("lgd.sas7bdat")
df2 <- df1[which(df1$y < 1), ]
xvar <- paste("x", 1:7, sep = '', collapse = " + ")
fml1 <- as.formula(paste("y ~ ", xvar))
fml2 <- as.formula(paste("y ~ ", xvar, "|", xvar))
# FIT A BETA MODEL WITH THE FIXED PHI
beta1 <- be...
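The lgd data above is the post's own file, so here is a minimal stand-in illustrating the fixed-phi versus variable-phi comparison on betareg's built-in GasolineYield data.

library(betareg)
library(lmtest)
data("GasolineYield", package = "betareg")

# fixed dispersion: phi is a single constant
beta1 <- betareg(yield ~ batch + temp, data = GasolineYield)

# variable dispersion: the term after "|" models phi
beta2 <- betareg(yield ~ batch + temp | temp, data = GasolineYield)

# likelihood ratio test of whether the variable dispersion improves the fit
lrtest(beta1, beta2)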
Model Segmentation with Recursive Partitioning
library(party)
df1 <- read.csv("credit_count.csv")
df2 <- df1[df1$CARDHLDR == 1, ]
mdl <- mob(DEFAULT ~ MAJORDRG + MINORDRG + INCOME + OWNRENT | AGE + SELFEMPL,
           data = df2, family = binomial(),
           control = mob_control(minsplit = 1000), model = glinearModel)
print(mdl)
#1) AGE <= 22.91667; criterion = 1, statistic = 48.255
# 2)* weights = 1116 ...
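The credit data is also the post's own file; below is a self-contained sketch of the same mob() idea on the public PimaIndiansDiabetes data (the minsplit value is an assumption).

library(party)
library(mlbench)
data("PimaIndiansDiabetes", package = "mlbench")

# logistic regression of diabetes on glucose, with the tree splitting on the other covariates
mdl <- mob(diabetes ~ glucose | pregnant + pressure + triceps + insulin +
             mass + pedigree + age,
           data = PimaIndiansDiabetes, family = binomial(),
           control = mob_control(minsplit = 50), model = glinearModel)
print(mdl)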
Flexible Beta Modeling
library(betareg)
library(sas7bdat)
df1 <- read.sas7bdat('lgd.sas7bdat')
df2 <- df1[df1$y < 1, ]
fml <- as.formula('y ~ x2 + x3 + x4 + x5 + x6 | x3 + x4 | x1 + x2')
### LATENT-CLASS BETA REGRESSION: AIC = -565 ###
mdl1 <- betamix(fml, data = df2, k = 2,
                FLXcontrol = list(iter.max = 500, minprior = 0.1))
print(mdl1)
#betamix(formula = fml, data ...
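A minimal latent-class sketch on betareg's public ReadingSkills data, since the lgd file isn't distributed with the post; the two-component formula here is only an illustration.

library(betareg)
data("ReadingSkills", package = "betareg")

# two-component latent-class beta regression of accuracy on iq
mdl <- betamix(accuracy ~ iq, data = ReadingSkills, k = 2,
               FLXcontrol = list(iter.max = 500, minprior = 0.1))
print(mdl)
summary(mdl)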
Query Pandas DataFrame with SQL
Similar to the SQLDF package, which provides a seamless interface between SQL statements and R data.frames, PANDASQL allows Python users to query Pandas DataFrames with SQL. Below are some examples showing how to use PANDASQL for SELECT / AGGREGATE / JOIN operations. More information is available on GitHub (https://github.com/yhat/pandasql). In ...
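For comparison on the R side, the SQLDF interface mentioned above works roughly as in the sketch below; the data.frame and queries are made up for illustration.

library(sqldf)

df <- data.frame(id = 1:6,
                 grp = c('a', 'a', 'b', 'b', 'c', 'c'),
                 x = c(10, 20, 30, 40, 50, 60))
lkp <- data.frame(grp = c('a', 'b', 'c'), label = c('alpha', 'beta', 'gamma'))

# SELECT
sqldf("select * from df where grp = 'a'")

# AGGREGATE
sqldf("select grp, avg(x) as avg_x from df group by grp")

# JOIN
sqldf("select d.*, l.label from df as d inner join lkp as l on d.grp = l.grp")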
Download Federal Reserve Economic Data (FRED) with Python
In operational loss calculations, it is important to use the CPI (Consumer Price Index) to adjust historical losses. Below is an example showing how to download CPI data directly from the Federal Reserve Bank of St. Louis and then calculate monthly and quarterly CPI adjustment factors with Python. In [1]: import pandas_datareader.data as web...
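An analogous pull in R with quantmod, grabbing the CPIAUCSL series from FRED and deriving a simple adjustment factor; the exact factor definition here is an assumption about what the post computes.

library(quantmod)

# monthly CPI for all urban consumers, straight from FRED
getSymbols("CPIAUCSL", src = "FRED")

# adjustment factor: latest CPI divided by the CPI of each historical month
adj <- as.numeric(last(CPIAUCSL)) / CPIAUCSL
tail(adj)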
Model Segmentation with Cubist
Cubist is a tree-based model with an OLS regression attached to each terminal node and is somewhat similar to the mob() function in the party package (https://statcompute.wordpress.com/2014/10/26/model-segmentation-with-recursive-partitioning). Below is a demonstration of the cubist() model with the classic Boston housing data. pkgs <- c('MASS', 'Cubist', '...
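A minimal sketch of that demonstration with the Cubist package on the Boston data; the committees setting and the fit statistic shown are assumptions, not necessarily the post's.

library(MASS)
library(Cubist)

# medv regressed on the remaining 13 predictors
mdl <- cubist(x = Boston[, -14], y = Boston$medv, committees = 1)
summary(mdl)

# in-sample predictions and a quick R-squared check
pred <- predict(mdl, Boston[, -14])
cor(pred, Boston$medv)^2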