Row Search in Parallel


I have always wondered whether row search can be made more efficient by splitting the whole data.frame into chunks and then searching each chunk in parallel. In the R code below, the standard row search is compared with a parallel row search using the FOREACH package. The result is very en...
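The split-then-search idea can be sketched in Python as follows. This is not the R code from the post: the rows are made up, the column names `ArrTime` and `Origin` are borrowed for illustration, and `make_chunks`, `search_chunk` and `parallel_row_search` are hypothetical helpers. Threads are used only to keep the sketch self-contained; in CPython they would not speed up a CPU-bound scan, so a process pool would be the closer analogue of foreach workers.

```python
from concurrent.futures import ThreadPoolExecutor


def make_chunks(rows, n_chunks):
    """Split a list of rows into at most n_chunks contiguous slices."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def search_chunk(chunk, arr_time, origin):
    """Standard row search, restricted to a single chunk."""
    return [r for r in chunk
            if r["ArrTime"] == arr_time and r["Origin"] == origin]


def parallel_row_search(rows, arr_time, origin, n_chunks=4):
    """Search every chunk concurrently and concatenate the hits."""
    chunks = make_chunks(rows, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = pool.map(lambda c: search_chunk(c, arr_time, origin), chunks)
        # map() yields results in chunk order, so row order is preserved
        return [r for part in parts for r in part]


# Made-up rows, for illustration only
rows = [{"id": i,
         "ArrTime": 1500 if i % 5 == 0 else 900,
         "Origin": "ABE" if i % 3 == 0 else "ATL"}
        for i in range(30)]

serial = search_chunk(rows, 1500, "ABE")
parallel = parallel_row_search(rows, 1500, "ABE")
```

Both calls return the same rows; only the way the scan is scheduled differs.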

Vector Search vs. Binary Search


# REFERENCE:
#
pkgs <- c('data.table', 'rbenchmark')
lapply(pkgs, require, character.only = T)
load('2008.Rdata')
dt <- data.table(data)
benchmark(replications = 10, order = "elapsed",
  vector_search = {
    test1 <- dt[ArrTime == 1500 & Origin == 'ABE', ]
  },
  binary_search = {
    setkey(dt,...
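The difference between the two lookups can be illustrated outside data.table with a small Python sketch. It assumes made-up rows stored as `(ArrTime, Origin, row_id)` tuples, and `vector_search` and `binary_search` here are hypothetical helper names, not data.table functions. Sorting the list once plays the role that `setkey()` plays in the R code.

```python
from bisect import bisect_left, bisect_right


def vector_search(rows, arr_time, origin):
    """Linear scan: test every row against both conditions."""
    return [r for r in rows if r[0] == arr_time and r[1] == origin]


def binary_search(sorted_rows, arr_time, origin):
    """Find the block of matching rows in a list sorted by (ArrTime, Origin)."""
    # A shorter tuple sorts before any longer tuple sharing its prefix
    lo = bisect_left(sorted_rows, (arr_time, origin))
    # Row ids are finite, so this key sorts after every matching row
    hi = bisect_right(sorted_rows, (arr_time, origin, float("inf")))
    return sorted_rows[lo:hi]


# Made-up rows, for illustration only: (ArrTime, Origin, row_id)
rows = [(1500 if i % 5 == 0 else 900, "ABE" if i % 3 == 0 else "ATL", i)
        for i in range(30)]
sorted_rows = sorted(rows)  # one-off cost, comparable to setting a key

hits_vector = vector_search(rows, 1500, "ABE")
hits_binary = binary_search(sorted_rows, 1500, "ABE")
```

The linear scan touches every row on every query, while the binary search needs only a logarithmic number of comparisons once the data is sorted.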

By-Group Aggregation in Parallel


Like row search, by-group aggregation is another good use case for demonstrating the power of split-and-conquer with parallelism. The example below shows that a homebrew by-group aggregation with the foreach package, albeit inefficiently coded, is still a lot faster than the summarize() function in the Hmisc package.
load('2008.Rdat...
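A minimal Python sketch of the same split-and-conquer pattern is given below. It is not the foreach code from the post: the rows and the `origin` and `delay` fields are made up, and `split_by_group`, `summarize_group` and `parallel_aggregate` are hypothetical helpers. Threads stand in for parallel workers here; a process pool would be needed for a real speed-up on CPU-bound work in CPython.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def split_by_group(rows, key):
    """Split step: bucket the rows by the value of the grouping column."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return groups


def summarize_group(item):
    """Conquer step: aggregate one group independently of the others."""
    name, members = item
    delays = [r["delay"] for r in members]
    return name, {"n": len(delays), "mean_delay": mean(delays)}


def parallel_aggregate(rows, key, workers=4):
    """Aggregate every group concurrently and collect the summaries."""
    groups = split_by_group(rows, key)
    items = sorted(groups.items(), key=lambda kv: kv[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(summarize_group, items))


# Made-up rows, for illustration only
rows = [
    {"origin": "ABE", "delay": 10},
    {"origin": "ATL", "delay": 4},
    {"origin": "ABE", "delay": 20},
    {"origin": "ATL", "delay": 6},
    {"origin": "ATL", "delay": 8},
]
summary = parallel_aggregate(rows, "origin")
```

Because each group is summarized without reference to any other group, the work divides cleanly across workers.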

Fitting Lasso with Julia


Julia Code
using RDatasets, DataFrames, GLMNet
data = dataset("MASS", "Boston");
y = array(data[:, 14]);
x = array(data[:, 1:13]);
cv = glmnetcv(x, y);
cv.path.betas[:, indmin(cv.meanloss)];
result = DataFrame();
result[:Vars] = names(data)[1:13];
result[:Beta] = cv.path.betas[:, indmin(cv.meanloss)];
result
# | Row | Vars | Beta |
# ...

Estimating a Beta Regression with The Variable Dispersion in R


pkgs <- c('sas7bdat', 'betareg', 'lmtest')
lapply(pkgs, require, character.only = T)
df1 <- read.sas7bdat("lgd.sas7bdat")
df2 <- df1[which(df1$y < 1), ]
xvar <- paste("x", 1:7, sep = '', collapse = " + ")
fml1 <- as.formula(paste("y ~ ", xvar))
fml2 <- as.formula(paste("y ~ ", xvar, "|", xvar))
# FIT A BETA MODEL WITH THE FIXED PHI
beta1 <- be...

Model Segmentation with Recursive Partitioning


library(party)
df1 <- read.csv("credit_count.csv")
df2 <- df1[df1$CARDHLDR == 1, ]
mdl <- mob(DEFAULT ~ MAJORDRG + MINORDRG + INCOME + OWNRENT | AGE + SELFEMPL,
           data = df2, family = binomial(),
           control = mob_control(minsplit = 1000), model = glinearModel)
print(mdl)
#1) AGE <= 22.91667; criterion = 1, statistic = 48.255
# 2)* weights = 1116 ...

Flexible Beta Modeling


library(betareg)
library(sas7bdat)
df1 <- read.sas7bdat('lgd.sas7bdat')
df2 <- df1[df1$y < 1, ]
fml <- as.formula('y ~ x2 + x3 + x4 + x5 + x6 | x3 + x4 | x1 + x2')
### LATENT-CLASS BETA REGRESSION: AIC = -565 ###
mdl1 <- betamix(fml, data = df2, k = 2,
                FLXcontrol = list(iter.max = 500, minprior = 0.1))
print(mdl1)
#betamix(formula = fml, data ...

Query Pandas DataFrame with SQL


Just as the SQLDF package provides a seamless interface between SQL statements and R data.frames, PANDASQL allows Python users to query Pandas DataFrames with SQL. Below are some examples showing how to use PANDASQL to do SELECT / AGGREGATE / JOIN operations. More information is also available on GitHub ( In ...
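For readers without pandasql at hand, the three kinds of operation can be previewed with the standard-library `sqlite3` module alone. This is a dependency-free sketch, not the pandasql examples from the post: the `loans` and `states` tables and all their values are made up, and the queries run against in-memory SQLite tables rather than DataFrames.

```python
import sqlite3

# In-memory database holding two small made-up tables
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE loans (id INTEGER, state TEXT, amount REAL);
CREATE TABLE states (state TEXT, region TEXT);
INSERT INTO loans VALUES (1, 'OH', 100.0), (2, 'OH', 300.0), (3, 'NY', 50.0);
INSERT INTO states VALUES ('OH', 'Midwest'), ('NY', 'Northeast');
""")

# SELECT: filter rows and pick columns
select_rows = con.execute(
    "SELECT id, amount FROM loans WHERE amount > 75 ORDER BY id"
).fetchall()

# AGGREGATE: count and average by group
agg_rows = con.execute(
    "SELECT state, COUNT(*), AVG(amount) FROM loans "
    "GROUP BY state ORDER BY state"
).fetchall()

# JOIN: attach the region from the lookup table
join_rows = con.execute(
    "SELECT l.id, s.region FROM loans l "
    "JOIN states s ON l.state = s.state ORDER BY l.id"
).fetchall()

con.close()
```

The SQL text is the part that carries over; with pandasql the same style of statement would be written against DataFrame names instead of SQLite tables.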

Download Federal Reserve Economic Data (FRED) with Python


In operational loss calculations, it is important to adjust historical losses with the CPI (Consumer Price Index). Below is an example showing how to download CPI data directly from the Federal Reserve Bank of St. Louis and then calculate monthly and quarterly CPI adjustment factors with Python. In [1]: import as web...
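The arithmetic behind an adjustment factor can be sketched without any download. The excerpt does not show the formula, so this sketch makes two assumptions: the factor for a period is the latest CPI divided by that period's CPI, and a quarterly CPI is the mean of its monthly values. The CPI numbers are made up, and `monthly_factors` and `quarterly_factors` are hypothetical helpers.

```python
def monthly_factors(cpi):
    """Factor per month = latest CPI / CPI of that month (assumed formula)."""
    base = cpi[max(cpi)]  # "YYYY-MM" keys sort chronologically as strings
    return {month: base / value for month, value in cpi.items()}


def quarterly_factors(cpi):
    """Average the monthly CPI within each quarter, then rebase to the latest."""
    by_quarter = {}
    for month, value in cpi.items():
        year, mm = month.split("-")
        quarter = f"{year}Q{(int(mm) - 1) // 3 + 1}"
        by_quarter.setdefault(quarter, []).append(value)
    averages = {q: sum(v) / len(v) for q, v in by_quarter.items()}
    base = averages[max(averages)]
    return {q: base / avg for q, avg in averages.items()}


# Made-up CPI levels, for illustration only (not actual FRED data)
cpi = {
    "2013-01": 100.0, "2013-02": 100.0, "2013-03": 100.0,
    "2013-04": 110.0, "2013-05": 110.0, "2013-06": 110.0,
}
monthly = monthly_factors(cpi)
quarterly = quarterly_factors(cpi)

# A historical loss is restated by multiplying it by its period's factor
adjusted_loss = 2000.0 * monthly["2013-01"]
```

Under these assumptions a loss booked in an earlier, lower-CPI period is scaled up to the price level of the most recent period.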

Model Segmentation with Cubist


Cubist is a tree-based model with an OLS regression attached to each terminal node and is somewhat similar to the mob() function in the Party package ( Below is a demonstration of the cubist() model with the classic Boston housing data. pkgs <- c('MASS', 'Cubist', '...
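The idea of a regression attached to each terminal node can be shown with a toy Python sketch. This is not the Cubist algorithm: it assumes a single predictor, a single split at a hand-picked threshold (Cubist learns its rules from the data), and made-up observations. `ols`, `fit_model_tree` and `predict` are hypothetical helpers.

```python
def ols(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, slope


def fit_model_tree(xs, ys, threshold):
    """Split the data once at `threshold` and fit one OLS line per leaf.

    Assumes both sides of the split contain at least one observation.
    """
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    return {
        "threshold": threshold,
        "left": ols(*zip(*left)),
        "right": ols(*zip(*right)),
    }


def predict(tree, x):
    """Route x to its leaf, then apply that leaf's regression line."""
    side = "left" if x <= tree["threshold"] else "right"
    intercept, slope = tree[side]
    return intercept + slope * x


# Made-up data: y = 1 + 2x for x <= 3, and y = 20 - x for x > 3
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1, 3, 5, 7, 16, 15, 14, 13]
tree = fit_model_tree(xs, ys, threshold=3)
```

Each leaf recovers the line that generated its own segment, which a single regression over all eight points could not do.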

