Publications by statcompute
Row Search in Parallel
I’ve always wondered whether the efficiency of row search can be improved if the whole data.frame is split into chunks and the search is then conducted within each chunk in parallel. In the R code below, a comparison is done between the standard row search and the parallel row search with the FOREACH package. The result is very en...
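As a rough illustration of the idea, below is a minimal sketch with foreach and doParallel on simulated data; the column names, chunk count, and core count are assumptions for the example, not the original benchmark.

library(foreach)
library(doParallel)
registerDoParallel(cores = 4)

# simulated data.frame standing in for the original data
set.seed(1)
df <- data.frame(ArrTime = sample(0:2400, 1e6, replace = TRUE),
                 Origin  = sample(c('ABE', 'JFK', 'ORD'), 1e6, replace = TRUE))

# standard row search
t1 <- df[df$ArrTime == 1500 & df$Origin == 'ABE', ]

# split the data.frame into chunks and search each chunk in parallel
chunks <- split(df, cut(seq_len(nrow(df)), 4, labels = FALSE))
t2 <- foreach(d = chunks, .combine = rbind) %dopar% {
  d[d$ArrTime == 1500 & d$Origin == 'ABE', ]
}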
Vector Search vs. Binary Search
# REFERENCE:
# user2014.stat.ucla.edu/files/tutorial_Matt.pdf
pkgs <- c('data.table', 'rbenchmark')
lapply(pkgs, require, character.only = T)
load('2008.Rdata')
dt <- data.table(data)
benchmark(replications = 10, order = "elapsed",
  vector_search = {
    test1 <- dt[ArrTime == 1500 & Origin == 'ABE', ]
  },
  binary_search = {
    setkey(dt,...
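A minimal, self-contained sketch of the same comparison on simulated data (the airline file isn't included here, so the key columns are mocked up):

library(data.table)
library(rbenchmark)

set.seed(1)
dt <- data.table(ArrTime = sample(0:2400, 1e6, replace = TRUE),
                 Origin  = sample(c('ABE', 'JFK', 'ORD'), 1e6, replace = TRUE))

benchmark(replications = 10, order = "elapsed",
  vector_search = {
    test1 <- dt[ArrTime == 1500 & Origin == 'ABE', ]
  },
  binary_search = {
    setkey(dt, ArrTime, Origin)
    test2 <- dt[.(1500L, 'ABE'), ]
  }
)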
By-Group Aggregation in Parallel
Similar to the row search, by-group aggregation is another perfect use case for demonstrating the power of split-and-conquer with parallelism. The example below shows that a homebrew by-group aggregation with the foreach package, albeit inefficiently coded, is still a lot faster than the summarize() function in the Hmisc package. load('2008.Rdat...
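A minimal sketch of the split-and-conquer aggregation with foreach on simulated data; the grouping variable and the statistic are assumptions for illustration.

library(foreach)
library(doParallel)
registerDoParallel(cores = 4)

set.seed(1)
df <- data.frame(Month = sample(1:12, 1e6, replace = TRUE),
                 ArrDelay = rnorm(1e6, 10, 5))

# split by group, aggregate each group in parallel, then stack the results
grps <- split(df, df$Month)
agg <- foreach(g = grps, .combine = rbind) %dopar% {
  data.frame(Month = g$Month[1], AvgDelay = mean(g$ArrDelay))
}
agg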
Fitting Lasso with Julia
Julia Code

using RDatasets, DataFrames, GLMNet
data = dataset("MASS", "Boston");
y = array(data[:, 14]);
x = array(data[:, 1:13]);
cv = glmnetcv(x, y);
cv.path.betas[:, indmin(cv.meanloss)];
result = DataFrame();
result[:Vars] = names(data)[1:13];
result[:Beta] = cv.path.betas[:, indmin(cv.meanloss)];
result
# | Row | Vars | Beta |
# ...
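For readers on the R side, a rough equivalent of the Julia fit with glmnet, where lambda.min plays the role of indmin(cv.meanloss); this is a sketch, not the post's code.

library(MASS)
library(glmnet)

x <- as.matrix(Boston[, 1:13])
y <- Boston$medv

# cross-validated lasso path; pick the lambda with the smallest CV loss
cv <- cv.glmnet(x, y, alpha = 1)
coef(cv, s = "lambda.min")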
Estimating a Beta Regression with The Variable Dispersion in R
pkgs <- c('sas7bdat', 'betareg', 'lmtest')
lapply(pkgs, require, character.only = T)
df1 <- read.sas7bdat("lgd.sas7bdat")
df2 <- df1[which(df1$y < 1), ]
xvar <- paste("x", 1:7, sep = '', collapse = " + ")
fml1 <- as.formula(paste("y ~ ", xvar))
fml2 <- as.formula(paste("y ~ ", xvar, "|", xvar))
# FIT A BETA MODEL WITH THE FIXED PHI
beta1 <- be...
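The lgd data above is the post's own file, so here is a minimal stand-in illustrating the fixed-phi versus variable-phi comparison on betareg's built-in GasolineYield data.

library(betareg)
library(lmtest)
data("GasolineYield", package = "betareg")

# fixed dispersion: phi is a single constant
beta1 <- betareg(yield ~ batch + temp, data = GasolineYield)

# variable dispersion: the term after "|" models phi
beta2 <- betareg(yield ~ batch + temp | temp, data = GasolineYield)

# likelihood ratio test of whether the variable dispersion improves the fit
lrtest(beta1, beta2)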
Model Segmentation with Recursive Partitioning
library(party)
df1 <- read.csv("credit_count.csv")
df2 <- df1[df1$CARDHLDR == 1, ]
mdl <- mob(DEFAULT ~ MAJORDRG + MINORDRG + INCOME + OWNRENT | AGE + SELFEMPL,
           data = df2, family = binomial(),
           control = mob_control(minsplit = 1000), model = glinearModel)
print(mdl)
#1) AGE <= 22.91667; criterion = 1, statistic = 48.255
# 2)* weights = 1116 ...
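The credit data is also the post's own file; below is a self-contained sketch of the same mob() idea on the public PimaIndiansDiabetes data (the minsplit value is an assumption).

library(party)
library(mlbench)
data("PimaIndiansDiabetes", package = "mlbench")

# logistic regression of diabetes on glucose, with the tree splitting on the other covariates
mdl <- mob(diabetes ~ glucose | pregnant + pressure + triceps + insulin +
             mass + pedigree + age,
           data = PimaIndiansDiabetes, family = binomial(),
           control = mob_control(minsplit = 50), model = glinearModel)
print(mdl)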
Flexible Beta Modeling
library(betareg)
library(sas7bdat)
df1 <- read.sas7bdat('lgd.sas7bdat')
df2 <- df1[df1$y < 1, ]
fml <- as.formula('y ~ x2 + x3 + x4 + x5 + x6 | x3 + x4 | x1 + x2')
### LATENT-CLASS BETA REGRESSION: AIC = -565 ###
mdl1 <- betamix(fml, data = df2, k = 2,
                FLXcontrol = list(iter.max = 500, minprior = 0.1))
print(mdl1)
#betamix(formula = fml, data ...
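A minimal latent-class sketch on betareg's public ReadingSkills data, since the lgd file isn't distributed with the post; the two-component formula here is only an illustration.

library(betareg)
data("ReadingSkills", package = "betareg")

# two-component latent-class beta regression of accuracy on iq
mdl <- betamix(accuracy ~ iq, data = ReadingSkills, k = 2,
               FLXcontrol = list(iter.max = 500, minprior = 0.1))
print(mdl)
summary(mdl)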
Query Pandas DataFrame with SQL
Similar to the SQLDF package, which provides a seamless interface between SQL statements and R data.frames, PANDASQL allows Python users to query Pandas DataFrames with SQL. Below are some examples showing how to use PANDASQL for SELECT / AGGREGATE / JOIN operations. More information is available on GitHub (https://github.com/yhat/pandasql). In ...
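For comparison on the R side, the SQLDF interface mentioned above works roughly as in the sketch below; the data.frame and queries are made up for illustration.

library(sqldf)

df <- data.frame(id = 1:6,
                 grp = c('a', 'a', 'b', 'b', 'c', 'c'),
                 x = c(10, 20, 30, 40, 50, 60))
lkp <- data.frame(grp = c('a', 'b', 'c'), label = c('alpha', 'beta', 'gamma'))

# SELECT
sqldf("select * from df where grp = 'a'")

# AGGREGATE
sqldf("select grp, avg(x) as avg_x from df group by grp")

# JOIN
sqldf("select d.*, l.label from df as d inner join lkp as l on d.grp = l.grp")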
Download Federal Reserve Economic Data (FRED) with Python
In operational loss calculations, it is important to use the CPI (Consumer Price Index) to adjust historical losses. Below is an example showing how to download CPI data directly from the Federal Reserve Bank of St. Louis and then calculate monthly and quarterly CPI adjustment factors with Python. In [1]: import pandas_datareader.data as web...
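An analogous pull in R with quantmod, grabbing the CPIAUCSL series from FRED and deriving a simple adjustment factor; the exact factor definition here is an assumption about what the post computes.

library(quantmod)

# monthly CPI for all urban consumers, straight from FRED
getSymbols("CPIAUCSL", src = "FRED")

# adjustment factor: latest CPI divided by the CPI of each historical month
adj <- as.numeric(last(CPIAUCSL)) / CPIAUCSL
tail(adj)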
Model Segmentation with Cubist
Cubist is a tree-based model with an OLS regression attached to each terminal node and is somewhat similar to the mob() function in the party package (https://statcompute.wordpress.com/2014/10/26/model-segmentation-with-recursive-partitioning). Below is a demonstration of the cubist() model with the classic Boston housing data. pkgs <- c('MASS', 'Cubist', '...
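A minimal sketch of that demonstration with the Cubist package on the Boston data; the committees setting and the fit statistic shown are assumptions, not necessarily the post's.

library(MASS)
library(Cubist)

# medv regressed on the remaining 13 predictors
mdl <- cubist(x = Boston[, -14], y = Boston$medv, committees = 1)
summary(mdl)

# in-sample predictions and a quick R-squared check
pred <- predict(mdl, Boston[, -14])
cor(pred, Boston$medv)^2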