Publications by statcompute

Data Import Efficiency – A Case in R

23.12.2012

Below is an R snippet comparing data import efficiency among CSV, SQLite, and HDF5. As in the Python case posted yesterday, HDF5 shows the highest efficiency.

> library(RSQLite)
Loading required package: DBI
> library(rhdf5)
> df <- read.csv('credit_count.csv')
> do.call(cat, list(nrow(df), ncol(df), '\n'))
13444 14
>
> ...
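A minimal base-R sketch of the timing approach used in the post. Since RSQLite and rhdf5 are add-on packages, R's native RDS format stands in for SQLite/HDF5 here; the data frame and file names are illustrative, not from the original post.

```r
# Write the same data frame to CSV and RDS, then time 10 repeated imports
set.seed(1)
df <- data.frame(id = 1:1e5, x = rnorm(1e5), y = runif(1e5))
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")
write.csv(df, csv_file, row.names = FALSE)
saveRDS(df, rds_file)

t_csv <- system.time(for (i in 1:10) invisible(read.csv(csv_file)))["elapsed"]
t_rds <- system.time(for (i in 1:10) invisible(readRDS(rds_file)))["elapsed"]
cat(sprintf("csv: %.2fs  rds: %.2fs\n", t_csv, t_rds))
```

The binary format typically wins because it skips text parsing entirely, which is the same reason HDF5 beats CSV in the post.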


Aggregation by Group in R

23.12.2012

> df <- read.csv('credit_count.csv')
>
> # METHOD 1: USING AGGREGATE()
> summ1 <- aggregate(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], mean)
> print(summ1)
  SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        1       0 2742.247 0.06896552
3        0       1 2881.201 0.06293210
4        1       1 3487...
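The same aggregate() pattern can be run against R's built-in mtcars data, which stands in here for credit_count.csv: mean mpg and hp by cylinder count and transmission type.

```r
# Group means of two measure columns by two grouping columns
summ <- aggregate(mtcars[c("mpg", "hp")], mtcars[c("cyl", "am")], mean)
print(summ)
```

The first argument selects the columns to summarize, the second the grouping columns; the result is one row per observed group combination.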


More about Aggregation by Group in R

24.12.2012

Motivated by my young friend, HongMing Song, I managed to find more handy ways to calculate aggregated statistics by group in R. They require loading additional packages (plyr, doBy, Hmisc, and gdata) and are extremely user-friendly. In terms of CPU time, while the method with summarize() is as efficient as the 2nd method with by() introduced yes...
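For reference, the base-R counterparts of these grouped summaries can be sketched without any add-on packages, using built-in mtcars in place of credit_count.csv: tapply() for a single column and by() for several columns at once.

```r
# tapply(): one summarized column, groups laid out as a cyl-by-am matrix
m <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)
print(m)

# by(): apply colMeans to several columns within each group
s <- by(mtcars[c("mpg", "hp")], mtcars[c("cyl", "am")], colMeans)
```

The package-based helpers in the post mostly wrap this split-apply-combine pattern with friendlier output formats.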


Surprising Performance of data.table in Data Aggregation

28.12.2012

data.table (http://datatable.r-forge.r-project.org/) inherits from data.frame and provides fast subsetting, fast grouping, and fast joins. Previous posts showed that the shortest CPU time to aggregate a data.frame with 13,444 rows and 14 columns 10 times is 0.236 seconds, with summarize() in the Hmisc package. However, after ...
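The grouped-aggregation idiom being benchmarked looks roughly like the sketch below. data.table is a third-party package, so the call is guarded in case it is not installed; mtcars and the chosen columns are stand-ins for the post's data.

```r
# data.table grouping: compute several group means in one grouped pass
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(mtcars)
  print(dt[, .(mpg = mean(mpg), hp = mean(hp)), by = .(cyl, am)])
}
```

Grouping is fast largely because data.table computes group indices once and evaluates the j-expression within each group, avoiding repeated data-frame subsetting.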


Modeling in R with Log Likelihood Function

30.12.2012

Similar to the NLMIXED procedure in SAS, optim() in R provides the functionality to estimate a model by specifying the log likelihood function explicitly. Below is a demo showing how to estimate a Poisson model with optim() and compare it against the glm() result.

> df <- read.csv('credit_count.csv')
> # ESTIMATE A POISSON MODEL WITH GLM()
> mdl <- glm(M...
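A self-contained reproduction of the technique, with simulated data standing in for credit_count.csv: write the negative Poisson log likelihood as an R function, minimize it with optim(), and compare the estimates against glm().

```r
# Simulate a Poisson outcome with known coefficients (0.5, 0.8)
set.seed(1)
n <- 5000
x <- rnorm(n)
y <- rpois(n, exp(0.5 + 0.8 * x))

# Negative Poisson log likelihood for b = (intercept, slope)
negll <- function(b) {
  mu <- exp(b[1] + b[2] * x)
  -sum(dpois(y, mu, log = TRUE))
}

fit_opt <- optim(c(0, 0), negll, method = "BFGS")
fit_glm <- glm(y ~ x, family = poisson)

# The two estimation routes should agree closely
cbind(optim = fit_opt$par, glm = coef(fit_glm))
```

Because glm() maximizes the same likelihood via IRLS, the optim() estimates should match to several decimal places, which is the comparison the post makes.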


Efficiency of Extracting Rows from A Data Frame in R

01.01.2013

In the example below, 552 rows are extracted from a data frame with 10 million rows using six different methods. Results show a significant disparity between the least and the most efficient methods in terms of CPU time. Similar to the finding in my previous post, the method with data.table package is the most efficient solution with 0.64s CPU ti...
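A base-R sketch of several of the extraction methods being compared, on a smaller simulated frame so it runs quickly (the data, target value, and method selection are illustrative; the data.table route from the post is omitted here):

```r
# One million rows; roughly 10 of them match the target id
set.seed(1)
df <- data.frame(id = sample(1:1e5, 1e6, replace = TRUE), x = rnorm(1e6))
target <- 42

r1 <- df[df$id == target, ]              # logical indexing
r2 <- df[which(df$id == target), ]       # which() skips NA bookkeeping
r3 <- subset(df, id == target)           # subset() convenience wrapper

# Time the which() variant over 10 repetitions, as in the post's setup
system.time(for (i in 1:10) df[which(df$id == target), ])
```

All three return the same rows; the differences the post measures come from how each method builds the row index before subsetting.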


PART – A Rule-Learning Algorithm

11.01.2013

> require('RWeka')
> require('pROC')
>
> # SEPARATE DATA INTO TRAINING AND TESTING SETS
> df1 <- read.csv('credit_count.csv')
> df2 <- df1[df1$CARDHLDR == 1, 2:12]
> set.seed(2013)
> rows <- sample(1:nrow(df2), nrow(df2) - 1000)
> set1 <- df2[rows, ]
> set2 <- df2[-rows, ]
>
> # BUILD A PART RULE MODEL
> mdl1 <- PART(factor(BAD) ~., data = set1...
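The train/test split pattern above can be reproduced in base R with built-in data; since RWeka's PART needs a Java installation, a logistic regression stands in for the rule learner here, and mtcars replaces credit_count.csv.

```r
# Hold out 10 rows for testing, train on the rest
df <- mtcars
set.seed(2013)
rows <- sample(1:nrow(df), nrow(df) - 10)
train <- df[rows, ]
test  <- df[-rows, ]

# Stand-in classifier: logistic model for the binary outcome am
mdl <- glm(am ~ mpg + wt, data = train, family = binomial)
pred <- predict(mdl, test, type = "response")
```

With a PART model the structure is identical: fit on set1, then score set2 and evaluate the held-out predictions (the post uses pROC for the ROC curve).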


Efficiency in Joining Two Data Frames

28.01.2013

In R, there are multiple ways to merge two data frames. However, there can be a huge disparity in efficiency among them, so it is worthwhile to test the performance of different methods and choose the right approach in real-world work. For smaller data frames with 1,000 rows, all six methods shown below seem to work pretty well ex...
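Two of the join strategies such a comparison typically covers can be sketched in base R (simulated 1,000-row frames stand in for the post's data; the other methods rely on add-on packages):

```r
# Two frames sharing a key column
set.seed(1)
d1 <- data.frame(id = 1:1000, x = rnorm(1000))
d2 <- data.frame(id = sample(1:1000), y = rnorm(1000))

j1 <- merge(d1, d2, by = "id")                 # merge(): the canonical way
j2 <- cbind(d1, y = d2$y[match(d1$id, d2$id)]) # match(): hash lookup, often faster
```

merge() sorts and matches both frames, while the match() idiom only builds a lookup from d2's keys into d1's order, which is why it tends to win on one-to-one joins.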


Another Benchmark for Joining Two Data Frames

29.01.2013

In my post yesterday comparing efficiency in joining two data frames, I overlooked the computing cost of converting data.frames to data.tables / ff data objects. Today, I ran the test again, taking library loading and data conversion into account. After 10 replications with the rbenchmark package, the joining method with data.table i...
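The methodological point can be sketched in base R: when a join method needs a preparatory conversion step, time the conversion and the join together, not the join alone. The frames and the stand-in "conversion" (a sort) are illustrative.

```r
# Two simulated frames to join on id
set.seed(1)
d1 <- data.frame(id = sample(1:1e5), x = rnorm(1e5))
d2 <- data.frame(id = sample(1:1e5), y = rnorm(1e5))

# Join only: understates the true cost if d1 must be prepared first
t_join <- system.time(j1 <- merge(d1, d2, by = "id"))["elapsed"]

# Preparation + join: the fair, end-to-end measurement
t_total <- system.time({
  d1s <- d1[order(d1$id), ]   # stand-in for the data.table/ff conversion
  j2 <- merge(d1s, d2, by = "id")
})["elapsed"]
```

For data.table, the analogous prepared step is as.data.table() plus setkey(); including it is exactly the correction this post makes to the previous benchmark.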


A Grid Search for The Optimal Setting in Feed-Forward Neural Networks

03.02.2013

The feed-forward neural network is a very powerful classification model in the machine learning context. Since the goodness-of-fit of a neural network is largely driven by model complexity, it is very tempting for a modeler to over-parameterize the network by using too many hidden layers and/or hidden units. As pointed out by Brian...
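A compact sketch of the grid-search idea: fit a small feed-forward network with nnet (shipped as a recommended R package) over a grid of hidden-unit counts and weight-decay values, keeping the setting with the lowest validation error. The iris data, grid values, and split size are assumptions for illustration, not the post's setup.

```r
library(nnet)

# Hold out 50 rows of iris for validation
set.seed(2013)
idx <- sample(1:nrow(iris), 100)
train <- iris[idx, ]
valid <- iris[-idx, ]

# Grid over network size (hidden units) and weight decay
grid <- expand.grid(size = c(2, 4, 8), decay = c(0, 0.01, 0.1))
err <- sapply(1:nrow(grid), function(i) {
  mdl <- nnet(Species ~ ., data = train, size = grid$size[i],
              decay = grid$decay[i], maxit = 200, trace = FALSE)
  mean(predict(mdl, valid, type = "class") != valid$Species)
})
best <- grid[which.min(err), ]
print(best)
```

Using held-out error (or cross-validation) to pick size and decay is what guards against the over-parameterization the paragraph warns about.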

2159 sym R (2295 sym/1 pcs) 4 img