Publications by statcompute
Data Import Efficiency – A Case in R
Below is an R snippet comparing the data import efficiency of CSV, SQLite, and HDF5. Similar to the Python case posted yesterday, HDF5 shows the highest efficiency.
> library(RSQLite)
Loading required package: DBI
> library(rhdf5)
> df <- read.csv('credit_count.csv')
> do.call(cat, list(nrow(df), ncol(df), '\n'))
13444 14
>
> ...
Aggregation by Group in R
> df <- read.csv('credit_count.csv')
>
> # METHOD 1: USING AGGREGATE()
> summ1 <- aggregate(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], mean)
> print(summ1)
  SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        1       0 2742.247 0.06896552
3        0       1 2881.201 0.06293210
4        1       1 3487...
More about Aggregation by Group in R
Motivated by my young friend, HongMing Song, I managed to find more handy ways to calculate aggregated statistics by group in R. They require loading additional packages, plyr, doBy, Hmisc, and gdata, and are extremely user-friendly. In terms of CPU time, while the method with summarize() is as efficient as the 2nd method with by() introduced yes...
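As a baseline for this kind of comparison, here is a minimal sketch that times two base-R ways of computing group means. The data are simulated; only the column names (SELFEMPL, OWNRENT, INCOME, BAD) are taken from the posts. The split() approach is an illustrative stand-in and not one of the package-based methods from the post.

```r
# Simulated stand-in for credit_count.csv (assumption: only the column names match)
set.seed(1)
n <- 1000
df <- data.frame(
  SELFEMPL = rbinom(n, 1, 0.1),
  OWNRENT  = rbinom(n, 1, 0.5),
  INCOME   = rlnorm(n, meanlog = 7.5, sdlog = 0.5),
  BAD      = rbinom(n, 1, 0.08)
)

# Method A: aggregate() from base R, repeated 10 times for timing
t_aggregate <- system.time(for (i in 1:10) {
  summ_base <- aggregate(df[c("INCOME", "BAD")],
                         df[c("SELFEMPL", "OWNRENT")], mean)
})

# Method B: split the rows by group, then take column means of each piece
t_split <- system.time(for (i in 1:10) {
  groups <- split(df[c("INCOME", "BAD")],
                  df[c("SELFEMPL", "OWNRENT")], drop = TRUE)
  summ_split <- t(sapply(groups, colMeans))
})

# Elapsed seconds for each method
print(c(aggregate = t_aggregate[["elapsed"]], split = t_split[["elapsed"]]))
```

Both methods order the groups with the first grouping variable varying fastest, so the two results can be compared row by row.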
Surprising Performance of data.table in Data Aggregation
data.table (http://datatable.r-forge.r-project.org/) inherits from data.frame and provides fast subsetting, fast grouping, and fast joins. Previous posts showed that the shortest CPU time to aggregate a data.frame with 13,444 rows and 14 columns 10 times is 0.236 seconds with summarize() in the Hmisc package. However, after ...
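A minimal sketch of grouped aggregation with data.table next to aggregate(), assuming simulated data with the same row count and column names as in the posts. The data.table part runs only if the package is installed, and timings will differ from the figures quoted in the post.

```r
# Simulated data (assumption: same row count and column names as the post, values made up)
set.seed(1)
n <- 13444
df <- data.frame(
  SELFEMPL = rbinom(n, 1, 0.05),
  OWNRENT  = rbinom(n, 1, 0.5),
  INCOME   = rlnorm(n, meanlog = 7.5, sdlog = 0.5),
  BAD      = rbinom(n, 1, 0.08)
)

# Base R reference: aggregate() repeated 10 times
t_base <- system.time(for (i in 1:10) {
  summ_base <- aggregate(df[c("INCOME", "BAD")],
                         df[c("SELFEMPL", "OWNRENT")], mean)
})

# data.table version: group means computed inside [ ] with by =
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(df)
  t_dt <- system.time(for (i in 1:10) {
    summ_dt <- dt[, list(INCOME = mean(INCOME), BAD = mean(BAD)),
                  by = list(SELFEMPL, OWNRENT)]
  })
  # Elapsed seconds for each method
  print(c(aggregate = t_base[["elapsed"]], data.table = t_dt[["elapsed"]]))
}
```

Note that the conversion with as.data.table() is done once, outside the timed loop, so the comparison covers the aggregation step only.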
Modeling in R with Log Likelihood Function
Similar to the NLMIXED procedure in SAS, optim() in R can estimate a model from an explicitly specified log likelihood function. Below is a demo showing how to estimate a Poisson model with optim() and how the result compares with glm().
> df <- read.csv('credit_count.csv')
> # ESTIMATE A POISSON MODEL WITH GLM()
> mdl <- glm(M...
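The same idea can be sketched on simulated data, since credit_count.csv is not available here. The variable names (x1, x2, y) and the true coefficients are assumptions made for illustration and are not taken from the original demo.

```r
# Simulated count data with a known log-linear mean (hypothetical coefficients)
set.seed(2013)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.5 + 0.3 * x1 - 0.2 * x2))
X  <- cbind(1, x1, x2)  # design matrix with an intercept column

# Negative log likelihood of a Poisson model with a log link;
# optim() minimizes, so the log likelihood is negated
negll <- function(beta) {
  mu <- exp(drop(X %*% beta))
  -sum(dpois(y, lambda = mu, log = TRUE))
}

# Estimate by direct optimization, starting from all-zero coefficients
fit_optim <- optim(par = rep(0, ncol(X)), fn = negll, method = "BFGS")

# Reference estimate from glm()
fit_glm <- glm(y ~ x1 + x2, family = poisson)

# Side-by-side comparison of the two sets of coefficients
print(cbind(optim = fit_optim$par, glm = coef(fit_glm)))
```

The two columns should agree closely, because glm() maximizes the same likelihood by iteratively reweighted least squares.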
Efficiency of Extracting Rows from A Data Frame in R
In the example below, 552 rows are extracted from a data frame with 10 million rows using six different methods. The results show a significant disparity in CPU time between the least and the most efficient methods. Similar to the finding in my previous post, the method using the data.table package is the most efficient solution with 0.64s CPU ti...
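A minimal sketch of this kind of timing comparison, using a smaller simulated data frame (1 million rows) and hypothetical column names id, grp, and val. It covers three base-R methods plus an optional keyed data.table lookup; these are not necessarily the six methods from the post.

```r
# Simulated data frame (assumption: column names and sizes are illustrative)
set.seed(1)
n  <- 1e6
df <- data.frame(id  = seq_len(n),
                 grp = sample.int(2000, n, replace = TRUE),
                 val = rnorm(n))

# Method 1: logical indexing
t_logical <- system.time(r1 <- df[df$grp == 7, ])

# Method 2: integer positions from which()
t_which <- system.time(r2 <- df[which(df$grp == 7), ])

# Method 3: subset()
t_subset <- system.time(r3 <- subset(df, grp == 7))

times <- c(logical = t_logical[["elapsed"]],
           which   = t_which[["elapsed"]],
           subset  = t_subset[["elapsed"]])

# Optional: keyed lookup with data.table, if the package is installed.
# Setting the key sorts the table once; the lookup is then a binary search.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(df)
  setkey(dt, grp)
  t_dt <- system.time(r4 <- dt[list(7L)])
  times <- c(times, data.table = t_dt[["elapsed"]])
}

print(times)
```

The keyed lookup is timed after setkey(), so the one-off cost of sorting the table is not included in its figure.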
PART – A Rule-Learning Algorithm
> require('RWeka')
> require('pROC')
>
> # SEPARATE DATA INTO TRAINING AND TESTING SETS
> df1 <- read.csv('credit_count.csv')
> df2 <- df1[df1$CARDHLDR == 1, 2:12]
> set.seed(2013)
> rows <- sample(1:nrow(df2), nrow(df2) - 1000)
> set1 <- df2[rows, ]
> set2 <- df2[-rows, ]
>
> # BUILD A PART RULE MODEL
> mdl1 <- PART(factor(BAD) ~., data = set1...
Efficiency in Joining Two Data Frames
In R, there are multiple ways to merge two data frames, but their efficiency can differ hugely. It is therefore worthwhile to test the performance of the different methods and choose the right approach for real-world work. For smaller data frames with 1,000 rows, all six methods shown below seem to work pretty well ex...
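To illustrate how two join methods can be checked against each other and timed, here is a minimal base-R sketch on simulated data frames with 1,000 rows. The column names id, x, and y are hypothetical, and the match()-based lookup is an illustrative alternative that assumes unique keys; it is not necessarily one of the six methods from the post.

```r
# Two simulated data frames sharing a unique key column (hypothetical names)
set.seed(1)
n     <- 1000
left  <- data.frame(id = sample.int(n), x = rnorm(n))
right <- data.frame(id = sample.int(n), y = rnorm(n))

# Method 1: merge(), which sorts the result by the key
t_merge <- system.time(for (i in 1:10) {
  m1 <- merge(left, right, by = "id")
})

# Method 2: look up matching row positions with match(), then bind columns.
# This assumes each id occurs exactly once in both data frames.
t_match <- system.time(for (i in 1:10) {
  idx <- match(left$id, right$id)
  m2  <- data.frame(left, y = right$y[idx])
  m2  <- m2[order(m2$id), ]
})

# Elapsed seconds for 10 replications of each method
print(c(merge = t_merge[["elapsed"]], match = t_match[["elapsed"]]))
```

Checking that both methods return the same rows before comparing timings guards against benchmarking two different results.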
Another Benchmark for Joining Two Data Frames
In my post yesterday comparing efficiency in joining two data frames, I overlooked the computing cost of converting data.frames to data.tables / ff data objects. Today, I ran the test again, taking library loading and data conversion into account. After 10 replications with the rbenchmark package, the joining method with data.table i...
A Grid Search for The Optimal Setting in Feed-Forward Neural Networks
The feed-forward neural network is a very powerful classification model in the machine learning context. Since the goodness-of-fit of a neural network is largely determined by the model complexity, it is very tempting for a modeler to over-parameterize the neural network by using too many hidden layers and/or hidden units. As pointed out by Brian...
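A minimal sketch of a grid search over model complexity, assuming the nnet package (a single-hidden-layer network) and simulated two-class data. The grid values, the variable names, and the use of weight decay as a second tuning parameter are assumptions for illustration, not the settings from the post.

```r
library(nnet)

# Simulated two-class problem with a nonlinear boundary (hypothetical data)
set.seed(2013)
n   <- 600
x1  <- runif(n, -2, 2)
x2  <- runif(n, -2, 2)
y   <- factor(as.integer(x1^2 + x2^2 + rnorm(n, sd = 0.5) < 2))
dat <- data.frame(x1, x2, y)

# Hold out 200 rows to measure out-of-sample error
train <- sample.int(n, 400)

# Candidate settings: number of hidden units and weight decay
grid <- expand.grid(size = c(1, 2, 4, 8), decay = c(0, 0.01, 0.1))
grid$error <- NA_real_

# Fit one network per setting and record the holdout misclassification rate
for (i in seq_len(nrow(grid))) {
  fit <- nnet(y ~ x1 + x2, data = dat[train, ],
              size = grid$size[i], decay = grid$decay[i],
              maxit = 200, trace = FALSE)
  pred <- predict(fit, dat[-train, ], type = "class")
  grid$error[i] <- mean(pred != dat$y[-train])
}

# Setting with the lowest holdout error
best <- grid[which.min(grid$error), ]
print(best)
```

Scoring each setting on held-out rows rather than on the training rows is what keeps the search from simply rewarding the largest network.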