Publications by Allan Engelhardt
R tips: Installing Rmpi on Fedora Linux
Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform. Since it is unusually painful to get working, I might as well copy the instructions here. 1. In...
4900 sym R (2279 sym/8 pcs) 24 img
R tips: Determine if function is called from specific package
I like the “multicore” library for a particular task. I can easily write a combination of if(require("multicore",...)) that means that my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is called from mclapply() (or one of the lower le...
1864 sym R (552 sym/2 pcs) 18 img
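The fallback pattern described in the entry above can be sketched roughly as follows. This is a minimal illustration, not the author's code: it assumes the base `parallel` package (which now provides `mclapply()`) in place of the older `multicore` package, and `par_lapply` is a hypothetical helper name. It does not address the nested-call problem the post goes on to discuss.

```r
# Hypothetical helper: apply FUN over X in parallel where that is possible,
# and fall back to plain lapply() otherwise.
# Assumption: the 'parallel' package stands in for 'multicore'.
par_lapply <- function(X, FUN, ...) {
  use_parallel <- requireNamespace("parallel", quietly = TRUE) &&
    .Platform$OS.type != "windows"  # mclapply() relies on forking
  if (use_parallel) {
    parallel::mclapply(X, FUN, ..., mc.cores = 2L)
  } else {
    lapply(X, FUN, ...)
  }
}

squares <- par_lapply(1:4, function(i) i^2)
```

Callers use `par_lapply()` exactly as they would `lapply()`, so the choice of backend stays in one place.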
OECD Statistics
I am a sucker for good-quality data. I wrote about data.gov, the US Government data site, before, and now I find OECD Statistics, which has some 300 data sets, many of which seem to be readily accessible (though some may require a subscription). It exports in multiple formats, including Excel, CSV, and SDMX.
726 sym 14 img
The Knapsack Problem
David posts a question about how to solve this knapsack problem using the R statistical computing and analysis platform. My reply in the comments seems to have disappeared for a while so here is my proposed solution. See David’s blog for my earlier proposed solution with a very common error. ## http://blog.revolution-computing.com/2009/07/...
1190 sym R (592 sym/1 pcs) 18 img
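For readers unfamiliar with the problem class, here is a minimal sketch of the textbook 0/1 knapsack solved by dynamic programming. It is an assumption that this variant resembles David's question; the function name `knapsack` and the item weights and values are invented purely for illustration and are not taken from either post.

```r
# Textbook 0/1 knapsack via dynamic programming (integer weights assumed).
knapsack <- function(weights, values, capacity) {
  n <- length(weights)
  # best[i + 1, w + 1]: best value using the first i items within capacity w
  best <- matrix(0, n + 1, capacity + 1)
  for (i in seq_len(n)) {
    for (w in 0:capacity) {
      best[i + 1, w + 1] <- best[i, w + 1]          # skip item i
      if (weights[i] <= w) {
        take <- best[i, w - weights[i] + 1] + values[i]
        if (take > best[i + 1, w + 1]) best[i + 1, w + 1] <- take
      }
    }
  }
  # Walk back through the table to recover which items were taken.
  chosen <- integer(0)
  w <- capacity
  for (i in rev(seq_len(n))) {
    if (best[i + 1, w + 1] != best[i, w + 1]) {
      chosen <- c(i, chosen)
      w <- w - weights[i]
    }
  }
  list(value = best[n + 1, capacity + 1], items = chosen)
}

# Made-up example data
res <- knapsack(weights = c(1, 3, 4, 5), values = c(1, 4, 5, 7), capacity = 7)
```

With this made-up data, the second and third items fill the capacity exactly and give the best total value.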
Massively parallel database for analytics
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end. But much more than a theoretical discussion, they have built a solution which they call HadoopDB. It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source. Alternative, column-based, backends ...
966 sym 14 img
Beautiful Data
O’Reilly’s recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download. I met Jeff a couple of years ago at an ETech conference, and he is easily one of the smartest people I have ever met who is thinking about data. ...
1121 sym 20 img
R: Eliminating observed values with zero variance
I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a ...
3843 sym R (1659 sym/3 pcs) 24 img
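As a point of reference for the entry above, a straightforward (and not necessarily fast) way to find zero-variance columns might look like the sketch below. `zero_var_cols` is a hypothetical helper name and the data frame is made up; this is a baseline illustration, not the approach the post settles on.

```r
# Hypothetical helper: names of numeric columns whose values never vary.
zero_var_cols <- function(df) {
  is_const <- vapply(df, function(x) {
    # isTRUE() guards against NA, e.g. when a column has a single value
    is.numeric(x) && isTRUE(var(x, na.rm = TRUE) == 0)
  }, logical(1))
  names(df)[is_const]
}

# Made-up example data
d <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3), c = c(5, 5, 5))
constant <- zero_var_cols(d)
```

Dropping the columns is then a matter of `d[, setdiff(names(d), constant), drop = FALSE]`.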
Faster R through better BLAS
Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection. But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library....
5397 sym R (784 sym/6 pcs) 24 img 1 tbl
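One rough way to see whether the linked linear algebra library matters on a given machine is to time a dense matrix operation, as in the sketch below. The matrix size and the choice of `crossprod()` are arbitrary assumptions for illustration and are not the benchmarks used in the post.

```r
# Time a dense matrix cross-product; this work is delegated to the BLAS.
set.seed(1)
n <- 500                                # arbitrary size, chosen for illustration
a <- matrix(rnorm(n * n), n, n)
timing <- system.time(b <- crossprod(a))  # computes t(a) %*% a
elapsed <- timing[["elapsed"]]          # wall-clock seconds for the product
```

Running the same script under two differently linked builds of R and comparing `elapsed` gives a crude indication of the difference; recent versions of R also report the BLAS in use in the output of `sessionInfo()`.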
Comparing standard R with Revolutions for performance
Following on from my previous post about improving performance of R by linking with optimized linear algebra libraries, I thought it would be useful to try out the five benchmarks Revolution Analytics have on their Revolutionary Performance pages. For convenience I collected their tests into a single script revolution_benchmark.R that I can simp...
2800 sym 20 img 1 tbl
Employee productivity as function of number of workers revisited
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is mildly scary. Let’s try the FTSE-100 index of leading UK companies to see if they are significantly different from the S&P 500...
4460 sym R (3485 sym/1 pcs) 30 img
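The "treble the workers, halve the productivity" finding quoted above can be turned into a scaling exponent if one assumes a power-law relationship, productivity ∝ workers^k. That functional form is an assumption made here for illustration; the sketch below only works through the arithmetic it implies.

```r
# Assumed model: productivity is proportional to workers^k.
# Trebling the workers halves productivity, so 3^k = 1/2.
k <- log(1 / 2) / log(3)   # roughly -0.63

# Relative individual productivity after scaling headcount by 'ratio'
relative_productivity <- function(ratio) ratio^k

p3 <- relative_productivity(3)   # trebling: about one half
p9 <- relative_productivity(9)   # trebling twice: about one quarter
```

Under this assumed model, a tenfold increase in headcount would leave each worker at a little under a quarter of the original productivity (`relative_productivity(10)`).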