Publications by David Smith

Sentiment analysis finds trouble in the Enron emails

24.05.2013

The Enron email dataset, collected during the FERC investigation of the Enron financial scandal, represents the largest publicly available set of emails. This makes them an ideal testbed for sentiment analysis algorithms. Ikanow's Andrew Strite used the open-source Infinit.e framework and a Hadoop cluster to generate sentiment scores for all of t...

1443 sym 2 img

Companies using Open Source R in 2013

28.05.2013

A recent Quora post asked an interesting question: What interesting companies, open source projects are using R in 2013? Since we've been tracking applications of R here on the blog for a while now, I listed just some of the most recent examples: The New York Times routinely uses R for interactive and print data visualization. Google has more t...

1490 sym

Two forthcoming R books

29.05.2013

Today I learned about two forthcoming R books that I'm now looking forward to. The first is Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Max Kuhn is the author of the caret package, an extremely useful and powerful R package for fitting and optimizing all kinds of predictive models in R. It's available now on Amazon Kindle and wi...

1122 sym 4 img

The arteries of the world, in Tweets

31.05.2013

What happens when you plot billions of geotagged Tweets on a map? You can see the arteries of the world. Here's Europe: According to creator Miguel Rios (Engineering Manager, Data Visualization at Twitter), the dots on this chart represent every geotagged Tweet since 2009. The color represents number of tweets in the region, and the intensity ...

1142 sym 2 img

How to set up a reproducible R project

03.06.2013

If you're thinking about starting a project (for example, a report or paper) using the R language for analysis, the Nice R code blog has some great advice. Following the principles of reproducible research, Macquarie University postdocs Rich FitzJohn and Daniel Falster suggest: Creating a directory structure to separate R code, data, report...

1268 sym

KDNuggets 2013 software poll results

05.06.2013

The results of the 2013 KDNuggets software poll are in, with RapidMiner and R in a near-tie for first place. Of a record 1880 respondents, 737 reported using Rapid-I RapidMiner/RapidAnalytics, and 704 reported using R. Excel came in third: with 527 respondents, it was the lone commercial tool in the top 5. You can see the top 10 responses in the...

1189 sym 2 img

Crayfish or crawdad? Mapping US dialect variations with R

07.06.2013

I grew up in Australia, where I learned to speak English. Or so I thought: when I moved overseas to the UK, and especially when I moved to the States, I soon learned these are distinct cultures separated by a common language. Words which I previously had no context for being different anywhere else, such as “runners” (“sneakers”), “lemo...

2609 sym 6 img

In case you missed it: May 2013 Roundup

10.06.2013

In case you missed them, here are some articles from May of particular interest to R users: Billions of geotagged Tweets create a beautiful map of the world when plotted with the ggmap package. A review of Ryan Sheftel's talk at R/Finance, on how he uses R on the trading desk at Credit Suisse. Also, a quick take on some other talks at the meeti...

2993 sym

Thursday: Webinar on video game analytics

11.06.2013

Video games are big business today: Electronic Arts (EA) generated more than 4 billion dollars in revenue last year, and they're not even the biggest player on the block. In addition to big bucks, video games also generate Big Data: 50 terabytes per day at EA alone. So there's an obvious need to apply predictive analytics to these massive strea...

2149 sym

New book: Seamless R and C++ Integration with Rcpp

11.06.2013

A new book from Dirk Eddelbuettel, co-author of the Rcpp package, is now available. Seamless R and C++ Integration with Rcpp can be ordered from Springer and from Amazon. The book provides the first comprehensive introduction to Rcpp, the R package that makes it easy to integrate C++ code with R and speed up R code. If you haven't come across Rc...

899 sym 2 img