Publications by Tony Breyal

htmlToText(): Extracting Text from HTML via XPath

18.11.2011

Converting HTML to plain text usually involves stripping out the HTML tags whilst preserving the most basic of formatting. I wrote a function to do this which works as follows (code can be found on GitHub): # load packages library(RCurl) library(XML) # assign input (could be an HTML file, a URL, HTML text, or some combination of all three is the...

3733 sym R (3937 sym/5 pcs) 6 img

source_https(): Sourcing an R Script from github over HTTPS

24.11.2011

The Objective I wanted to source R scripts hosted on my GitHub repository for use in my blog (i.e. a GitHub version of ?source). This would make it easier for anyone wishing to test out my code snippets on their own computers without having to manually go to my GitHub repo and retrieve a series of R scripts themselves to make it run. The Proble...

2381 sym R (1643 sym/3 pcs) 6 img

outersect(): The opposite of R’s intersect() function

29.11.2011

The Objective To find the non-duplicated elements between two or more vectors (i.e. the yellow sections of the diagram above). The Problem I needed the opposite of R’s intersect() function, an “outersect()”. The closest I found was setdiff() but the order of the input vectors produces different results, e.g. x = letters[1:3] #[1] "a" "b"...

1377 sym R (677 sym/4 pcs) 6 img
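
The order dependence of setdiff() mentioned in the outersect() excerpt above can be illustrated with a minimal sketch. This is not necessarily the function from the post: the name outersect_sketch and the vectors a and b are hypothetical, and only the two-vector case is covered.

```r
# Hypothetical sketch of an "outersect": the elements that appear in
# exactly one of two vectors (the symmetric difference).
outersect_sketch <- function(x, y) {
  # elements only in x, followed by elements only in y
  c(setdiff(x, y), setdiff(y, x))
}

a <- c("a", "b", "c")
b <- c("b", "c", "d")

setdiff(a, b)           # "a"      (only what is in a but not in b)
setdiff(b, a)           # "d"      (swapping the arguments changes the result)
outersect_sketch(a, b)  # "a" "d"
outersect_sketch(b, a)  # "d" "a"  (same elements, different order)
```

Combining both setdiff() calls makes the set of returned elements independent of argument order, although the ordering of the result still follows the inputs.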

Installing Rcpp on Windows 7 for R and C++ integration

07.12.2011

Introduction Romain Francois presented an Rcpp solution on his blog to an old r-wiki optimisation challenge which I had also presented R solutions for previously on my blog. The Rcpp package provides a method for integrating R and C++. This allows for faster execution of an R project by recoding the slower R parts into C++ and thus providin...

6544 sym R (400 sym/1 pcs) 6 img

Code Optimization: One R Problem, Thirteen Solutions – Now Sixteen!

08.12.2011

Introduction The old r-wiki optimisation challenge describes a string generation problem which I have blogged about previously both here and here. The Objective To code the most efficient algorithm, using R, to produce a sequence of strings based on a single integer input, e.g.: # n = 4 [1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" ...

3644 sym R (1930 sym/6 pcs) 6 img
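
As a rough illustration of the string generation problem in the excerpt above, here is a base R sketch. It assumes the pattern is "every pair i < j drawn from 1..n, zero-padded to three digits", which matches the five strings visible in the truncated output but is not confirmed by the excerpt. The name pair_labels is hypothetical, and this is not one of the post's sixteen solutions nor a claim about which is fastest.

```r
# Hypothetical sketch: label every pair (i, j) with i < j from 1..n.
# Assumption: the target pattern is "i<iii>.<jjj>" with 3-digit zero padding.
pair_labels <- function(n) {
  if (n < 2) return(character(0))   # no pairs exist for n < 2
  p <- combn(n, 2)                  # 2-row matrix, one column per pair
  sprintf("i%03d.%03d", p[1, ], p[2, ])
}

res <- pair_labels(4)
head(res, 5)
# [1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004"
```

combn() enumerates the pairs in the same order as the visible part of the expected output, so the first five results line up with the excerpt.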

Unshorten (almost) any URL with R

13.12.2011

Introduction I was asked by a friend how to find the full final address of a URL which had been shortened via a shortening service (e.g., Twitter’s t.co, Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, Ow.ly, etc.). I replied I had no idea and maybe he should have a look over on StackOverflow.com or, possibly, the R-hel...

2031 sym R (1151 sym/3 pcs) 8 img

Plotting Doctor Who Ratings (1963-2011) with R

03.01.2012

Introduction First day back to work after New Year celebrations and my brain doesn’t really want to think too much. So I went out for lunch and had a nice walk in the park. Still had 15 minutes to kill before my lunch break was over and so decided to kill some time with a quick web scraping exercise in R. Objective Download the last 49 years ...

3116 sym R (1809 sym/3 pcs) 8 img

R: Web Scraping R-bloggers Facebook Page

06.01.2012

Introduction R-bloggers.com is a blog aggregator maintained by Tal Galili. It is a great website for both learning about R and keeping up-to-date with the latest developments (because someone will probably, and very kindly, post about the status of some R-related feature). There is also an R-bloggers Facebook page where a number of articles from ...

7185 sym R (2900 sym/4 pcs) 8 img

R: A Quick Scrape of Top Grossing Films from boxofficemojo.com

13.01.2012

Introduction I was looking at a list of the top grossing films of all time (available from boxofficemojo.com) and was wondering what kind of graphs I would come up with if I had that kind of data. I still don’t know what kind of graphs I’d construct other than a simple barplot but figured I’d at least get the basics done and then if I feel ...

1424 sym R (2879 sym/3 pcs) 8 img

R: Stem (Pre-Processed) Text Blocks

24.08.2014

Objective I recently needed to stem every word in a block of text, i.e. reduce each word to a root form. Problem The stemmer I was using would only stem the last word in each block of text, e.g. require(SnowballC) wordStem('walk walks walked walking walker walkers', language = 'en') # [1] 'walk walks walked walking walker walk' Solution I wrote ...

893 sym R (1051 sym/3 pcs) 4 img
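
One way to get around the behaviour described in the stemming excerpt above is to split the block into words, stem each word separately, and paste the result back together. The sketch below is an assumption about the general approach, not the function from the post: stem_block and strip_s are hypothetical names, and the SnowballC call only runs if that package is installed.

```r
# Hypothetical helper: apply a per-word stemmer to every word in a block.
stem_block <- function(text, stemmer) {
  words <- strsplit(text, "[[:space:]]+")[[1]]  # split on runs of whitespace
  words <- words[nzchar(words)]                 # drop empty pieces
  paste(stemmer(words), collapse = " ")         # stem word by word, rejoin
}

# Toy stemmer for illustration only: strips a single trailing "s".
strip_s <- function(w) sub("s$", "", w)
stem_block("walk walks walkers", strip_s)
# [1] "walk walk walker"

# With SnowballC, if available: wordStem() is vectorised over words,
# so passing the split words stems each one rather than only the last.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  stem_block("walk walks walked walking walker walkers",
             function(w) SnowballC::wordStem(w, language = "en"))
}
```

Passing the stemmer in as an argument keeps the splitting and rejoining logic independent of any one stemming package.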