Publications by inkhorn82

Sampling and the Analysis of Big Data

08.04.2012

After my last post, I came across a few articles supporting the opinion that if you have a good reason to take random samples from a “big” dataset, you’re not committing some kind of sin: “Big Data Blasphemy: Why Sample?” and “To Sample or Not to Sample… Does it Even Matter?” The moral of the story is that you can sample from “big data” so lo...

Fun Editing R Graphs in Inkscape

12.04.2012

Last week, I read a chapter of Visualize This by Nathan Yau. I was, of course, delighted to see that he was championing the use of R. One really cool thing that I learned from his book, and was very surprised by, was that you can export an R graph as a PDF and then easily edit individual elements of the graph in Adobe Illustrator. ...

PostgreSQL, Excel, R, and a Really Big Data Set!

19.04.2012

At work I’ve started working with the biggest data set I’ve ever seen! First, let me qualify my use of the term “Big Data”. The number of rows in the resultant data set (after much transformation and manipulation in PostgreSQL and, to a lesser extent, in Excel) is only 395,928. This will certainly pale in comparison to truly BIG ...

Projects in RStudio

24.04.2012

Now that I have one enormous project on the go and one smaller one, I find it helps considerably to keep each in its own RStudio project. So, each project has its own scripts that I’ve been working on, its own extra variables or data frames that I’ve had to construct for it, and doesn’t take up more of my RAM...

Guess who wins: apply() versus for loops in R

28.04.2012

Yesterday I tried to do some data processing on my really big data set in MS Excel. Wow, did it not like handling all those data!! Every time I tried to click on a different ribbon, the screen didn’t even register the click. So, I took the hint and decided to do my data processing in R. One of the tasks that I needed...

Mining for relations between nominal variables

01.05.2012

The task today was to find which variables had significant relations with an important grouping variable in the big dataset I’ve been working with lately. The grouping variable has 3 levels, and represents different behaviours of interest. At first I tried using the grouping variable as the dependent variable in a multinomial logistic regres...

Ack! Duplicates in the Data!

03.05.2012

As I mentioned in a previous post, I compiled the data set that I’m currently working on in PostgreSQL. To get this massive data set, I had to write a query that was itself massive by dint of the number of LEFT JOINs it needed. Today I caught myself wondering if I had remembered to add DISTINCT to the SELECT clause in my query, as tha...

Memory Management in R, and SOAR

08.05.2012

The more I’ve worked with my really large data set, the more cumbersome the work has become for my work computer. Keep in mind I’ve got a quad core with 8 gigs of RAM. With growing irritation at how slow my work computer becomes at times while working with these data, I took to finding better ways of managing my memory in R. The best/easie...

An embarrassing admission; Copy pasting tables with text containing spaces from Excel to R

11.05.2012

I can’t believe I didn’t learn how to do it earlier, but I never knew how to accurately copy tables from Excel that had text with spaces in them and paste them into a data frame in R without the spaces being read as breaks between variables. Say you have a column title in a table in Excel like “Group Size”. You then copy t...

Functions ddply and melt make plotting summary stats in R more tolerable

15.05.2012

The main reason I have usually chosen Excel to make my plots at work is that I had difficulty feeding summary stats from R into a plotting function. One thing I learned this week is how to turn summary stats into a data frame suitable for plotting, making the whole process of plotting in R more tolerable for me. Below I show t...
