Publications by inkhorn82
Sampling and the Analysis of Big Data
After my last post, I came across a few articles supporting the opinion that if you have a good reason to take random samples from a “big” dataset, you’re not committing some kind of sin: “Big Data Blasphemy: Why Sample?” and “To Sample or Not to Sample… Does it Even Matter?” The moral of the story is that you can sample from “big data” so lo...
961 sym 16 img
Fun Editing R Graphs in Inkscape
Last week, I read a chapter out of Visualize This by Nathan Yau. I was, of course, delighted to see that he was championing the use of R. One really cool thing that I learned from his book, and was very surprised about, was that you can export an R graph in PDF form and then easily edit individual elements of the graph in Adobe Illustrator. ...
3434 sym 26 img
PostgreSQL, Excel, R, and a Really Big Data Set!
At work I’ve started to work with the biggest data set I’ve ever seen! First, let me qualify my use of the term “Big Data”. The number of rows in the resultant data set (after much transformation and manipulation in PostgreSQL and to a lesser extent in Excel) is only just 395,928. This will certainly pale in comparison to truly BIG ...
4053 sym 16 img
Projects in RStudio
Now that I have one enormous project on the go and one smaller one, I find it’s helping me considerably to keep each one in its own RStudio project. So, each project has its own scripting that I’ve been working on, its own extra variables or data frames that I’ve had to construct for it, and doesn’t take up more of my RAM...
835 sym 16 img
Guess who wins: apply() versus for loops in R
Yesterday I tried to do some data processing on my really big data set in MS Excel. Wow, did it not like handling all those data!! Every time I tried to click on a different ribbon, the screen didn’t even register that I had clicked on that ribbon. So, I took the hint, and decided to do my data processing in R. One of the tasks that I needed...
1779 sym 16 img
Mining for relations between nominal variables
The task today was to find which variables had significant relations with an important grouping variable in the big dataset I’ve been working with lately. The grouping variable has 3 levels, and represents different behaviours of interest. At first I tried putting the grouping variable as a dependent variable in a multinomial logistic regres...
1734 sym 16 img
Ack! Duplicates in the Data!
As I mentioned in a previous post, I compiled the data set that I’m currently working on in PostgreSQL. To get this massive data set, I had to write a query that was massive by dint of the number of LEFT JOINs that I had to write. Today I caught myself wondering if I had remembered to add in DISTINCT to the SELECT clause in my query, as tha...
2130 sym 18 img
Memory Management in R, and SOAR
The more I’ve worked with my really large data set, the more cumbersome the work has become for my work computer. Keep in mind I’ve got a quad core with 8 gigs of RAM. With growing irritation at how slow my work computer becomes at times while working with these data, I took to finding better ways of managing my memory in R. The best/easie...
3330 sym R (11 sym/1 pcs) 16 img
An embarrassing admission; Copy pasting tables with text containing spaces from Excel to R
I can’t believe I didn’t learn how to do it earlier, but I never knew how to accurately copy tables from Excel that had text with spaces in them, and paste them into a data frame in R without the spaces being mistaken for separators between variables. Say you have a column title in a table in Excel like “Group Size”. You then copy t...
1446 sym R (111 sym/1 pcs) 16 img
Functions ddply and melt make plotting summary stats in R more tolerable
The main reason I have usually chosen to use Excel to make my plots at work is that I had difficulty feeding the summary stats in R into a plotting function. One thing I learned this week is how to make summary stats into a data frame suitable for plotting, making the whole process of plotting in R more tolerable for me. Below I show t...
2160 sym R (920 sym/2 pcs) 20 img