Publications by Kevin Markham
Tidying “messy data” in R
I watched Hadley Wickham’s excellent talk on tidy data and tidy tools, and decided to use this as an opportunity to learn about a few of his R packages. (In case you’re unfamiliar with Hadley, he is well known for his contributions to the R ecosystem, most notably ggplot2; he is also the Chief Scientist for RStudio.) The principles of tidy da...
Example of linear regression and regularization in R
When getting started in machine learning, it’s often helpful to see a worked example of a real-world problem from start to finish. But it can be hard to find an example with the “right” level of complexity for a novice. Here’s what I look for: uses real-world data, not artificially simple data; demonstrates multiple models on the same dat...
Hands-on dplyr tutorial for faster data manipulation in R
I love dplyr. It’s my “go-to” package in R for data exploration, data manipulation, and feature engineering. I use dplyr because it saves me time: its performance is blazing fast on data frames, but even more importantly, I can write dplyr code faster than base R code. Its syntax is intuitive and its functions are well-named, an...
In-depth introduction to machine learning in 15 hours of expert videos
In January 2014, Stanford University professors Trevor Hastie and Rob Tibshirani (authors of the legendary Elements of Statistical Learning textbook) taught an online course based on their newest textbook, An Introduction to Statistical Learning with Applications in R (ISLR). I found it to be an excellent course in statistical learning (als...
Should you teach Python or R for data science?
Last week, I published a post titled Lessons learned from teaching an 11-week data science course, detailing my experiences and recommendations from teaching General Assembly’s 66-hour introductory data science course. In the comments, I received the following question: I’m part of a team developing a course, with NSF support, in data scienc...
Going deeper with dplyr: New features in 0.3 and 0.4 (video tutorial)
In August 2014, I created a 40-minute video tutorial introducing the key functionality of the dplyr package in R. dplyr continues to be my “go-to” package for data exploration and manipulation because of its intuitive syntax, blazing fast performance, and excellent documentation. I recorded that tutorial using the latest version at the time (...