Publications by Rstats on Julia Silge

New sports from random emoji

24.11.2017

I love emoji ❤️ and I love xkcd, so this recent comic from Randall Munroe was quite a delight for me. I sat there, enjoying the thought of these new sports like horse hole and multiplayer avocado and I thought, “I can make more of these in just the barest handful of lines of code”. This is largely thanks to the emo package by Hadley Wick...

1825 sym R (469 sym/2 pcs) 2 img

Tidy word vectors, take 2!

26.11.2017

A few weeks ago, I wrote a post about finding word vectors using tidy data principles, based on an approach outlined by Chris Moody on the StitchFix tech blog. I’ve been pondering how to improve this approach, and whether it would be nice to wrap up some of these functions in a package, so here is an update! Like in my previous post, let’s do...

4181 sym R (6963 sym/8 pcs) 2 img

tidytext 0.1.6

09.01.2018

I am pleased to announce that tidytext 0.1.6 is now on CRAN! Most of this release, as well as the 0.1.5 release which I did not blog about, was for maintenance, updates to align with API changes from tidytext’s dependencies, and bug fixes. I just spent a good chunk of effort getting tidytext to pass R CMD check on older versions of R despite the fac...

3345 sym R (5007 sym/4 pcs) 4 img

The game is afoot! Topic modeling of Sherlock Holmes stories

24.01.2018

In a recent release of tidytext, we added tidiers and support for building Structural Topic Models from the stm package. This is my current favorite implementation of topic modeling in R, so let’s walk through an example of how to get started with this kind of modeling, using The Adventures of Sherlock Holmes. You can wa...

909 sym

Stack Overflow questions around the world

10.04.2018

I am so lucky to work with so many generous, knowledgeable, and amazing people at Stack Overflow, including Ian Allen and Kirti Thorat. Both Ian and Kirti are part of biweekly sessions we have at Stack Overflow where several software developers join me in practicing R, data science, and modeling skills. This morning, the two of them w...

1046 sym

Understanding PCA using Stack Overflow data

17.05.2018

This year, I have given some talks about understanding principal component analysis using the data I work with day in and day out, Stack Overflow data. You can see a recording of one of these talks from rstudio::conf 2018. When I have given these talks, I’ve focused a lot on understanding PCA. This blog post walks through how I impleme...

804 sym

Public Data Release of Stack Overflow’s 2018 Developer Survey

29.05.2018

Note: Cross-posted with the Stack Overflow blog. Starting today, you can access the public data release for Stack Overflow’s 2018 Developer Survey. Over 100,000 developers from around the world shared their opinions about everything from their favorite technologies to job preferences, and this data is now available for you to analyz...

987 sym

Punctuation in literature

29.06.2018

This morning I was scrolling through Twitter and noticed Alberto Cairo share this lovely data visualization piece by Adam J. Calhoun about the varying prevalence of punctuation in literature. I thought, “I want to do that!” It also offers me the opportunity to chat about a few of the new options available for tokenizing in tidytex...

904 sym

Amazon Alexa and Accented English

18.07.2018

Earlier this spring, one of my data science friends here in SLC got in contact with me about some fun analysis. My friend Dylan Zwick is a founder at Pulse Labs, a voice-testing startup, and they were chatting with the Washington Post about a piece on how devices like Amazon Alexa deal with accented English. The piece is published today in the Wa...

4037 sym R (3616 sym/8 pcs) 4 img 2 tbl

Training, evaluating, and interpreting topic models

07.09.2018

At the beginning of this year, I wrote a blog post about how to get started with the stm and tidytext packages for topic modeling. I have been doing more topic modeling in various projects, so I wanted to share some workflows I have found useful for training many topic models at one time, evaluating topic models and understanding model diagnosti...

6711 sym R (6903 sym/12 pcs) 6 img 1 tbl