Publications by Rstats on Julia Silge

TensorFlow, Jane Austen, and Text Generation

03.10.2018

I remember the first time I saw a deep learning text generation project that was truly compelling and delightful to me. It was in 2016 when Andy Herd generated new Friends scenes by training a recurrent neural network on all the show’s episodes. Herd’s work went pretty viral at the time, and my reactions were best expressed as GIFs (via GIPHY). At the ...

5772 sym R (4572 sym/9 pcs) 1 tbl

Word associations from the Small World of Words

15.12.2018

Do you subscribe to the Data is Plural newsletter from Jeremy Singer-Vine? You probably should, because it is a treasure trove of interesting datasets arriving in your email inbox. In the November 28 edition, Jeremy linked to the Small World of Words project, and I was entranced. I love stuff like that, all about words and how people think of the...

5429 sym R (9742 sym/12 pcs) 14 img

Text classification with tidy data principles

23.12.2018

I am an enthusiastic proponent of using tidy data principles for dealing with text data. This kind of approach offers a fluent and flexible option not just for exploratory data analysis, but also for machine learning for text, including both unsupervised machine learning and supervised machine learning. I haven’t written much about supervised m...

7805 sym R (8347 sym/17 pcs) 10 img

Feeling the rstudio::conf ❤️

19.01.2019

I am heading home from my third year of attending rstudio::conf! If you weren’t there, watch for the videos to be released so you can check out the talks; I know I will do the same so I can see the talks I was forced to miss by scheduling constraints. I love this conference, and once again this year, the organizers have succeeded in building an...

9435 sym

Read all about it! Navigating the R Package Universe

23.02.2019

In the most recent issue of the R Journal, I have a new paper out with coauthors John Nash and Spencer Graves. Check out the abstract: Today, the enormous number of contributed packages available to R users outstrips any given user’s ability to understand how these packages work, their relative merits, or how they are related to each other. We...

2618 sym

Writing a letter to DataCamp

15.04.2019

Since 2017 I have been an instructor for DataCamp, the VC-backed online data science education platform. What this means is that I am not an employee, but I have developed content for the company as a contractor. I have two courses there, one on text mining and one on practical supervised machine learning. About two weeks ago, DataCamp published ...

6791 sym

Relaunching the qualtRics package

29.04.2019

Note: cross-posted with the rOpenSci blog. rOpenSci is one of the first organizations in the R community I ever interacted with, when I participated in the 2016 rOpenSci unconf. I have since reviewed several rOpenSci packages and been so happy to be connected to this community, but I have never submitted or maintained a package myself. All that c...

6564 sym

Fixing your mistakes: sentiment analysis edition

13.06.2019

Today tidytext 0.2.1 is available on CRAN! This new release of tidytext has a collection of nice new features: bug squashing; improvements to error messages and documentation; switching from broom to generics for lighter dependencies; and the addition of some helper plotting functions I look forward to blogging about soon. An additional change i...

4213 sym 2 img

Reordering and facetting for ggplot2

30.06.2019

I recently wrote about the release of tidytext 0.2.1, and one of the most useful new features in this release is a couple of helper functions for making plots with ggplot2. These helper functions address a class of challenges that often arises when dealing with text data, so we’ve included them in the tidytext package. Let’s work through an ...

3371 sym R (2269 sym/4 pcs) 8 img

Introducing tidylo

07.07.2019

Today I am so pleased to introduce a new package for calculating weighted log odds ratios, tidylo. Often in data analysis, we want to measure how the usage or frequency of some feature, such as words, differs across some group or set, such as documents. One statistic often used to find these kinds of differences in text data is tf-idf. Another op...

4140 sym R (2543 sym/5 pcs) 4 img