Publications by Rstats on Julia Silge
TensorFlow, Jane Austen, and Text Generation
I remember the first time I saw a deep learning text generation project that was truly compelling and delightful to me. It was in 2016 when Andy Herd generated new Friends scenes by training a recurrent neural network on all the show’s episodes. Herd’s work went pretty viral at the time and I thought: (GIF, via GIPHY) And also: (GIF, via GIPHY) At the ...
5772 sym R (4572 sym/9 pcs) 1 tbl
Word associations from the Small World of Words
Do you subscribe to the Data is Plural newsletter from Jeremy Singer-Vine? You probably should, because it is a treasure trove of interesting datasets arriving in your email inbox. In the November 28 edition, Jeremy linked to the Small World of Words project, and I was entranced. I love stuff like that, all about words and how people think of the...
5429 sym R (9742 sym/12 pcs) 14 img
Text classification with tidy data principles
I am an enthusiastic proponent of using tidy data principles for dealing with text data. This kind of approach offers a fluent and flexible option not just for exploratory data analysis, but also for machine learning for text, including both unsupervised machine learning and supervised machine learning. I haven’t written much about supervised m...
7805 sym R (8347 sym/17 pcs) 10 img
Feeling the rstudio::conf ❤️
I am heading home from my third year of attending rstudio::conf! If you weren’t there, watch for the videos to be released so you can check out the talks; I know I will do the same so I can see the talks I was forced to miss by scheduling constraints. I love this conference, and once again this year, the organizers have succeeded in building an...
9435 sym
Read all about it! Navigating the R Package Universe
In the most recent issue of the R Journal, I have a new paper out with coauthors John Nash and Spencer Graves. Check out the abstract: Today, the enormous number of contributed packages available to R users outstrips any given user’s ability to understand how these packages work, their relative merits, or how they are related to each other. We...
2618 sym
Writing a letter to DataCamp
Since 2017 I have been an instructor for DataCamp, the VC-backed online data science education platform. What this means is that I am not an employee, but I have developed content for the company as a contractor. I have two courses there, one on text mining and one on practical supervised machine learning. About two weeks ago, DataCamp published ...
6791 sym
Relaunching the qualtRics package
Note: cross-posted with the rOpenSci blog. rOpenSci is one of the first organizations in the R community I ever interacted with, when I participated in the 2016 rOpenSci unconf. I have since reviewed several rOpenSci packages and been so happy to be connected to this community, but I have never submitted or maintained a package myself. All that c...
6564 sym
Fixing your mistakes: sentiment analysis edition
Today tidytext 0.2.1 is available on CRAN! This new release of tidytext has a collection of nice new features: bug squashing; improvements to error messages and documentation; switching from broom to generics for lighter dependencies; and the addition of some helper plotting functions I look forward to blogging about soon. An additional change i...
4213 sym 2 img
Reordering and facetting for ggplot2
I recently wrote about the release of tidytext 0.2.1, and one of the most useful new features in this release is a couple of helper functions for making plots with ggplot2. These helper functions address a class of challenges that often arises when dealing with text data, so we’ve included them in the tidytext package. Let’s work through an ...
3371 sym R (2269 sym/4 pcs) 8 img
Introducing tidylo
Today I am so pleased to introduce a new package for calculating weighted log odds ratios, tidylo. Often in data analysis, we want to measure how the usage or frequency of some feature, such as words, differs across some group or set, such as documents. One statistic often used to find these kinds of differences in text data is tf-idf. Another op...
4140 sym R (2543 sym/5 pcs) 4 img