Publications by Keith Colella
Fuzzy Join Vignette
library(tidyverse) library(fuzzyjoin) Overview This vignette will introduce the fuzzyjoin package, which enables joining of two datasets based on imperfect matches. This package is very helpful for combining data without unique keys. We will use data related to candidates running in the 2022 election for the House of Representatives. Specific...
6139 sym R (11045 sym/35 pcs)
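The idea of joining on imperfect matches can be sketched in base R. This is a minimal illustration, not the vignette's own code: it swaps in `adist()` (edit distance) in place of the fuzzyjoin package, and the data frames, column names, and the `max_dist` threshold are all hypothetical.

```r
# Hypothetical data: two tables whose name columns do not match exactly.
candidates <- data.frame(
  name = c("Jon Smith", "Maria Garcia", "Robert Lee"),
  stringsAsFactors = FALSE
)
finance <- data.frame(
  name = c("John Smith", "Maria Garcia", "Rob Lee"),
  raised = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# Edit distance between every candidate name and every finance name.
d <- adist(candidates$name, finance$name)

# Keep pairs whose distance is within an assumed tolerance.
max_dist <- 3
pairs <- which(d <= max_dist, arr.ind = TRUE)

# Assemble the matched rows, keeping the distance for inspection.
matched <- data.frame(
  name_x = candidates$name[pairs[, "row"]],
  name_y = finance$name[pairs[, "col"]],
  raised = finance$raised[pairs[, "col"]],
  dist = d[pairs],
  stringsAsFactors = FALSE
)
```

The threshold is the main design choice: too small and true matches such as "Robert Lee" and "Rob Lee" are missed, too large and unrelated names start to pair up.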
Week 10 - NLP and “Text Mining with R”
Reperformance of Textbook Exercises We begin by re-performing the text mining and sentiment analysis from Chapter 2 of Silge and Robinson’s “Text Mining with R” (https://www.tidytextmining.com/). I’ve directly leveraged the code and snippets of explanatory text from their book. In Part 2, I’ll complete the assignment by extending the an...
5013 sym R (8599 sym/45 pcs) 5 img
Week 10 - NLP and Congressional Candidate Tweets
Assignment We’ve reviewed the sentiment analysis from Chapter 2 of Silge and Robinson’s “Text Mining with R” (https://www.tidytextmining.com/). Now, I’ll perform similar analysis on another corpus. The corpus I’ve chosen is a collection of ~280,000 tweets from 424 congressional candidates from the 2022 election cycle (hosted here on ...
6127 sym R (4713 sym/13 pcs) 5 img 5 tbl
Data607 - Week 9 - Web APIs
Assignment Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R DataFrame. Setup In addition to standard tidyverse usage, we’ll leverage jsonlite for querying the API, lubridate for formatting dates, and kableExtra to display our results. library(tidyverse) li...
2015 sym R (3298 sym/10 pcs) 5 tbl
Week 7 - Web Technologies
library(tidyverse) library(rvest) library(xml2) library(jsonlite) HTML Read in the file html <- read_html('https://raw.githubusercontent.com/kac624/cuny/main/D607/data/week7_books.html') Explore html %>% html_elements('title') ## {xml_nodeset (1)} ## [1] <title>This page has a table for D607.</title> html %>% html_elements('td') ## {xml_nodese...
183 sym R (3043 sym/19 pcs) 3 tbl
Project 2 - Dataset 2
library(tidyverse) library(reshape2) Read in CSV TBD data <- read_csv('https://github.com/jwilber/Bob_Ross_Paintings/raw/master/data/bob_ross_paintings.csv') ## New names: ## Rows: 403 Columns: 28 ## ── Column specification ## ──────────────────────────────────────...
26 sym R (1825 sym/5 pcs)
Project 2 - Dataset 3
library(tidyverse) library(reshape2) Read in CSV TBD data <- read_csv('https://github.com/kac624/cuny/raw/main/D607/data/healthcare_empl.csv', skip = 3) ## New names: ## Rows: 48 Columns: 15 ## ── Column specification ## ──────────────────────────────────...
28 sym R (2590 sym/7 pcs)
Project 2 - Dataset 1
library(tidyverse) library(arrow) library(lubridate) library(sf) library(cowplot) Introduction and Exploratory Data Analysis I’ll focus on a massive dataset detailing all taxi rides in New York City since 2009. The data is maintained by the NYC government at the following site. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page ...
6250 sym R (12886 sym/20 pcs) 2 img
Extra Credit - ELO Calculations
library(tidyverse) Assignment Based on the difference in ratings between the chess players and each of their opponents in our Project 1 tournament, calculate each player’s expected score (e.g. 4.3) and the difference from their actual score (e.g. 4.0). List the five players who most overperformed relative to their expected score, and the five pl...
2154 sym R (3918 sym/11 pcs) 1 img
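The expected-score calculation described in the assignment above can be sketched with the standard Elo formula, E = 1 / (1 + 10^((R_opponent - R_player) / 400)). This is a minimal sketch in base R, not the author's code. The function name, ratings, and actual score are hypothetical values chosen for illustration.

```r
# Expected score of a player against one opponent under the standard Elo model.
expected_score <- function(player_rating, opponent_rating) {
  1 / (1 + 10^((opponent_rating - player_rating) / 400))
}

# Hypothetical player, opponents, and tournament result.
player_rating <- 1600
opponent_ratings <- c(1500, 1600, 1700, 1800)
actual_score <- 2.5  # e.g. two wins, one draw, one loss

# Total expected score is the sum of the per-game expectations.
expected_total <- sum(expected_score(player_rating, opponent_ratings))

# Positive values mean the player overperformed relative to expectation.
difference <- actual_score - expected_total
```

Sorting players by `difference` would then give the most overperforming and underperforming players.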
Week 5 - Tidying and Transforming Data
library(tidyverse) library(reshape2) library(scales) Read in CSV First, I’ll read in the data from github. The flights data comes in .csv format, formatted exactly as provided in the assignment. data <- read_csv('https://raw.githubusercontent.com/kac624/cuny/main/D607/data/week5_flights.csv') ## New names: ## Rows: 5 Columns: 7 ## ── Co...
2324 sym R (3505 sym/12 pcs) 3 img