Publications by Jeff Shamp

tidymodels_spam_ham

11.06.2020

Spam vs Ham - Tidymodels Remix Jeff Shamp 2020-06-10 Cleaning and Data Preparation First, we will import, clean, and process the data for classification. Datasets I found a large library of datasets here. Below is an excerpt from the site regarding a library of resources. The email spam messages are collected from: The ENRON email archive Th...

3894 sym R (5861 sym/17 pcs)

Data 621 - Mean Difference Test

31.08.2020

Mean Difference Testing Jeff Shamp 2020-08-31 Mean Difference Test - Introduction Often we run into data with a mixture of numerical values and categorical values, and it is important to consider both with rigor to extract insight. One approach that I like to use when exploring data is mean difference testing. That is to say, for some categorica...

4134 sym R (1623 sym/6 pcs) 2 img

Data 621-Simpsons Paradox

31.08.2020

Simpson’s Paradox Jeff Shamp 2020-08-31 Introduction Simpson’s paradox is a phenomenon that arises in data analysis and that, if not considered, may lead to spurious conclusions or misleading predictions. The general idea of the paradox is that a data set as a whole may appear to trend in one direction (positive or negative) but trend in the opp...

2590 sym R (3589 sym/11 pcs) 6 img

predicting senators voting patterns

03.09.2020

Regression Analysis for US Senators' Voting Patterns Jeff Shamp 2020-09-03 Introduction Is the net vote share won by Trump per district in 2016 predictive of a Senator’s voting record with regard to the Trump agenda? Is the predicted agreement (with Trump) value generated by 538.com more predictive of a Senator’s actual voting record? 538 upd...

6465 sym R (7791 sym/25 pcs) 12 img

normality testing and transformation

04.09.2020

Normality Testing and Transformation Jeff Shamp 2020-09-03 Introduction We will be using the Boston Housing dataset for this blog post. This is the version of the data set that appeared on Kaggle and is used for open competitions. Read more about this data set here. library(tidyverse) Below is a sample of the data. Each row is an observation of a...

1664 sym R (2829 sym/14 pcs) 12 img

Astronaut Regression Analysis

04.09.2020

Astronaut Dataset Regression Jeff Shamp 2020-09-03 Introduction I have been working on getting up to speed with Tidymodels since it was released last spring. For the fifth and final blog, let’s use tidymodels to compare a few regression models. This data set is from the good people at RStudio and it details space missions from 1960 to 2020. I...

1890 sym R (4674 sym/21 pcs) 1 img

Shamp BxD assessment

21.09.2020

BxD Data Assessment Jeff Shamp 2020-09-21 Question 1 - SQL Using the data dictionary on the following page, please write a SQL query to get the top five most frequent felony charges for cases opened in the last three months. “Charge1a” in the cases table is assumed to be the primary key to “charge” in the charge codes table. This is usin...

3363 sym R (531 sym/1 pcs) 3 img

ccrb clean up

15.10.2020

CCRB Clean Up and EDA Jeff Shamp - BxD 2020-10-16 Raw Data This is the NYCLU raw .csv file from their GitHub. library(tidyverse) ccrb_df <- read.csv("https://raw.githubusercontent.com/new-york-civil-liberties-union/NYPD-Misconduct-Complaint-Database/master/CCRB_database_raw.csv") First, let’s make a column for the officer’s full name, and turn th...

2318 sym R (10218 sym/13 pcs)

shamp_624_hw_wk3

22.02.2021

HW week 2 - 624 - Spring 2021 Jeff Shamp 2021-02-21 3.1-3.3 and 3.8 # Question HA 3.1 ## Question For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance. usnetelec usgdp mcopper enplanements Answer usnetelec First, let’s visualize this data. We see that this is trending data, but there does not ...

4431 sym R (2816 sym/28 pcs) 14 img

HW1 - 624 - Shamp

08.02.2021

DATA 624 - HW1 Jeff Shamp 2021-02-08 Question 2.1 Questions Use the help function to explore what the series gold, woolyrnq and gas represent. a. Use autoplot() to plot each of these in separate plots. b. What is the frequency of each series? Hint: apply the frequency() function. c. Use which.max() to spot the outlier in the gold series. Which...

8999 sym R (2156 sym/56 pcs) 33 img