Publications by Ben Scartz
XG-Boost - Toolkit
load("financial_data_pred.rda") # Load dataset, already split into test and train
XGBoost
XGBoost (eXtreme Gradient Boosting) is an efficient implementation of gradient boosting. It can be used for both classification and regression. XGBoost works as follows:
Create naive model
Calculate the errors of the model
Build a model predicting the errors/...
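The steps listed above can be illustrated with a minimal sketch in base R. It runs on synthetic data, not the financial dataset, and it does not call the xgboost package: it is plain gradient boosting with squared error. The one-split stump learner, the helper names fit_stump and predict_stump, the learning rate eta = 0.3, and the 50 rounds are all illustrative assumptions. XGBoost itself adds regularization and other refinements on top of this idea.

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)   # synthetic target (assumption)

# Weak learner (hypothetical helper): a one-split regression stump fit to r
fit_stump <- function(x, r) {
  splits <- quantile(x, probs = seq(0.05, 0.95, by = 0.05))
  sse <- sapply(splits, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- splits[[which.min(sse)]]      # split with the lowest squared error
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}
predict_stump <- function(m, x) ifelse(x <= m$split, m$left, m$right)

pred <- rep(mean(y), length(y))      # naive model: predict the mean everywhere
naive_mse <- mean((y - pred)^2)

eta <- 0.3                           # learning rate (assumed value)
for (i in 1:50) {
  resid <- y - pred                  # errors of the current model
  stump <- fit_stump(x, resid)       # small model that predicts those errors
  pred <- pred + eta * predict_stump(stump, x)  # add a damped correction
}
boosted_mse <- mean((y - pred)^2)

c(naive = naive_mse, boosted = boosted_mse)
```

Each round can only lower the training error here, so comparing the two values shows how much the error-correcting models add over the naive start. Training error alone says nothing about overfitting, which is why a held-out test set matters in practice.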
KMeans Clustering - Toolkit
K-means clustering sorts data into a specified number of clusters. The algorithm works by iteratively assigning each data point to the nearest cluster center and then recalculating the cluster centers. The process continues until the cluster centers no longer change.
load("cfb_2021_off.rda")
head(off_dat)
## teams Rush_freq_per_game Pass_fr...
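The assign-and-recompute loop described above can be sketched by hand in base R. This is an illustration on synthetic two-dimensional data, not the cfb_2021_off data, and the starting centers (one point taken from each half of the data) are an assumption made so that no cluster ends up empty. In practice the built-in kmeans() function would normally be used instead.

```r
set.seed(42)
# Two well-separated synthetic groups of 20 points each (assumption)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)
k <- 2
centers <- x[c(1, 21), , drop = FALSE]   # one starting center from each group

repeat {
  # Squared Euclidean distance from every point to every center (n x k matrix)
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  # Assign each point to its nearest center
  assignment <- max.col(-d, ties.method = "first")
  # Recompute each center as the mean of the points assigned to it
  new_centers <- t(sapply(seq_len(k), function(j) {
    colMeans(x[assignment == j, , drop = FALSE])
  }))
  # Stop once the centers no longer move
  if (all(abs(new_centers - centers) < 1e-12)) break
  centers <- new_centers
}

table(assignment)
centers
```

The loop stops exactly when an update leaves the centers unchanged, which is the stopping rule stated in the text.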
Decision Trees
load("mh_dat.rda") # Load data into workspace
head(mh_dat) # View first six rows
##   Age self_employed family_history remote_work tech_company   benefits
## 1  37          <NA>             No          No          Yes        Yes
## 2  44          <NA>             No          No           No Don't know
## 3  32          <NA>             No ...
Q1 Modeling
Load necessary packages and data
# library(tidyverse)
# library(xgboost)
# library(caret)
# library(xgboostExplainer)
year1 <- read.csv('year1.csv')
year2 <- read.csv('year2.csv')
all_data <- rbind(year1, year2)
Classify all rows as swing / no swing
unique(all_data$description)
## [1] "ball" "foul"
## ...
Q3 Metrics
Question: Swing probability is the backbone of several other important metrics that compare individual players to league average. Create one such metric and aggregate it by player for Season 2. In 250 words, explain the metric methodology. Send the top 10 and bottom 10 players in the leaderboard for this metric.
#library(tidyverse)
#library(xgb...
Lasso Regression - Toolkit
In this analysis we want to determine which factors influence life expectancy in different countries. The data cover a range of countries, giving each country's life expectancy for a given year along with some public health variables.
load("life_expectancy.rda")
Missing Data Imputation
We can impute missing values u...
Logistic Regression - Toolkit
bank.df <- read.csv("UniversalBank.csv")
bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code columns.
# treat Education as categorical (R will create dummy variables)
bank.df$Education <- factor(bank.df$Education, levels = c(1, 2, 3), labels = c("Undergrad", "Graduate", "Advanced/Professional"))
head(bank.df) ...
Linear Regression (Toolkit)
For this workbook we will be applying linear regression to determine the factors that play a role in insurance pricing. During this workshop we will:
Apply simple linear regression
Apply multiple linear regression
Create interaction terms in linear regression
Interpret the output of linear regression
Preliminary Steps
We first need to load our ...
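The model types named in the workshop goals above can be sketched with lm() as follows. The insurance data is not shown in this excerpt, so the built-in mtcars dataset stands in for it; the variables mpg, wt, and hp and the object names are assumptions chosen only for illustration.

```r
# mtcars is a stand-in for the insurance data (assumption)
simple_fit   <- lm(mpg ~ wt, data = mtcars)        # simple: one predictor
multiple_fit <- lm(mpg ~ wt + hp, data = mtcars)   # multiple: several predictors
interact_fit <- lm(mpg ~ wt * hp, data = mtcars)   # wt, hp, and the wt:hp interaction

# Coefficient table: estimates, standard errors, t values, and p values
summary(interact_fit)$coefficients
```

In the formula, wt * hp expands to wt + hp + wt:hp, so the interaction model keeps both main effects alongside the interaction term.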
Linear Regression
For this workbook we will be applying linear regression to determine the factors that play a role in insurance pricing. During this workshop we will:
Apply simple linear regression
Apply multiple linear regression
Create interaction terms in linear regression
Interpret the output of linear regression
Preliminary Steps
We first...
Scouting Dashboard
This is the supporting code used to create the data for my scouting dashboard. The dashboard is meant to provide all of the information that I like to collect before scouting a pitcher. All of the information is derived from raw Statcast data. In the context of a pro scouting department, this code can be manipulated based on each scout’s pref...