Publications by Ben Scartz

XG-Boost - Toolkit

02.10.2024

load("financial_data_pred.rda") # Load dataset (already split into test and train). XGBoost: XGBoost (eXtreme Gradient Boosting) is an efficient implementation of gradient boosting. It can be used for both classification and regression. XGBoost works as follows: (1) create a naive model; (2) calculate the errors of the model; (3) build a model predicting the errors/...

3036 sym 1 img
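The boosting steps listed in the XGBoost excerpt above (naive model, errors, a model that predicts the errors) can be sketched in base R. This is a minimal illustration of plain gradient boosting with one-split stumps, not the xgboost library itself, which adds regularization and other refinements. The toy data, the helper names `fit_stump` and `predict_stump`, the learning rate `eta`, and the "add a damped correction and repeat" step are assumptions for illustration, since the excerpt is cut off before its remaining steps.

```r
# Toy regression data (hypothetical), not financial_data_pred.rda.
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

# Fit a one-split "stump" to residuals r: pick the threshold with the lowest squared error.
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in sort(unique(x))[-1]) {
    left <- x < s
    sse <- sum((r[left] - mean(r[left]))^2) + sum((r[!left] - mean(r[!left]))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s, left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}
predict_stump <- function(stump, x) ifelse(x < stump$split, stump$left, stump$right)

eta <- 0.3                        # learning rate (assumed value)
pred <- rep(mean(y), length(y))   # step 1: naive model predicts the mean
mse_naive <- mean((y - pred)^2)
for (m in 1:50) {
  resid <- y - pred               # step 2: errors of the current model
  stump <- fit_stump(x, resid)    # step 3: model that predicts those errors
  pred <- pred + eta * predict_stump(stump, x)  # assumed step: add a damped correction
}
mse_boosted <- mean((y - pred)^2)
print(c(naive = mse_naive, boosted = mse_boosted))
```

Each round fits only the part of the signal the current model still gets wrong, so the training error of the combined model shrinks as rounds are added.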

KMeans Clustering - Toolkit

01.10.2024

K-means clustering sorts data into a specified number of clusters. The algorithm works by iteratively assigning each data point to the nearest cluster center and then recalculating the cluster centers. The process continues until the cluster centers no longer change. load("cfb_2021_off.rda") head(off_dat) ## teams Rush_freq_per_game Pass_fr...

1624 sym 2 img
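The assign-then-recalculate loop described in the K-means excerpt above can be sketched in a few lines of base R. This is an illustrative version, not the code from the publication; the toy data and the function name `simple_kmeans` are hypothetical, and in practice the built-in `kmeans()` function would normally be used instead.

```r
# Toy data (hypothetical): two 2-D blobs, not the cfb_2021_off data.
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

simple_kmeans <- function(x, k, max_iter = 100) {
  # Start from k randomly chosen rows as the initial centers.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # Assignment step: squared distance from every point to every center.
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- max.col(-d, ties.method = "first")
    # Update step: each center becomes the mean of its assigned points.
    new_centers <- centers
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        new_centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
    # Stop once the centers no longer change.
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}

fit <- simple_kmeans(x, k = 2)
print(table(fit$cluster))
print(fit$centers)
```

Because the starting centers are random, different seeds can lead to different final clusters, which is why the built-in `kmeans()` offers an `nstart` argument to try several starts.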

Decision Trees

26.09.2024

load("mh_dat.rda") # Load data into workspace head(mh_dat) # View first six rows ## Age self_employed family_history remote_work tech_company benefits ## 1 37 <NA> No No Yes Yes ## 2 44 <NA> No No No Don't know ## 3 32 <NA> No ...

1369 sym 1 img

Q1 Modeling

11.05.2024

Load necessary packages and data # library(tidyverse) # library(xgboost) # library(caret) # library(xgboostExplainer) year1 <- read.csv('year1.csv') year2 <- read.csv('year2.csv') all_data <- rbind(year1, year2) Classify all rows as swing / no swing unique(all_data$description) ## [1] "ball" "foul" ## ...

2045 sym R (18989 sym/39 pcs) 3 img

Q3 Metrics

11.05.2024

Question: Swing probability is the backbone of several other important metrics that compare individual players to league average. Create one such metric and aggregate it by player for Season 2. In 250 words, explain the metric methodology. Send the top 10 and bottom 10 players in the leaderboard for this metric. #library(tidyverse) #library(xgb...

2774 sym R (7646 sym/17 pcs)

Lasso Regression - Toolkit

07.03.2024

For this analysis, we wish to determine which factors play a role in life expectancy in different countries. We have gathered data from different countries: the life expectancy in each country for a given year and some public health variables. load("life_expectancy.rda") Missing Data Imputation: We can impute missing values u...

1498 sym

Logistic Regression - Toolkit

19.02.2024

bank.df <- read.csv("UniversalBank.csv") bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code columns. # treat Education as categorical (R will create dummy variables) bank.df$Education <- factor(bank.df$Education, levels = c(1, 2, 3), labels = c("Undergrad", "Graduate", "Advanced/Professional")) head(bank.df) ...

304 sym R (4881 sym/13 pcs)

Linear Regression (Toolkit)

25.01.2024

For this workbook we will be applying linear regression to determine the factors that play a role in insurance pricing. During this workshop we will: apply simple linear regression; apply multiple linear regression; create interaction terms in linear regression; interpret the output of linear regression. Preliminary Steps: We first need to load our ...

8008 sym R (10695 sym/37 pcs) 4 img

Linear Regression

17.01.2024

Linear Regression: For this workbook we will be applying linear regression to determine the factors that play a role in insurance pricing. During this workshop we will: apply simple linear regression; apply multiple linear regression; create interaction terms in linear regression; interpret the output of linear regression. Preliminary Steps: We first...

12531 sym R (22866 sym/74 pcs) 14 img

Scouting Dashboard

05.01.2024

This is the supporting code used to create the data for my scouting dashboard. The dashboard is meant to provide all of the information that I like to collect before scouting a pitcher. All of the information is derived from raw Statcast data. In the context of a pro scouting department, this code can be manipulated based on each scout’s pref...

1268 sym R (13247 sym/15 pcs) 2 img