OVERVIEW Fears of the coronavirus crashed the stock market in February 2020, with the sell-off beginning on February 24. The pandemic sent a shockwave through the global market and has continued to wreak havoc on humanity. The fears spread quickly and globally, e.g. over 70% of the world population was under lockdown at some point in March. Rece...
Overview In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution c...
Time Series Analysis This is a time series analysis tutorial on building an ARIMA model. We will be using a simple dataset from the hotel industry. The original sample data has only four columns, i.e. date, room_sold, adr, and revenue. The “date” refers to the historical check-in date at a hotel in NYC, whereas the “room...
Odds Ratio Oftentimes, we have to deal with a lot of unclean, missing categorical data, and our goal is to extract key insights and features from various attributes to build some sort of customer profile. For example, imagine you have a data set that has only five variables, i.e. var_a is subject Id, var_b is gender, var_c is education, var...
Introduction This homework exercise is to build a logistic regression model and a multiple regression model that will estimate the likelihood of a car accident and, if one occurs, predict its cost. We have two response variables, i.e. TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary field where 1 is equal to crash, ...
Survival Model
Survival model This is a simple tutorial on building a survival model for a subscription business. The use case is that a media company offers various subscription plans to its customers. Each plan is associated with a different price and billing period, e.g. Annual, Month, Semi-Annual, Two-Year, etc. We need to infer from the billing period associated...
data_621 - Logistic Regression
load packages, data
Classification metrics exercise
df <- read.csv("classification-output-data.csv", header = TRUE)
dfSubset <- df %>% dplyr::select(class, scored.class, scored.probability)
rawConfusionMatrix <- with(dfSubset, table(scored.class, class))
rawConfusionMatrix
##             class
## scored.class   0   1
##            0 119  30
##            1   5  27

The confusion matrix summar...
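As a rough sketch of how the counts in the confusion matrix above could be turned into the usual classification metrics, the snippet below re-enters the four counts by hand instead of reading the CSV. It assumes that class 1 is the positive class and that rows are predictions (scored.class) while columns are actual labels (class). The names cm, tn, fp, fn, tp and metrics are illustrative and not taken from the original post.

```r
# Counts copied from the printed confusion matrix
# (rows = scored.class, columns = class). matrix() fills column by column.
cm <- matrix(
  c(119, 5,   # actual class 0: predicted 0, predicted 1
    30, 27),  # actual class 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(scored.class = c("0", "1"), class = c("0", "1"))
)

# Assumption: class 1 is treated as the positive class.
tn <- cm["0", "0"]  # predicted 0, actual 0
fp <- cm["1", "0"]  # predicted 1, actual 0
fn <- cm["0", "1"]  # predicted 0, actual 1
tp <- cm["1", "1"]  # predicted 1, actual 1

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),  # true positive rate (recall)
  specificity = tn / (tn + fp),  # true negative rate
  precision   = tp / (tp + fp)   # positive predictive value
)

print(round(metrics, 3))
```

If the positive class were 0 instead, sensitivity and specificity would swap, so the orientation of the table is worth checking before reading off the numbers.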
Data 621 - Moneyball (hw1)
library(tidyverse)
Logistic Regression Logistic regression is a very common tool for solving classification problems. Given a binary outcome, we would like to classify whether an event will occur based on a set of quantitative or qualitative variables. In this blog post, we use a public dataset to classify loan status based on various sociodemographic ...
