Publications by Mircea Dumitru

Modeling Data in the Tidyverse

19.06.2024

1 Reading and Cleaning the Data library(tidyverse) library(tidymodels) library(here) library(janitor) train <- read_csv(here::here("data", "tidy_data", "data_complaints_train.csv")) library(skimr) #glimpse(train) #library(visdat) #vis_dat(train) #vis_miss(train) skim(train) Table 1.1: Data summary Name train Number of rows 90975 Number of co...

15627 sym R (30637 sym/53 pcs) 7 img 18 tbl

Practical Machine Learning: Quiz 4

22.08.2023

Question 1. Load the vowel.train and vowel.test data sets: library(ElemStatLearn) data(vowel.train) data(vowel.test) Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit a random forest predictor relating the factor variable \(y\) to the remaining variables, and a boosted predictor using ‘gbm’...

1739 sym R (3460 sym/43 pcs) 1 img

Practical Machine Learning: Week 4

22.08.2023

Regularized Regression Basic Idea Fit a regression model Penalize (or shrink) large coefficients Pros: Can help with the bias/variance tradeoff Can help with model selection Cons: May be computationally demanding on large data sets Does not perform as well as random forests or boosting A motivating example Suppose a regression model with two...

10819 sym R (10278 sym/63 pcs) 16 img

Practical Machine Learning: Quiz 3

16.03.2023

Question 1 Load the cell segmentation data from the AppliedPredictiveModeling package using the commands: library(AppliedPredictiveModeling) data(segmentationOriginal) suppressMessages(library(caret)) Subset the data to a training set and testing set based on the Case variable in the data set. Set the seed to 125 and fit a CART model with the rpar...

2860 sym R (4488 sym/39 pcs) 2 img

Practical Machine Learning: Week 3

16.03.2023

Predicting with trees Key ideas Iteratively split variables into groups. Evaluate homogeneity within each group. Split again if necessary. Pros Easy to interpret. Better performance in non-linear settings. Cons Without pruning/cross-validation can lead to overfitting. Harder to estimate uncertainty. Results may be variable, depending on the ex...

9586 sym R (13889 sym/67 pcs) 10 img

Practical Machine Learning: Quiz 2

10.03.2023

Question 1 Load the Alzheimer’s disease data using the commands: library(AppliedPredictiveModeling) data(AlzheimerDisease) Which of the following commands will create non-overlapping training and test sets with about 50% of the observations assigned to each? Answer 1 library(caret) ## Loading required package: ggplot2 ## Loading required package...

1952 sym R (6838 sym/33 pcs) 2 img

Practical Machine Learning: Week 2

10.03.2023

The caret package The caret package (short for Classification And REgression Training) is a front end package that wraps around a lot of the prediction algorithms and tools in the R programming language. https://topepo.github.io/caret/ The package contains tools for: Preprocessing (cleaning) preProcess Data splitting createDataPartition create...

7319 sym R (27827 sym/163 pcs) 29 img 1 tbl

Regression Models: Generalized Linear Models - Week 4

04.03.2023

Module 11: GLMs The three most famous cases of GLMs are: linear models, binomial and binary regression, and Poisson regression. Linear models Linear models are the most useful applied statistical technique. However, they are not without their limitations. The assumption of an additive response model is not justified if the response is discrete o...

11118 sym 12 img

Regression Models: Generalized Linear Models - Quiz 4

04.03.2023

Question 1 Consider the space shuttle data ?shuttle in the MASS library. Consider modeling the use of the autolander as the outcome (variable name use). Fit a logistic regression model with autolander (variable auto) use (labeled as “auto” 1) versus not (0) as predicted by wind sign (variable wind). Give the estimated odds ratio for autolander ...

1981 sym R (3486 sym/32 pcs) 1 img

Regression Models: Assignment

04.03.2023

Executive Summary Looking at the data set of a collection of cars mtcars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). The main questions addressed are: Is an automatic or manual transmission better for MPG? Quantify the MPG difference between automatic and manual transmissions? ...

6758 sym R (6556 sym/20 pcs) 2 img