Data source:
Dodgers Major League Baseball data from 2012
Data Source: dodgers.csv
After using Python to explore the Dodgers data and create some visualizations, I am now using R to create Regression models. Looking at the model summary results will help to determine which factors might be the best predictors of attendance.
# All variables in dataset
colnames(dodgers_data)
## [1] "month" "day" "attend" "day_of_week" "opponent"
## [6] "temp" "skies" "day_night" "cap" "shirt"
## [11] "fireworks" "bobblehead"
library(caTools)
# Ensuring random numbers are generated
set.seed(123)
# Splitting data into 80/20 ratio
sample = sample.split(dodgers_data, SplitRatio = 0.20)
# Creating a training dataset and testing dataset
dodgers_train = subset(dodgers_data, sample==FALSE)
dodgers_test = subset(dodgers_data, sample==TRUE)
# Regression model using 'date'
dodgers_data_date_mod <- lm(attend ~ day, data = dodgers_train)
summary(dodgers_data_date_mod)
##
## Call:
## lm(formula = attend ~ day, data = dodgers_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16918.7 -5913.0 -551.6 5877.6 15723.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39539.43 2023.84 19.537 <2e-16 ***
## day 96.28 110.15 0.874 0.385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8536 on 65 degrees of freedom
## Multiple R-squared: 0.01162, Adjusted R-squared: -0.00359
## F-statistic: 0.7639 on 1 and 65 DF, p-value: 0.3853
# Regression model using 'day of week'
dodgers_data_dow_model <- lm(attend ~ day_of_week, data = dodgers_train)
summary(dodgers_data_dow_model)
##
## Call:
## lm(formula = attend ~ day_of_week, data = dodgers_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15451.2 -4146.2 -534.2 4156.7 16480.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40339 2287 17.639 <2e-16 ***
## day_of_weekMonday -6169 3234 -1.907 0.0613 .
## day_of_weekSaturday 3006 3234 0.930 0.3563
## day_of_weekSunday 2301 3314 0.694 0.4901
## day_of_weekThursday 1747 4940 0.354 0.7249
## day_of_weekTuesday 7911 3234 2.446 0.0174 *
## day_of_weekWednesday -3250 3314 -0.981 0.3307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7585 on 60 degrees of freedom
## Multiple R-squared: 0.2796, Adjusted R-squared: 0.2076
## F-statistic: 3.881 on 6 and 60 DF, p-value: 0.002439
# Regression model using 'day of week' and 'month'
dodgers_data_dow_month_mod <- lm(attend ~ day_of_week + month, data = dodgers_train)
summary(dodgers_data_dow_month_mod)
##
## Call:
## lm(formula = attend ~ day_of_week + month, data = dodgers_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11950.6 -4730.7 -516.7 3872.9 15644.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37356.6 2743.1 13.618 < 2e-16 ***
## day_of_weekMonday -6870.5 2873.6 -2.391 0.02032 *
## day_of_weekSaturday 3272.7 2840.6 1.152 0.25434
## day_of_weekSunday 1979.1 2906.2 0.681 0.49879
## day_of_weekThursday 2921.9 4390.7 0.665 0.50858
## day_of_weekTuesday 8074.8 2848.8 2.834 0.00644 **
## day_of_weekWednesday -2736.5 2909.4 -0.941 0.35110
## monthAUG 5126.8 2838.4 1.806 0.07645 .
## monthJUL 7917.3 3032.6 2.611 0.01167 *
## monthJUN 9576.6 3151.7 3.039 0.00366 **
## monthMAY -1603.9 2769.3 -0.579 0.56488
## monthOCT -142.2 4455.4 -0.032 0.97465
## monthSEP 295.5 2969.5 0.099 0.92111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6575 on 54 degrees of freedom
## Multiple R-squared: 0.5128, Adjusted R-squared: 0.4045
## F-statistic: 4.736 on 12 and 54 DF, p-value: 3.158e-05
# Metrics for model 1 - 'date'
glance(dodgers_data_date_mod) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
# Metrics for model 2 - 'day of week'
glance(dodgers_data_dow_model) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
# Metrics for model 3 - 'day of week' and 'month'
glance(dodgers_data_dow_month_mod) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
# Regression model using 'day of week' and 'month'
dodgers_data_dow_month_mod <- lm(attend ~ day_of_week + month, data = dodgers_test)
summary(dodgers_data_dow_month_mod)
##
## Call:
## lm(formula = attend ~ day_of_week + month, data = dodgers_test)
##
## Residuals:
## 1 2 3 4 5 6
## 4.561e+03 -4.561e+03 -3.411e-13 6.629e+03 -6.629e+03 -4.547e-13
## 7 8 9 10 11 12
## -4.093e-12 -4.561e+03 4.561e+03 6.629e+03 -4.561e+03 -2.068e+03
## 13 14
## 2.068e+03 -2.068e+03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43344.4 14614.5 2.966 0.0974 .
## day_of_weekMonday 793.1 17493.7 0.045 0.9680
## day_of_weekSaturday 269.7 12875.0 0.021 0.9852
## day_of_weekSunday -424.5 12109.4 -0.035 0.9752
## day_of_weekThursday -16146.9 17493.7 -0.923 0.4534
## day_of_weekTuesday 8094.8 15565.1 0.520 0.6549
## day_of_weekWednesday 5044.1 13597.5 0.371 0.7463
## monthAUG -3654.6 13830.0 -0.264 0.8163
## monthJUL -12994.3 13120.2 -0.990 0.4264
## monthJUN 21808.6 20821.6 1.047 0.4048
## monthMAY -424.4 13597.5 -0.031 0.9779
## monthSEP -5245.1 13597.5 -0.386 0.7369
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11150 on 2 degrees of freedom
## Multiple R-squared: 0.6529, Adjusted R-squared: -1.256
## F-statistic: 0.342 on 11 and 2 DF, p-value: 0.9041
# Metrics for model 3 - 'day of week' and 'month' (using testing data set)
glance(dodgers_data_dow_month_mod) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
Based on the summaries of these regression models, month and day of the week are the best predictors of attendance at a Dodgers game in 2012. The model using day of the week and month (model 3) had the lowest AIC and BIC scores, the highest R2, and the lowest F-statistic p.value, making it more statistically significant. Running the same model using the testing data also proved successful, with a high R2 and low p.value. All in all, in order to improve attendance, it would be most effective to run a marketing promotion on Tuesdays in June. This way, youโre most likely to reach the largest audience for the promotion.