Objective: Determine what night would be the best to run a marketing promotion to increase attendance.

Data source:
Dodgers Major League Baseball data from 2012
Data Source: dodgers.csv

After using Python to explore the Dodgers data and create some visualizations, I am now using R to create Regression models. Looking at the model summary results will help to determine which factors might be the best predictors of attendance.

# All variables in dataset
colnames(dodgers_data)
##  [1] "month"       "day"         "attend"      "day_of_week" "opponent"   
##  [6] "temp"        "skies"       "day_night"   "cap"         "shirt"      
## [11] "fireworks"   "bobblehead"

Splitting data into train and test sets

library(caTools)

# Ensuring random numbers are generated
set.seed(123) 

# Splitting data into 80/20 ratio
sample = sample.split(dodgers_data, SplitRatio = 0.20)

# Creating a training dataset and testing dataset
dodgers_train = subset(dodgers_data, sample==FALSE)
dodgers_test = subset(dodgers_data, sample==TRUE)

Regression models using training data

# Regression model using 'date'
dodgers_data_date_mod <- lm(attend ~ day, data = dodgers_train)
summary(dodgers_data_date_mod)
## 
## Call:
## lm(formula = attend ~ day, data = dodgers_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16918.7  -5913.0   -551.6   5877.6  15723.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39539.43    2023.84  19.537   <2e-16 ***
## day            96.28     110.15   0.874    0.385    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8536 on 65 degrees of freedom
## Multiple R-squared:  0.01162,    Adjusted R-squared:  -0.00359 
## F-statistic: 0.7639 on 1 and 65 DF,  p-value: 0.3853
# Regression model using 'day of week'
dodgers_data_dow_model <- lm(attend ~ day_of_week, data = dodgers_train)
summary(dodgers_data_dow_model)
## 
## Call:
## lm(formula = attend ~ day_of_week, data = dodgers_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15451.2  -4146.2   -534.2   4156.7  16480.6 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             40339       2287  17.639   <2e-16 ***
## day_of_weekMonday       -6169       3234  -1.907   0.0613 .  
## day_of_weekSaturday      3006       3234   0.930   0.3563    
## day_of_weekSunday        2301       3314   0.694   0.4901    
## day_of_weekThursday      1747       4940   0.354   0.7249    
## day_of_weekTuesday       7911       3234   2.446   0.0174 *  
## day_of_weekWednesday    -3250       3314  -0.981   0.3307    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7585 on 60 degrees of freedom
## Multiple R-squared:  0.2796, Adjusted R-squared:  0.2076 
## F-statistic: 3.881 on 6 and 60 DF,  p-value: 0.002439
# Regression model using 'day of week' and 'month'
dodgers_data_dow_month_mod <- lm(attend ~ day_of_week + month, data = dodgers_train)
summary(dodgers_data_dow_month_mod)
## 
## Call:
## lm(formula = attend ~ day_of_week + month, data = dodgers_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11950.6  -4730.7   -516.7   3872.9  15644.4 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           37356.6     2743.1  13.618  < 2e-16 ***
## day_of_weekMonday     -6870.5     2873.6  -2.391  0.02032 *  
## day_of_weekSaturday    3272.7     2840.6   1.152  0.25434    
## day_of_weekSunday      1979.1     2906.2   0.681  0.49879    
## day_of_weekThursday    2921.9     4390.7   0.665  0.50858    
## day_of_weekTuesday     8074.8     2848.8   2.834  0.00644 ** 
## day_of_weekWednesday  -2736.5     2909.4  -0.941  0.35110    
## monthAUG               5126.8     2838.4   1.806  0.07645 .  
## monthJUL               7917.3     3032.6   2.611  0.01167 *  
## monthJUN               9576.6     3151.7   3.039  0.00366 ** 
## monthMAY              -1603.9     2769.3  -0.579  0.56488    
## monthOCT               -142.2     4455.4  -0.032  0.97465    
## monthSEP                295.5     2969.5   0.099  0.92111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6575 on 54 degrees of freedom
## Multiple R-squared:  0.5128, Adjusted R-squared:  0.4045 
## F-statistic: 4.736 on 12 and 54 DF,  p-value: 3.158e-05

Comparing performance of models

# Metrics for model 1 - 'date'
glance(dodgers_data_date_mod) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
# Metrics for model 2 - 'day of week'
glance(dodgers_data_dow_model) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
# Metrics for model 3 - 'day of week' and 'month'
glance(dodgers_data_dow_month_mod) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)

Regression model using testing data

# Regression model using 'day of week' and 'month'
dodgers_data_dow_month_mod <- lm(attend ~ day_of_week + month, data = dodgers_test)
summary(dodgers_data_dow_month_mod)
## 
## Call:
## lm(formula = attend ~ day_of_week + month, data = dodgers_test)
## 
## Residuals:
##          1          2          3          4          5          6 
##  4.561e+03 -4.561e+03 -3.411e-13  6.629e+03 -6.629e+03 -4.547e-13 
##          7          8          9         10         11         12 
## -4.093e-12 -4.561e+03  4.561e+03  6.629e+03 -4.561e+03 -2.068e+03 
##         13         14 
##  2.068e+03 -2.068e+03 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           43344.4    14614.5   2.966   0.0974 .
## day_of_weekMonday       793.1    17493.7   0.045   0.9680  
## day_of_weekSaturday     269.7    12875.0   0.021   0.9852  
## day_of_weekSunday      -424.5    12109.4  -0.035   0.9752  
## day_of_weekThursday  -16146.9    17493.7  -0.923   0.4534  
## day_of_weekTuesday     8094.8    15565.1   0.520   0.6549  
## day_of_weekWednesday   5044.1    13597.5   0.371   0.7463  
## monthAUG              -3654.6    13830.0  -0.264   0.8163  
## monthJUL             -12994.3    13120.2  -0.990   0.4264  
## monthJUN              21808.6    20821.6   1.047   0.4048  
## monthMAY               -424.4    13597.5  -0.031   0.9779  
## monthSEP              -5245.1    13597.5  -0.386   0.7369  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11150 on 2 degrees of freedom
## Multiple R-squared:  0.6529, Adjusted R-squared:  -1.256 
## F-statistic: 0.342 on 11 and 2 DF,  p-value: 0.9041
# Metrics for model 3 - 'day of week' and 'month' (using testing data set)
glance(dodgers_data_dow_month_mod) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)

Conclusion:

Based on the summaries of these regression models, month and day of the week are the best predictors of attendance at a Dodgers game in 2012. The model using day of the week and month (model 3) had the lowest AIC and BIC scores, the highest R2, and the lowest F-statistic p.value, making it more statistically significant. Running the same model using the testing data also proved successful, with a high R2 and low p.value. All in all, in order to improve attendance, it would be most effective to run a marketing promotion on Tuesdays in June. This way, youโ€™re most likely to reach the largest audience for the promotion.