File: Project-HorseRacing-HongKong.ipynb
Name: Corinne Medeiros
Date: 5/30/20
Description: Horse Racing in Hong Kong - Graph Analysis, Dimensionality and Feature Reduction, Model Evaluation and Selection to predict which horses will win.

Analyzing Hong Kong horse racing data to predict which horses will win

[Image: horse race. Photo by Mathew Schwartz on Unsplash]

Narrative:

For this project, I’m using Hong Kong horse racing data from Kaggle.com (https://www.kaggle.com/gdaley/hkracing) to predict which kinds of horses win races. Factors to be considered are the horse’s age, weight, type, and country of origin. The type variable comprises the sex- and age-related categories of a horse, specifically 'Gelding', 'Mare', 'Horse', 'Rig', 'Colt', and 'Filly' (Daley, 2019).

Horse racing is a giant industry in Hong Kong, with “betting pools bigger than all US racetracks combined” (Daley, 2019). Predicting wins could potentially lead to major financial gain for those interested in placing bets. Although I don’t necessarily condone horse racing, by analyzing the data I can hopefully bring more awareness to the subject and encourage discussions about it.

Part 1: Graph Analysis

Load and preview data
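As a reproducible stand-in for this step (the actual Kaggle file isn't bundled here), the sketch below builds a tiny synthetic frame with the same column names used later in the notebook and previews it the usual pandas way; the values themselves are made up.

```python
import pandas as pd

# Synthetic stand-in for the Kaggle race data: real column names, toy values.
df = pd.DataFrame({
    "won": [1, 0, 0, 1, 0],
    "horse_age": [3, 3, 4, 5, 3],
    "horse_country": ["AUS", "NZ", "AUS", "GB", None],
    "horse_type": ["Gelding", "Mare", "Gelding", "Colt", "Gelding"],
    "declared_weight": [1100.0, 1050.0, 1210.0, 990.0, 1130.0],
    "win_odds": [3.5, 12.0, 50.0, 6.1, 8.9],
})

print(df.head())        # preview the first rows
print(df.isna().sum())  # missing data shows up as NaN
```

In the real notebook this would be `pd.read_csv(...)` on the downloaded dataset instead of a hand-built frame.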

Observations so far

a. Won is represented as a 1 (won) or 0 (otherwise)
b. Missing data is represented as “NaN”
c. The Won variable will be the “target” and the other variables will be the “features”

Data cleanup & summaries

Questions that might help predict which horses will win:

a. What do the variables look like? For example, are they numerical or categorical data? If they are numerical, what are their distributions; if they are categorical, how many observations fall into each category?

b. Are the numerical variables correlated?

c. Is the winning rate different for different types of horses? For example, were horses more likely to win if they were younger, or a gelding vs. a filly?

d. Are there different winning rates for different countries? For example, did more horses from Australia win than horses from New Zealand?

Data summary information

Conclusions based on data summaries

Looking at the descriptive summary of the data, I can tell that most racehorses fall within a narrow age range (the percentiles are very similar), so there won't be much variety there. I can also see that horse type and country have a small number of unique values, which makes them a good fit for bar charts.
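The summaries behind these conclusions come from pandas' `describe`, which reports percentiles for numeric columns and count/unique/top/freq for categoricals. A minimal sketch on synthetic data (the values are made up, the column names are from the dataset):

```python
import pandas as pd

# Toy stand-in for the race data.
df = pd.DataFrame({
    "horse_age": [3, 3, 3, 4, 5, 3],
    "horse_type": ["Gelding", "Gelding", "Mare", "Gelding", "Colt", "Filly"],
    "horse_country": ["AUS", "NZ", "AUS", "AUS", "GB", "NZ"],
    "declared_weight": [1100, 1050, 1210, 990, 1130, 1080],
})

print(df.describe())                  # percentiles for the numeric columns
print(df.describe(include="object"))  # count/unique/top/freq for categoricals
```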

Data Visualization: Histograms

At the start of the race, the majority of horses have a Hong Kong Jockey Club rating of 60. Horse rank in section 1 of the race is fairly uniformly distributed, while the win odds are right-skewed. The combined weight of most horses and their jockeys falls between 1000 lbs and 1200 lbs, and that distribution is approximately normal.
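The weight histogram can be sketched without a plotting backend by binning with NumPy; here the weights are drawn from a synthetic normal distribution (an assumption standing in for the real `declared_weight` column), and the bar heights are printed as text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for declared_weight (horse + jockey, in lbs).
weights = rng.normal(loc=1100, scale=60, size=1000)

counts, edges = np.histogram(weights, bins=10)
for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"{lo:7.0f}-{hi:7.0f} | {'#' * (c // 10)}")
```

In the notebook itself, `df["declared_weight"].hist()` produces the same shape as a chart.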

Data Visualization: Bar Charts

From the following bar charts, we can see that the majority of the horses are 3 year old geldings (castrated male horses) from Australia and New Zealand.
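The counts behind bar charts like these come straight from `value_counts`; a small sketch on made-up rows (the distribution is chosen to echo the real one, geldings from AUS/NZ dominating):

```python
import pandas as pd

df = pd.DataFrame({
    "horse_country": ["AUS", "AUS", "NZ", "AUS", "GB", "NZ"],
    "horse_type": ["Gelding", "Gelding", "Gelding", "Mare", "Colt", "Gelding"],
    "horse_age": [3, 3, 3, 4, 3, 5],
})

# value_counts gives the bar heights; .plot(kind="bar") would draw them
print(df["horse_country"].value_counts())
print(df["horse_type"].value_counts())
print(df["horse_age"].value_counts())
```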

Correlation: Pearson Ranking charts

The correlation between the variables is low. The results show a slight positive correlation (section 1 position and win odds) and a slight negative correlation (section 1 position and weight), but these values are not significant.
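A Pearson correlation matrix like the one behind this chart is one call in pandas; the sketch below uses independently generated synthetic columns (so the off-diagonal values come out near zero, much like the real result):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
# Independent synthetic columns standing in for the real features.
df = pd.DataFrame({
    "position_sec1": rng.integers(1, 15, n).astype(float),
    "win_odds": rng.uniform(1, 99, n),
    "declared_weight": rng.normal(1100, 60, n),
})

corr = df.corr(method="pearson")
print(corr.round(2))
```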

Correlation: Spearman's rank & Kendall's rank

Since some of my variables are ordinal and don't have normal distributions, I'll also compute Spearman's rank correlation and Kendall’s rank correlation.

I'll check for correlation between horse_rating (the rating number assigned by HKJC at the time of the race), position_sec1 (position of this horse in section 1 of the race), and win_odds (win odds for this horse at start of race).

Based on these calculations, we can confirm that there is some negative correlation between horse_rating and position_sec1, but it's very small. Also, horse_rating and win_odds are uncorrelated.
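Both rank correlations are available in `scipy.stats`; the sketch below fabricates a rating column and a section-1 position that loosely (negatively) tracks it, which is the pattern described above. The data-generating choices are assumptions for illustration.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(2)
n = 150
horse_rating = rng.integers(40, 100, n)
# Rank horses so position loosely falls as rating rises (plus noise).
position_sec1 = np.argsort(np.argsort(-horse_rating + rng.normal(0, 30, n))) + 1

rho, p_s = spearmanr(horse_rating, position_sec1)
tau, p_k = kendalltau(horse_rating, position_sec1)
print(f"Spearman rho = {rho:.3f} (p = {p_s:.3g})")
print(f"Kendall tau  = {tau:.3f} (p = {p_k:.3g})")
```

`df.corr(method="spearman")` gives the same Spearman numbers as a whole matrix.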

Data Visualization: Parallel Coordinates

With Parallel Coordinates we are able to compare the distributions of numerical variables between horses that won and those that did not win.

Horses with a higher rating appear to have a higher chance of winning. The rest of the graph is quite dense even with the smaller sample size; higher weight might also mean a better chance of winning, but it's hard to tell.
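A minimal parallel-coordinates sketch with `pandas.plotting.parallel_coordinates` on synthetic rows (the data and the min-max scaling step are assumptions; scaling keeps the weight axis from swamping the others, which is likely why the real plot is hard to read):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(3)
n = 60
df = pd.DataFrame({
    "horse_rating": rng.integers(40, 100, n).astype(float),
    "declared_weight": rng.normal(1100, 60, n),
    "win_odds": rng.uniform(1, 50, n),
    "won": rng.integers(0, 2, n),
})

# Scale each numeric column to [0, 1] so no single axis dominates.
for col in ["horse_rating", "declared_weight", "win_odds"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

ax = parallel_coordinates(df, class_column="won", colormap="coolwarm")
ax.figure.savefig("parallel_coordinates.png")
```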

Stacked Bar Charts

Using stacked bar charts we can compare horses that won to horses that didn’t win based on other variables.

Horses from Australia won the most, with New Zealand close behind. More geldings won than others. Also, horses that were age 3 won the most.
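The table feeding a stacked bar chart is a cross-tabulation of the target against a categorical; a sketch on made-up rows (counts are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "horse_country": ["AUS", "AUS", "NZ", "NZ", "GB", "AUS", "NZ", "AUS"],
    "won":           [1,     0,     1,    0,    0,    1,     0,    0],
})

table = pd.crosstab(df["horse_country"], df["won"])
print(table)
# table.plot(kind="bar", stacked=True) would render the stacked bars
```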

Part 2: Dimensionality and Feature Reduction

The features I will drop are: "race_id", "horse_no", "horse_id", "trainer_id", "jockey_id", and "horse_gear". (The ID columns don't carry predictive information, and "horse_gear" has too many unique combinations.)

We can also fill in missing values. Since I filled in 2 missing values for horse_type and horse_country earlier with "Unknown", I am going to replace those "Unknown" values with the most common values.
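Both steps, dropping the ID columns and swapping "Unknown" for the most common value, can be sketched like this (toy rows; the mode-based fill mirrors the description above):

```python
import pandas as pd

df = pd.DataFrame({
    "race_id": [1, 2, 3],
    "horse_id": [10, 11, 12],
    "horse_type": ["Gelding", "Unknown", "Gelding"],
    "horse_country": ["AUS", "NZ", "Unknown"],
    "win_odds": [3.0, 7.5, 20.0],
})

df = df.drop(columns=["race_id", "horse_id"])  # IDs carry no signal

# Replace the "Unknown" placeholder with the most common real value.
for col in ["horse_type", "horse_country"]:
    mode = df.loc[df[col] != "Unknown", col].mode()[0]
    df[col] = df[col].replace("Unknown", mode)

print(df)
```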

Log Transformation for highly skewed data

If you go back and look at the histograms of win_odds, you’ll see that it is very skewed… many low odds, not very many high odds.

Since the win_odds variable is highly skewed, I'm going to apply a log transformation.
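The transformation itself is one NumPy call; `log1p` computes log(1 + x), which behaves well for small odds. On a small right-skewed toy series (the values are made up) the skewness drops visibly:

```python
import numpy as np
import pandas as pd

# Toy right-skewed odds: many low values, a few very high ones.
win_odds = pd.Series([1.5, 2.0, 3.5, 8.0, 15.0, 40.0, 99.0])

win_odds_log1p = np.log1p(win_odds)  # log(1 + x)
print("skew before:", round(float(win_odds.skew()), 2))
print("skew after: ", round(float(win_odds_log1p.skew()), 2))
```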

Converting categorical data into numbers (Country, Type)
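One common way to do this conversion is one-hot encoding with `pd.get_dummies`, sketched here on toy rows (whether the notebook used dummies or label codes isn't stated, so treat this as one reasonable option):

```python
import pandas as pd

df = pd.DataFrame({
    "horse_country": ["AUS", "NZ", "AUS"],
    "horse_type": ["Gelding", "Filly", "Gelding"],
})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["horse_country", "horse_type"])
print(encoded.columns.tolist())
```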

Random Forest Classifier

I chose a Random Forest Classifier for its interpretability and because I'm predicting a binary outcome. First, I'll remove the columns that don't contain useful information.

After calculating and visualizing the features in order of importance, I can see that ‘declared_weight’ is the most important feature, followed by 'win_odds_log1p', 'actual_weight', 'horse_rating', and 'position_sec1'.
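A minimal version of the importance calculation, using the same feature names but synthetic data; the label is deliberately constructed to depend mostly on `win_odds_log1p`, so one feature dominates the ranking (in the real run, `declared_weight` came out on top):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n = 300
X = pd.DataFrame({
    "declared_weight": rng.normal(1100, 60, n),
    "win_odds_log1p": rng.normal(2.5, 0.8, n),
    "actual_weight": rng.normal(125, 5, n),
    "horse_rating": rng.integers(40, 100, n).astype(float),
    "position_sec1": rng.integers(1, 15, n).astype(float),
})
# Synthetic label driven mostly by one feature, so it ranks first.
y = (X["win_odds_log1p"] + rng.normal(0, 0.3, n) < 2.0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name:16s} {imp:.3f}")
```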

Part 3 - Model Evaluation & Selection

Training - Splitting data into training and testing

Evaluation

We are trying to predict if a horse has won or not so this is a classification problem. I'm going to use logistic regression.
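The split-then-fit pattern can be sketched on synthetic data; the imbalance (roughly one winner per several runners) is an assumption meant to mimic race data, and `stratify=y` keeps the class ratio consistent across the split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 1000
X = rng.normal(size=(n, 3))
# Imbalanced synthetic target: winning is the rare outcome.
y = (X[:, 0] + rng.normal(0, 1, n) > 1.6).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = LogisticRegression().fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

Note that with rare winners, accuracy alone is misleading, which is why the metrics below matter.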

Metrics for the evaluation:

i. Confusion Matrix
ii. Precision, Recall & F1 score
iii. ROC curve

i. Confusion Matrix

Since the diagonal doesn't include the largest values, we can conclude that Logistic Regression is having a difficult time effectively modeling the horse racing data.
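For reference, a confusion matrix on hard-coded toy predictions for an imbalanced problem (few true winners), showing how to read the cells:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 7 losers, 3 winners; the model catches only 1 winner.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: actual 0/1, columns: predicted 0/1
tn, fp, fn, tp = cm.ravel()
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
```

A good classifier concentrates its counts on the diagonal (tn and tp); here most winners land in the false-negative cell.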

ii. Precision, Recall & F1 score

The results, high precision and recall for the majority (losing) class but low scores for the minority (winning) class, confirm that the model is not effective: it mostly predicts the majority class. This is likely due to the imbalanced nature of the data, and suggests that another choice of model, or adjusted class-weight hyperparameters, could do better.
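The per-class pattern is easy to reproduce on toy labels: with 90 losers and 10 winners and a model that rarely predicts "won", the minority-class scores collapse while the majority class looks fine.

```python
from sklearn.metrics import classification_report, precision_score, recall_score

y_true = [0] * 90 + [1] * 10
# Model predicts "won" only 5 times: 2 false alarms, 3 correct.
y_pred = [0] * 88 + [1] * 2 + [1] * 3 + [0] * 7

print(classification_report(y_true, y_pred, target_names=["lost", "won"]))
print("winner precision:", precision_score(y_true, y_pred))  # 3 / 5
print("winner recall:   ", recall_score(y_true, y_pred))     # 3 / 10
```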

iii. ROC curve

The dotted line represents random guessing, so anything above it beats chance; the closer the curve gets to the top-left corner, the better the model. From this visualization the model appears to perform well, but we know the classes are imbalanced, so there is definitely bias. With more data on winning horses, for example, we might build a better model.
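The curve and its summary number (AUC) come from `sklearn.metrics`; a sketch on hand-made scores, where 0.5 corresponds to random guessing and 1.0 to a perfect ranking:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted win probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")  # 0.5 = chance, 1.0 = perfect
```

Note that AUC is insensitive to class balance in a way accuracy is not, which is part of why the ROC curve can look healthy here even though the confusion matrix does not.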

References:

Daley, G. (2019, November 17). Horse Racing in HK. Kaggle. Retrieved from https://www.kaggle.com/gdaley/hkracing

Keith Prowse. (2018, May 16). Off to the races: A horse racing glossary. Retrieved from https://www.keithprowse.co.uk/news-and-blog/2018/05/16/off-to-the-races---a-horse-racing-glossary/