File: Hotel-Recommendations.ipynb
Names: Corinne Medeiros
Date: 10/18/20
Usage: Previews and summarizes the Expedia Hotel Recommendations data, generates exploratory visualizations, and fits predictive models to predict hotel groups based on user data.

Creating Optimal Hotel Recommendations

Objective: Predict which “hotel cluster” the user is likely to book, given their search details.

Data source:
Expedia Hotel Recommendations
https://www.kaggle.com/c/expedia-hotel-recommendations

Loading and Exploring Data

To understand the data, I reviewed the columns and descriptions provided by the Kaggle data overview tab:

https://www.kaggle.com/c/expedia-hotel-recommendations/data?select=train.csv

test_train_columns_1.png

test_train_columns_2.png

Loading Data

The dataset is very large, with over 37 million observations, so I will only load a smaller subset.

Besides the target variable of hotel_cluster, the columns I'm going to explore are user_id, is_package, site_name, user_location_country, hotel_continent, srch_adults_cnt, srch_children_cnt, and srch_destination_id.
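As a minimal sketch of how such a subset could be loaded, assuming pandas: `usecols` restricts the read to the columns listed above and `nrows` caps the number of rows. The helper name `load_subset`, the default of 100,000 rows, and the tiny in-memory sample are illustrative assumptions, not the values used in this notebook.

```python
import io

import pandas as pd

# Columns named in the text above, plus the target.
COLUMNS = [
    "user_id", "is_package", "site_name", "user_location_country",
    "hotel_continent", "srch_adults_cnt", "srch_children_cnt",
    "srch_destination_id", "hotel_cluster",
]


def load_subset(source, n_rows=100_000):
    """Read only the first n_rows rows and the columns of interest.

    n_rows is a hypothetical default; pick whatever fits in memory.
    """
    return pd.read_csv(source, usecols=COLUMNS, nrows=n_rows)


# Made-up stand-in for train.csv, with one extra column that gets dropped.
sample_csv = io.StringIO(
    "user_id,is_package,site_name,user_location_country,hotel_continent,"
    "srch_adults_cnt,srch_children_cnt,srch_destination_id,hotel_cluster,cnt\n"
    "12,0,2,66,2,2,0,8250,1,3\n"
    "12,1,2,66,2,2,0,8250,1,1\n"
    "93,0,2,66,6,1,1,14984,80,1\n"
)
subset = load_subset(sample_csv, n_rows=2)
print(subset.shape)
```

With the real file, the call would be something like `load_subset("train.csv")`, assuming the file sits in the working directory.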

Exploratory Visualizations

Although the x-axis in this bar graph is too crowded to read, the graph and the quartile summary statistics both indicate that the vast majority of users in this subset come from country 66. I will confirm this with further plotting next.

This may be a bias introduced from only selecting a subset, so for future exploration I could try selecting another subset, or loading all of the data in chunks in order to see if the data represent a more diverse sample. For the purposes of this assignment and learning, I'm going to stick with this smaller subset.
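The chunked approach mentioned above could look roughly like this. It is a sketch only: the function name `country_counts_in_chunks` and the chunk size are assumptions, and the in-memory sample stands in for the full file. Passing `chunksize` to `pd.read_csv` returns an iterator of DataFrames, so only one chunk is held in memory at a time.

```python
import io
from collections import Counter

import pandas as pd


def country_counts_in_chunks(source, chunk_size=1_000_000):
    """Tally user_location_country over a whole file, one chunk at a time."""
    totals = Counter()
    reader = pd.read_csv(
        source, usecols=["user_location_country"], chunksize=chunk_size
    )
    for chunk in reader:
        # Add this chunk's per-country counts to the running totals.
        totals.update(chunk["user_location_country"].value_counts().to_dict())
    return pd.Series(totals, dtype="int64").sort_values(ascending=False)


# Made-up sample; a chunk size of 2 forces several iterations.
sample_csv = io.StringIO("user_location_country\n66\n66\n205\n66\n3\n")
totals = country_counts_in_chunks(sample_csv, chunk_size=2)
print(totals)
```

Comparing these full-file counts with the subset's counts would show whether the dominance of country 66 is a property of the data or of the subset.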

After limiting the data to the top ten country count values, we can clearly confirm that our users mostly come from country 66.
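One way to limit the data to the ten most frequent countries is sketched below, using made-up counts in which country 66 dominates. `value_counts()` sorts by descending frequency, so `head(10)` keeps the top ten codes.

```python
import pandas as pd

# Made-up sample: country 66 appears 200 times, countries 1..12 appear
# 1..12 times each.
countries = [66] * 200 + [c for c in range(1, 13) for _ in range(c)]
df = pd.DataFrame({"user_location_country": countries})

counts = df["user_location_country"].value_counts()
top_ten = counts.head(10)

# Keep only the rows whose country is among the ten most frequent.
top_df = df[df["user_location_country"].isin(top_ten.index)]

# Share of all rows that come from country 66.
share_66 = counts.loc[66] / len(df)
print(top_ten)
print(round(share_66, 2))
```

Plotting `top_ten` as a bar chart then gives a readable x-axis with only ten labels.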

Interpreting this box plot is difficult because the data are not represented very well. Hotel cluster is more of a discrete categorical variable and this treats it as continuous, which isn't very helpful. We can see that continent 0 represents a wider range of hotel clusters while continent 1 represents a smaller range, but we don't have enough information on the hotel clusters themselves to make this insight useful. I'm going to try looking at frequency of hotel clusters instead.

From this bar chart we can see that hotel clusters 91 and 41 are the most frequent groups, and the least common group is cluster 74.

Checking Correlation

I'm going to calculate correlation to get a sense of the relationships between some of the variables, which will help in data understanding and determining which predictive models might be most effective.

It looks like the strongest correlation is a positive relationship of about 0.25 between hotel continent and site name, which is weak, and the other relationships are weaker still. Correlation coefficients measure the strength of a linear relationship rather than statistical significance, but values this low suggest that multicollinearity is unlikely to be a concern when choosing predictive models.
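A rough sketch of how the strongest pairwise correlation can be picked out of a correlation matrix. The data here are randomly generated, with a dependence between `site_name` and `hotel_continent` built in purely for the demo, so the coefficient it prints says nothing about the Expedia data.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
site_name = rng.integers(0, 50, n)
df = pd.DataFrame({
    "site_name": site_name,
    # Made-up dependence so that one pair stands out in the demo.
    "hotel_continent": site_name // 10 + rng.integers(0, 5, n),
    "is_package": rng.integers(0, 2, n),
    "srch_adults_cnt": rng.integers(1, 5, n),
})

corr = df.corr()  # Pearson correlation by default

# Visit each unordered pair of columns once, skipping the diagonal.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
strongest_pair = max(pairs, key=lambda p: abs(pairs[p]))
print(strongest_pair, round(pairs[strongest_pair], 2))
```

Since columns such as `site_name` and `hotel_continent` are ID codes rather than measurements, Pearson coefficients between them are best read as a rough screening tool.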

Predictive Modeling

Since we are trying to predict the unique hotel cluster, we are dealing with a multi-class classification problem. First, I will look at how many hotel clusters exist.

Our target variable, hotel_cluster, consists of 100 unique values.

Splitting 'hotels_train_df' into train and test set
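A minimal sketch of this split with scikit-learn's `train_test_split`. The 25% test fraction, the `random_state`, and the randomly generated stand-in for `hotels_train_df` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for hotels_train_df (made-up values, 5 clusters only).
rng = np.random.default_rng(0)
hotels_train_df = pd.DataFrame({
    "is_package": rng.integers(0, 2, 200),
    "srch_adults_cnt": rng.integers(1, 5, 200),
    "srch_destination_id": rng.integers(0, 20, 200),
    "hotel_cluster": rng.integers(0, 5, 200),
})

# Separate the features from the target before splitting.
X = hotels_train_df.drop(columns="hotel_cluster")
y = hotels_train_df["hotel_cluster"]

# Hold out 25% of the rows for testing (an assumed fraction).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs.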

Random Forest Classifier

I chose a random forest classifier because an ensemble of decision trees is usually more accurate and less prone to overfitting than a single tree, and because it handles larger amounts of data reasonably well.
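Fitting such a model could look like the sketch below. The hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions rather than the settings used in this notebook, and the synthetic target is a simple function of one feature so that the forest has something to learn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made-up values).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "is_package": rng.integers(0, 2, n),
    "srch_adults_cnt": rng.integers(1, 5, n),
    "srch_destination_id": rng.integers(0, 20, n),
})
# Made-up rule: the cluster depends only on the destination id.
y = X["srch_destination_id"] % 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Each tree is trained on a bootstrap sample; predictions are a majority vote.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(round(accuracy, 2))
```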

Evaluating Random Forest Classifier

In these two views of the confusion matrix, we can see a good number of high values along the diagonal, which is a good sign because the diagonal cells count correct predictions. However, in the larger version we can also see high values scattered across the off-diagonal cells, which means there are a lot of incorrect predictions.
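To make the reading of the matrix concrete, here is a small sketch with hand-made labels (three classes instead of 100). Rows are true classes and columns are predicted classes, so the diagonal sums to the number of correct predictions and everything off the diagonal is an error.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

correct = np.trace(cm)        # sum of the diagonal cells
incorrect = cm.sum() - correct  # everything off the diagonal
accuracy = correct / cm.sum()

print(correct, incorrect)
print(np.isclose(accuracy, accuracy_score(y_true, y_pred)))
```

The same arithmetic applies to the 100 x 100 matrix: the ratio of the diagonal sum to the total is the accuracy.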

Naive Bayes Classifier

I'm going to try a Naive Bayes Classifier next, since the weak correlations among my features are at least roughly in line with its independence assumption, and because it tends to perform well with multiple classes.
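A minimal Naive Bayes sketch follows. The text does not say which variant was used, so `GaussianNB` here is an assumption; it models each feature as normally distributed within each class. The synthetic data are drawn to match that assumption, which is not necessarily true of ID-like columns in the Expedia data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data: three classes whose feature means are 0, 5 and 10.
rng = np.random.default_rng(0)
classes = np.repeat([0, 1, 2], 100)
X = np.column_stack([
    rng.normal(loc=classes * 5.0, scale=1.0),
    rng.normal(loc=classes * 5.0, scale=1.0),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, classes, test_size=0.25, random_state=42
)

# GaussianNB treats the features as independent given the class.
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

nb_pred = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_pred)
print(round(nb_accuracy, 2))
```

For integer-coded categorical features, scikit-learn's `CategoricalNB` may be a closer fit than the Gaussian variant.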

Evaluating Naive Bayes Classifier

Results

Overall, my predictive models performed quite poorly. The Random Forest Classifier reached 22% accuracy and the Naive Bayes Classifier only 5%. The highest precision score from the Random Forest Classifier was 91% for hotel cluster 74, but most of the others were very low. To improve predictive power, I think it would help to have more information on what the attributes represent. For example, it would be useful to know how the hotel clusters are determined and which locations correspond to the country and continent numbers, which might make the results more interpretable. In addition, I could experiment with different combinations of features and different model parameters. Finally, I could try building different ensembles of models to achieve better accuracy and interpretability.
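The gap between a high precision for one cluster and a low overall accuracy can be illustrated with a small hand-made example. Precision for a class is the fraction of predictions of that class that were correct, so a rarely predicted class can score very high while the model as a whole does much worse. The labels below are invented and only reuse the cluster numbers from the text.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hand-made labels for illustration only.
y_true = [74, 91, 91, 41, 41, 91]
y_pred = [74, 91, 41, 41, 91, 91]

labels = sorted(set(y_true))
# average=None returns one precision value per label, in label order.
per_class = precision_score(
    y_true, y_pred, labels=labels, average=None, zero_division=0
)
precision_by_cluster = dict(zip(labels, per_class))

best_cluster = max(precision_by_cluster, key=precision_by_cluster.get)
accuracy = accuracy_score(y_true, y_pred)

print(precision_by_cluster)
print(best_cluster, round(accuracy, 2))
```

Here cluster 74 is predicted once and correctly, giving it perfect precision, while overall accuracy is only 4 out of 6.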