File: Customizing-Travel_EDA.ipynb
Name: Corinne Medeiros
Date: 7/18/21
Desc: Customizing Travel Based on User Ratings EDA
Usage: Program imports and cleans data, and generates charts.

Customizing Travel Based on User Ratings

Exploratory Data Analysis (EDA) in Python

Data Source

Travel Reviews Data Set
https://archive.ics.uci.edu/ml/datasets/Travel+Reviews#

This dataset from the UCI Machine Learning Repository contains one CSV file with data from TripAdvisor.com reviews of destinations in East Asia. There are 980 observations, and each user has an average rating in each of 10 categories: art galleries, dance clubs, juice bars, restaurants, museums, resorts, parks, beaches, theaters, and religious institutions. Ratings are on a scale of Excellent (4), Very Good (3), Average (2), Poor (1), and Terrible (0).

Loading Data
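
As a rough sketch, loading the file with pandas could look like the following. The `load_reviews` helper, the `User ID` / `Category N` headers, and the tiny in-memory sample are assumptions for illustration; in practice the path to the downloaded CSV would be passed instead.

```python
import io

import pandas as pd


def load_reviews(source):
    """Read the travel reviews CSV from a path, URL, or file-like object."""
    return pd.read_csv(source)


# Made-up stand-in for the real file, using assumed column headers.
sample_csv = io.StringIO(
    "User ID,Category 1,Category 2\n"
    "User 1,0.93,1.80\n"
    "User 2,1.02,2.20\n"
)

reviews = load_reviews(sample_csv)

# Quick look at the size and column types before cleaning.
print(reviews.shape)
print(reviews.dtypes)
```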

Cleaning Data

I'll use the following information from the source to change the column names to be more descriptive.

Attribute information:

Attribute 1 : Unique user id
Attribute 2 : Average user feedback on art galleries
Attribute 3 : Average user feedback on dance clubs
Attribute 4 : Average user feedback on juice bars
Attribute 5 : Average user feedback on restaurants
Attribute 6 : Average user feedback on museums
Attribute 7 : Average user feedback on resorts
Attribute 8 : Average user feedback on parks/picnic spots
Attribute 9 : Average user feedback on beaches
Attribute 10 : Average user feedback on theaters
Attribute 11 : Average user feedback on religious institutions
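
One way to apply the attribute list above is to replace the column names by position. The new names below are my own hypothetical choices, and the raw `User ID` / `Category N` headers are assumed rather than taken from the file.

```python
import pandas as pd

# Hypothetical descriptive names, in the same order as Attributes 1 to 11.
COLUMN_NAMES = [
    "user_id", "art_galleries", "dance_clubs", "juice_bars", "restaurants",
    "museums", "resorts", "parks", "beaches", "theaters",
    "religious_institutions",
]


def rename_columns(df):
    """Return a copy of df with descriptive column names applied by position."""
    if len(df.columns) != len(COLUMN_NAMES):
        raise ValueError("expected %d columns, got %d"
                         % (len(COLUMN_NAMES), len(df.columns)))
    renamed = df.copy()
    renamed.columns = COLUMN_NAMES
    return renamed


# One made-up row with the assumed raw headers, just to show the renaming.
raw = pd.DataFrame(
    [["User 1"] + [0.5] * 10],
    columns=["User ID"] + ["Category %d" % i for i in range(1, 11)],
)
cleaned = rename_columns(raw)
print(cleaned.columns.tolist())
```

Renaming by position keeps the sketch independent of the exact raw header text, at the cost of relying on the column order matching the attribute list.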

Overall, this dataset is pretty clean, with no missing values and no apparent outliers.
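
A quick way to back up a statement like this is to count missing values and flag points outside the usual 1.5 × IQR fences. This is only one possible check, not necessarily the one used in this notebook, and the demo values are made up (with one deliberately extreme value) to show what the functions return.

```python
import pandas as pd


def missing_counts(df):
    """Number of missing values in each column."""
    return df.isna().sum()


def iqr_outlier_counts(df, k=1.5):
    """Count values per numeric column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Made-up ratings; the 9.0 in "parks" is an intentionally impossible value.
demo = pd.DataFrame({
    "beaches": [2.8, 2.9, 3.0, 2.7, 2.8],
    "parks": [3.1, 3.2, 3.0, 3.1, 9.0],
})

print(missing_counts(demo))
print(iqr_outlier_counts(demo))
```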

Data Visualization

Heatmaps

Observations so far

My initial conclusion was that dance club fans also tend to enjoy parks and beaches, but looking back at the dataframe summary, we can see that all users rated beaches and parks very highly. It might be helpful to look at restaurants and art galleries, which were both rated low by dance club fans.

We can see in this heatmap of Restaurant lovers that Museums and Art Galleries are consistently rated low. Dance Clubs are also toward the lower end of the scale.
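
For reference, a heatmap of one group of "fans" could be sketched as below. How a fan is defined here (rating at or above the 75th percentile for that category) is my assumption, as are the helper names and the made-up demo values; the sketch uses plain matplotlib rather than any particular heatmap library.

```python
import matplotlib.pyplot as plt
import pandas as pd


def top_fans(df, category, quantile=0.75):
    """Rows whose rating in `category` is at or above the given quantile."""
    threshold = df[category].quantile(quantile)
    return df[df[category] >= threshold]


def plot_rating_heatmap(df, title):
    """Draw users as rows and rating categories as columns, colored by rating."""
    numeric = df.select_dtypes("number")
    fig, ax = plt.subplots(figsize=(8, 6))
    image = ax.imshow(numeric.to_numpy(), aspect="auto", cmap="viridis",
                      vmin=0, vmax=4)
    ax.set_xticks(range(len(numeric.columns)))
    ax.set_xticklabels(numeric.columns, rotation=45, ha="right")
    ax.set_ylabel("user (row)")
    ax.set_title(title)
    fig.colorbar(image, ax=ax, label="average rating (0 to 4)")
    fig.tight_layout()
    return ax


# Made-up ratings for four users, only to exercise the functions.
demo = pd.DataFrame({
    "restaurants": [0.5, 0.6, 1.9, 2.1],
    "museums": [1.0, 0.9, 0.4, 0.3],
    "art_galleries": [0.9, 1.1, 0.5, 0.4],
})

fans = top_fans(demo, "restaurants")
ax = plot_rating_heatmap(fans, "Restaurant lovers")
```

Fixing the color scale to the 0 to 4 rating range keeps heatmaps for different groups comparable.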

At this point I'm curious about the general relationships of ratings across all categories, so I'm going to create a pairs plot and also explore some visualizations in Tableau.

Pairs Plot

This pairs plot helps reveal patterns and trends across all the data. We can quickly see the distributions of variables and the relationships between variables. There don't appear to be many strong relationships, but Museum and Resort ratings have a slight positive correlation.
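
To put numbers next to what the pairs plot shows, the pairwise Pearson correlations can be ranked by absolute value. The sketch below uses pandas' `scatter_matrix` as a stand-in for the pairs plot; the `strongest_pairs` helper and the demo values are made up for illustration and do not reflect the real ratings.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix


def strongest_pairs(df, top_n=3):
    """Return the top_n column pairs ranked by absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
    return ranked.head(top_n)


# Made-up values in which museums and resorts move together.
demo = pd.DataFrame({
    "museums": [1.0, 2.0, 3.0, 4.0, 5.0],
    "resorts": [1.1, 1.9, 3.2, 3.9, 5.1],
    "beaches": [3.0, 1.0, 2.0, 3.0, 1.0],
})

# Scatter plots off the diagonal, histograms on the diagonal.
axes = scatter_matrix(demo, figsize=(6, 6), diagonal="hist")

top = strongest_pairs(demo, top_n=1)
print(top)
```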

With a better understanding of the data, I'll save the cleaned dataframe and move into R for clustering analysis.
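
Saving the dataframe so that R can read it might look like this. The file name `travel_reviews_clean.csv` and the `save_for_r` helper are hypothetical, and a temporary directory with made-up rows stands in for the real output location and data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_for_r(df, path):
    """Write df to CSV without the pandas index so R sees only data columns."""
    df.to_csv(path, index=False)
    return Path(path)


# Made-up rows standing in for the cleaned dataframe.
demo = pd.DataFrame({
    "user_id": ["User 1", "User 2"],
    "museums": [0.5, 1.25],
    "resorts": [1.5, 2.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = save_for_r(demo, Path(tmp) / "travel_reviews_clean.csv")
    # Read the file back to confirm nothing was lost on the way out.
    round_trip = pd.read_csv(out_path)

print(round_trip)
```

Dropping the index avoids an extra unnamed column showing up when the file is read with `read.csv` in R.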