File: Customizing-Travel_EDA.ipynb
Name: Corinne Medeiros
Date: 7/18/21
Desc: Customizing Travel Based on User Ratings EDA
Usage: Program imports and cleans data, and generates charts.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import style
import matplotlib.cm as cm
import seaborn as sns
Travel Reviews Data Set
https://archive.ics.uci.edu/ml/datasets/Travel+Reviews#
This dataset from the UCI Machine Learning Repository contains one CSV file of TripAdvisor.com reviews of destinations in East Asia. There are 980 observations, one per user, and each user has an average rating in each of 10 categories: art galleries, dance clubs, juice bars, restaurants, museums, resorts, parks, beaches, theaters, and religious institutions. Ratings are on a scale of Excellent (4), Very Good (3), Average (2), Poor (1), and Terrible (0).
# Loading user ratings data
travel_df = pd.read_csv('tripadvisor_review.csv')
# Data summary
travel_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 980 entries, 0 to 979
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   User ID      980 non-null    object
 1   Category 1   980 non-null    float64
 2   Category 2   980 non-null    float64
 3   Category 3   980 non-null    float64
 4   Category 4   980 non-null    float64
 5   Category 5   980 non-null    float64
 6   Category 6   980 non-null    float64
 7   Category 7   980 non-null    float64
 8   Category 8   980 non-null    float64
 9   Category 9   980 non-null    float64
 10  Category 10  980 non-null    float64
dtypes: float64(10), object(1)
memory usage: 84.3+ KB
# Summary of variables
travel_df.describe()
| | Category 1 | Category 2 | Category 3 | Category 4 | Category 5 | Category 6 | Category 7 | Category 8 | Category 9 | Category 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 |
| mean | 0.893194 | 1.352612 | 1.013306 | 0.532500 | 0.939735 | 1.842898 | 3.180939 | 2.835061 | 1.569439 | 2.799224 |
| std | 0.326912 | 0.478280 | 0.788607 | 0.279731 | 0.437430 | 0.539538 | 0.007824 | 0.137505 | 0.364629 | 0.321380 |
| min | 0.340000 | 0.000000 | 0.130000 | 0.150000 | 0.060000 | 0.140000 | 3.160000 | 2.420000 | 0.740000 | 2.140000 |
| 25% | 0.670000 | 1.080000 | 0.270000 | 0.410000 | 0.640000 | 1.460000 | 3.180000 | 2.740000 | 1.310000 | 2.540000 |
| 50% | 0.830000 | 1.280000 | 0.820000 | 0.500000 | 0.900000 | 1.800000 | 3.180000 | 2.820000 | 1.540000 | 2.780000 |
| 75% | 1.020000 | 1.560000 | 1.572500 | 0.580000 | 1.200000 | 2.200000 | 3.180000 | 2.910000 | 1.760000 | 3.040000 |
| max | 3.220000 | 3.640000 | 3.620000 | 3.440000 | 3.300000 | 3.760000 | 3.210000 | 3.390000 | 3.170000 | 3.660000 |
# Previewing data
travel_df.head(10)
| | User ID | Category 1 | Category 2 | Category 3 | Category 4 | Category 5 | Category 6 | Category 7 | Category 8 | Category 9 | Category 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User 1 | 0.93 | 1.80 | 2.29 | 0.62 | 0.80 | 2.42 | 3.19 | 2.79 | 1.82 | 2.42 |
| 1 | User 2 | 1.02 | 2.20 | 2.66 | 0.64 | 1.42 | 3.18 | 3.21 | 2.63 | 1.86 | 2.32 |
| 2 | User 3 | 1.22 | 0.80 | 0.54 | 0.53 | 0.24 | 1.54 | 3.18 | 2.80 | 1.31 | 2.50 |
| 3 | User 4 | 0.45 | 1.80 | 0.29 | 0.57 | 0.46 | 1.52 | 3.18 | 2.96 | 1.57 | 2.86 |
| 4 | User 5 | 0.51 | 1.20 | 1.18 | 0.57 | 1.54 | 2.02 | 3.18 | 2.78 | 1.18 | 2.54 |
| 5 | User 6 | 0.99 | 1.28 | 0.72 | 0.27 | 0.74 | 1.26 | 3.17 | 2.89 | 1.66 | 3.66 |
| 6 | User 7 | 0.90 | 1.36 | 0.26 | 0.32 | 0.86 | 1.58 | 3.17 | 2.66 | 1.22 | 3.22 |
| 7 | User 8 | 0.74 | 1.40 | 0.22 | 0.41 | 0.82 | 1.50 | 3.17 | 2.81 | 1.54 | 2.88 |
| 8 | User 9 | 1.12 | 1.76 | 1.04 | 0.64 | 0.82 | 2.14 | 3.18 | 2.79 | 1.41 | 2.54 |
| 9 | User 10 | 0.70 | 1.36 | 0.22 | 0.26 | 1.50 | 1.54 | 3.17 | 2.82 | 2.24 | 3.12 |
# Setting User ID as index
travel_df = travel_df.set_index('User ID')
I'll use the following information from the source to change the column names to be more descriptive.
Attribute information:
Attribute 1 : Unique user id
Attribute 2 : Average user feedback on art galleries
Attribute 3 : Average user feedback on dance clubs
Attribute 4 : Average user feedback on juice bars
Attribute 5 : Average user feedback on restaurants
Attribute 6 : Average user feedback on museums
Attribute 7 : Average user feedback on resorts
Attribute 8 : Average user feedback on parks/picnic spots
Attribute 9 : Average user feedback on beaches
Attribute 10 : Average user feedback on theaters
Attribute 11 : Average user feedback on religious institutions
# Renaming columns
travel_df.rename(columns={'Category 1': 'Art Galleries',
                          'Category 2': 'Dance Clubs',
                          'Category 3': 'Juice Bars',
                          'Category 4': 'Restaurants',
                          'Category 5': 'Museums',
                          'Category 6': 'Resorts',
                          'Category 7': 'Parks & Picnic Spots',
                          'Category 8': 'Beaches',
                          'Category 9': 'Theaters',
                          'Category 10': 'Religious Institutions'}, inplace=True)
travel_df.head(10)
| User ID | Art Galleries | Dance Clubs | Juice Bars | Restaurants | Museums | Resorts | Parks & Picnic Spots | Beaches | Theaters | Religious Institutions |
|---|---|---|---|---|---|---|---|---|---|---|
| User 1 | 0.93 | 1.80 | 2.29 | 0.62 | 0.80 | 2.42 | 3.19 | 2.79 | 1.82 | 2.42 |
| User 2 | 1.02 | 2.20 | 2.66 | 0.64 | 1.42 | 3.18 | 3.21 | 2.63 | 1.86 | 2.32 |
| User 3 | 1.22 | 0.80 | 0.54 | 0.53 | 0.24 | 1.54 | 3.18 | 2.80 | 1.31 | 2.50 |
| User 4 | 0.45 | 1.80 | 0.29 | 0.57 | 0.46 | 1.52 | 3.18 | 2.96 | 1.57 | 2.86 |
| User 5 | 0.51 | 1.20 | 1.18 | 0.57 | 1.54 | 2.02 | 3.18 | 2.78 | 1.18 | 2.54 |
| User 6 | 0.99 | 1.28 | 0.72 | 0.27 | 0.74 | 1.26 | 3.17 | 2.89 | 1.66 | 3.66 |
| User 7 | 0.90 | 1.36 | 0.26 | 0.32 | 0.86 | 1.58 | 3.17 | 2.66 | 1.22 | 3.22 |
| User 8 | 0.74 | 1.40 | 0.22 | 0.41 | 0.82 | 1.50 | 3.17 | 2.81 | 1.54 | 2.88 |
| User 9 | 1.12 | 1.76 | 1.04 | 0.64 | 0.82 | 2.14 | 3.18 | 2.79 | 1.41 | 2.54 |
| User 10 | 0.70 | 1.36 | 0.22 | 0.26 | 1.50 | 1.54 | 3.17 | 2.82 | 2.24 | 3.12 |
# Checking for missing data
travel_df.isna().sum()
Art Galleries             0
Dance Clubs               0
Juice Bars                0
Restaurants               0
Museums                   0
Resorts                   0
Parks & Picnic Spots      0
Beaches                   0
Theaters                  0
Religious Institutions    0
dtype: int64
# Removing index title
travel_df.index.name = ""
# Summary of variables
travel_df.describe()
| | Art Galleries | Dance Clubs | Juice Bars | Restaurants | Museums | Resorts | Parks & Picnic Spots | Beaches | Theaters | Religious Institutions |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 | 980.000000 |
| mean | 0.893194 | 1.352612 | 1.013306 | 0.532500 | 0.939735 | 1.842898 | 3.180939 | 2.835061 | 1.569439 | 2.799224 |
| std | 0.326912 | 0.478280 | 0.788607 | 0.279731 | 0.437430 | 0.539538 | 0.007824 | 0.137505 | 0.364629 | 0.321380 |
| min | 0.340000 | 0.000000 | 0.130000 | 0.150000 | 0.060000 | 0.140000 | 3.160000 | 2.420000 | 0.740000 | 2.140000 |
| 25% | 0.670000 | 1.080000 | 0.270000 | 0.410000 | 0.640000 | 1.460000 | 3.180000 | 2.740000 | 1.310000 | 2.540000 |
| 50% | 0.830000 | 1.280000 | 0.820000 | 0.500000 | 0.900000 | 1.800000 | 3.180000 | 2.820000 | 1.540000 | 2.780000 |
| 75% | 1.020000 | 1.560000 | 1.572500 | 0.580000 | 1.200000 | 2.200000 | 3.180000 | 2.910000 | 1.760000 | 3.040000 |
| max | 3.220000 | 3.640000 | 3.620000 | 3.440000 | 3.300000 | 3.760000 | 3.210000 | 3.390000 | 3.170000 | 3.660000 |
Overall, this dataset is clean: there are no missing values, every rating falls within the 0 to 4 scale, and there are no apparent outliers.
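The "no apparent outliers" reading comes from eyeballing the summary table. One way to double-check it is a per-column count of values outside the usual 1.5 × IQR fences. The following is a minimal sketch on toy data, not part of the original notebook: the helper name `count_iqr_outliers` and the toy values are assumptions for illustration.

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on column labels
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()


# Toy ratings standing in for travel_df (hypothetical values)
toy = pd.DataFrame({
    "Restaurants": [0.4, 0.5, 0.5, 0.6, 3.4],
    "Beaches": [2.7, 2.8, 2.8, 2.9, 3.0],
})

outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Calling `count_iqr_outliers(travel_df)` after the renaming step would apply the same check to the real ratings. Columns with a long upper tail may be flagged under this rule even when the values are valid ratings, so the counts are a prompt for inspection rather than a verdict.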
# Creating subset of users with highest ratings for Dance Clubs
travel_df_dance = travel_df[travel_df["Dance Clubs"] > 2.9]
travel_df_dance.info()
<class 'pandas.core.frame.DataFrame'>
Index: 16 entries, User 12 to User 883
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Art Galleries           16 non-null     float64
 1   Dance Clubs             16 non-null     float64
 2   Juice Bars              16 non-null     float64
 3   Restaurants             16 non-null     float64
 4   Museums                 16 non-null     float64
 5   Resorts                 16 non-null     float64
 6   Parks & Picnic Spots    16 non-null     float64
 7   Beaches                 16 non-null     float64
 8   Theaters                16 non-null     float64
 9   Religious Institutions  16 non-null     float64
dtypes: float64(10)
memory usage: 1.4+ KB
travel_df_dance
| | Art Galleries | Dance Clubs | Juice Bars | Restaurants | Museums | Resorts | Parks & Picnic Spots | Beaches | Theaters | Religious Institutions |
|---|---|---|---|---|---|---|---|---|---|---|
| User 12 | 0.96 | 2.96 | 0.29 | 0.38 | 0.88 | 2.08 | 3.17 | 2.93 | 1.66 | 3.42 |
| User 49 | 0.45 | 2.96 | 0.26 | 0.40 | 0.56 | 1.68 | 3.18 | 2.90 | 1.44 | 2.72 |
| User 76 | 0.99 | 2.96 | 1.71 | 0.45 | 1.36 | 1.96 | 3.19 | 2.69 | 1.38 | 2.32 |
| User 146 | 0.96 | 2.96 | 0.29 | 0.38 | 0.88 | 2.08 | 3.17 | 2.93 | 1.66 | 3.42 |
| User 229 | 0.54 | 2.96 | 0.58 | 0.57 | 0.78 | 1.70 | 3.18 | 2.91 | 1.28 | 3.30 |
| User 378 | 0.70 | 2.96 | 0.43 | 0.50 | 1.12 | 1.92 | 3.17 | 2.94 | 1.31 | 3.26 |
| User 423 | 0.51 | 2.92 | 0.16 | 1.58 | 1.54 | 2.00 | 3.18 | 2.74 | 0.96 | 2.46 |
| User 484 | 0.58 | 2.96 | 1.90 | 0.53 | 1.44 | 2.24 | 3.19 | 3.18 | 2.18 | 2.70 |
| User 546 | 0.86 | 2.96 | 0.24 | 0.41 | 0.72 | 1.62 | 3.18 | 2.98 | 1.54 | 2.98 |
| User 600 | 1.12 | 2.96 | 0.16 | 0.50 | 0.48 | 1.76 | 3.18 | 2.78 | 1.76 | 2.72 |
| User 609 | 1.06 | 3.12 | 1.58 | 0.48 | 0.56 | 3.34 | 3.18 | 2.67 | 1.76 | 2.94 |
| User 642 | 0.54 | 2.96 | 0.38 | 0.61 | 1.62 | 2.42 | 3.18 | 3.22 | 2.43 | 2.80 |
| User 703 | 1.09 | 2.96 | 1.79 | 0.79 | 1.18 | 2.00 | 3.19 | 2.66 | 2.59 | 2.46 |
| User 729 | 0.86 | 2.96 | 0.21 | 1.98 | 0.90 | 1.62 | 3.18 | 2.70 | 1.47 | 2.46 |
| User 813 | 0.96 | 3.64 | 1.31 | 0.39 | 1.34 | 2.66 | 3.18 | 3.02 | 1.57 | 3.12 |
| User 883 | 0.96 | 2.96 | 0.74 | 0.47 | 0.96 | 2.12 | 3.18 | 2.86 | 1.44 | 2.64 |
# Sorting by Dance Clubs ratings
travel_df_dance = travel_df_dance.sort_values(by=["Dance Clubs"], ascending=False)
# Setting font
sns.set(font_scale=1.2)
# Heatmap
ax = sns.heatmap(travel_df_dance, cmap=plt.cm.Blues, linewidths=.1, xticklabels=True, yticklabels=True)
fig = ax.get_figure()
fig.set_size_inches(15, 20)
# Labels, rotation
plt.xticks(rotation=70)
ax.set_title('2018 TripAdvisor Average Traveler Ratings for East Asia Destinations\n (Dance Club fans)', fontsize=20)
Text(0.5, 1.0, '2018 TripAdvisor Average Traveler Ratings for East Asia Destinations\n (Dance Club fans)')
# Creating subset of users with highest ratings for Restaurants
travel_df_food = travel_df[travel_df["Restaurants"] > 2]
travel_df_food.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, User 248 to User 830
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Art Galleries           9 non-null      float64
 1   Dance Clubs             9 non-null      float64
 2   Juice Bars              9 non-null      float64
 3   Restaurants             9 non-null      float64
 4   Museums                 9 non-null      float64
 5   Resorts                 9 non-null      float64
 6   Parks & Picnic Spots    9 non-null      float64
 7   Beaches                 9 non-null      float64
 8   Theaters                9 non-null      float64
 9   Religious Institutions  9 non-null      float64
dtypes: float64(10)
memory usage: 792.0+ bytes
My initial conclusion from the dance club heatmap was that dance club fans also tend to enjoy parks and beaches. Looking back at the dataframe summary, though, all users rated parks (mean 3.18) and beaches (mean 2.84) highly, so those two categories don't distinguish dance club fans from anyone else. It might be more helpful to look at restaurants and art galleries, which were both rated low by dance club fans.
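A category only says something about a subgroup if the subgroup rates it differently from everyone else. One way to check this is to subtract the overall column means from the subgroup's column means. This is a minimal sketch on toy data, not the notebook's own method: `mean_difference` and the toy values are hypothetical.

```python
import pandas as pd


def mean_difference(subset: pd.DataFrame, full: pd.DataFrame) -> pd.Series:
    """Subset column means minus full-data column means, largest first."""
    return (subset.mean() - full.mean()).sort_values(ascending=False)


# Toy ratings standing in for travel_df (hypothetical values)
toy = pd.DataFrame(
    {
        "Dance Clubs": [3.0, 3.2, 1.0, 1.2],
        "Beaches": [2.8, 2.9, 2.8, 2.9],
        "Restaurants": [0.4, 0.6, 1.0, 1.2],
    },
    index=["User 1", "User 2", "User 3", "User 4"],
)

# Same kind of filter the notebook uses for dance club fans
fans = toy[toy["Dance Clubs"] > 2.9]

diff = mean_difference(fans, toy)
print(diff)
```

In the toy data, Beaches has a difference of about zero because everyone rates it highly, which mirrors the situation described above. With the real data, `mean_difference(travel_df_dance, travel_df)` would show which categories actually separate dance club fans from the rest.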
travel_df_food
| | Art Galleries | Dance Clubs | Juice Bars | Restaurants | Museums | Resorts | Parks & Picnic Spots | Beaches | Theaters | Religious Institutions |
|---|---|---|---|---|---|---|---|---|---|---|
| User 248 | 1.50 | 1.96 | 2.08 | 2.73 | 1.12 | 2.94 | 3.20 | 2.63 | 1.63 | 2.46 |
| User 275 | 1.15 | 1.76 | 1.33 | 2.91 | 0.74 | 1.90 | 3.18 | 2.78 | 1.38 | 2.72 |
| User 287 | 0.93 | 1.76 | 0.58 | 2.25 | 2.00 | 2.44 | 3.18 | 2.67 | 1.22 | 2.42 |
| User 438 | 1.06 | 1.08 | 0.13 | 2.11 | 0.32 | 2.50 | 3.18 | 2.78 | 1.98 | 2.50 |
| User 593 | 0.70 | 2.28 | 0.22 | 2.38 | 0.38 | 1.28 | 3.18 | 2.81 | 1.38 | 2.66 |
| User 602 | 1.15 | 0.80 | 0.26 | 3.10 | 0.64 | 1.86 | 3.18 | 2.77 | 2.02 | 2.62 |
| User 667 | 1.95 | 1.52 | 1.94 | 3.44 | 0.64 | 2.94 | 3.20 | 2.62 | 1.54 | 2.46 |
| User 695 | 0.96 | 1.80 | 0.56 | 2.25 | 2.02 | 2.40 | 3.18 | 2.69 | 1.25 | 2.42 |
| User 830 | 0.93 | 1.76 | 0.58 | 2.29 | 2.00 | 2.42 | 3.18 | 2.69 | 1.31 | 2.42 |
# Sorting by Restaurants ratings
travel_df_food = travel_df_food.sort_values(by=["Restaurants"], ascending=False)
# Setting font
sns.set(font_scale=1.2)
# Heatmap
ax = sns.heatmap(travel_df_food, cmap=plt.cm.Oranges, linewidths=.1, xticklabels=True, yticklabels=True)
fig = ax.get_figure()
fig.set_size_inches(15, 10)
# Labels, rotation
plt.xticks(rotation=70)
ax.set_title('2018 TripAdvisor Average Traveler Ratings for East Asia Destinations\n (Restaurants fans)', fontsize=20)
Text(0.5, 1.0, '2018 TripAdvisor Average Traveler Ratings for East Asia Destinations\n (Restaurants fans)')
We can see in this heatmap of restaurant fans that Art Galleries and Museums are mostly rated low, although three of the nine users rate Museums around 2.0. Dance Clubs also fall toward the lower end of the scale.
At this point I'm curious about the general relationships of ratings across all categories, so I'm going to create a pairs plot and also explore some visualizations in Tableau.
# Creating Pairplot
sns.pairplot(travel_df)
<seaborn.axisgrid.PairGrid at 0x10d65a668>
This pairs plot helps reveal patterns and trends across all the data. We can quickly see the distribution of each variable and the relationships between pairs of variables. There don't appear to be many strong relationships, but Museum and Resort ratings have a slight positive correlation.
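To put numbers on what the pairs plot suggests, the pairwise Pearson correlations can be computed with `DataFrame.corr()` and ranked by absolute value. This is a minimal sketch on toy data, not part of the original analysis: the helper `top_correlations` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()  # Pearson correlation by default
    # Keep only the upper triangle so each pair appears once, without the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy ratings standing in for travel_df (hypothetical values)
toy = pd.DataFrame({
    "Museums": [1, 2, 3, 4, 5],
    "Resorts": [2, 4, 6, 8, 10],
    "Theaters": [5, 3, 4, 1, 2],
})

top = top_correlations(toy)
print(top)
```

Running `top_correlations(travel_df)` on the cleaned dataframe would list the strongest pairs and give a numeric check on the Museums and Resorts relationship seen in the plot.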
With a better understanding of the data, I'll save the cleaned dataframe and move into R for clustering analysis.
# Saving dataframe as csv file
travel_df.to_csv('travel_df.csv')
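Because the index title was cleared earlier, the saved file has an empty header for its first column, which holds the user IDs. A quick round trip can confirm that the IDs and ratings survive the save. This is a minimal sketch assuming an in-memory buffer and toy values in place of the real file.

```python
import io

import pandas as pd

# Toy frame standing in for travel_df (hypothetical values)
toy = pd.DataFrame({"Beaches": [2.79, 2.63]}, index=["User 1", "User 2"])
toy.index.name = ""

# Write to an in-memory buffer instead of 'travel_df.csv'
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the first column (the user IDs) as the index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Any tool that reads the file afterwards needs the same treatment of the first column as row labels rather than as a rating.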