File: CA-Travel-Trends_DataPrep_EDA.ipynb
Names: Corinne Medeiros, Amy Nestingen
Date: 11/12/20
Usage: Program cleans data, generates exploratory visualizations, and saves cleaned data to a csv file.

Predicting Travel Trends in the United States 2019 - 2020

Data Prep and Exploratory Data Analysis

Data source:
https://catalog.data.gov/dataset/trips-by-distance

Loading and Previewing Data

Displaying Data Summaries

Observations so far:

Missing Data

Since we have a good amount of data to work with, we're going to remove the rows with missing data.
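
A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The small frame below and its column names are illustrative stand-ins, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded trips data; column names are assumptions.
trips = pd.DataFrame({
    "Date": ["2019/01/01", "2019/01/02", "2019/01/03", "2019/01/04"],
    "County Name": ["Alameda County", None, "Kern County", "Yolo County"],
    "Number of Trips": [1200.0, 950.0, np.nan, 400.0],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = trips.isna().sum()

# Drop every row that has at least one missing value.
trips_clean = trips.dropna().reset_index(drop=True)
```

Dropping rows is the simplest option and is reasonable here only because plenty of observations remain afterward.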

Data Cleanup

With the newly updated data, we now have observations through October 24, 2020. The previous version of the data set ended on August 29th.
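
One way to check the date range, sketched under the assumption that dates arrive as text in a column named "Date" with a year/month/day layout. Both the column name and the format string are assumptions and may need adjusting for the real file.

```python
import pandas as pd

# Hypothetical sample rows; the real data set has many more columns.
trips = pd.DataFrame({
    "Date": ["2020/10/24", "2019/01/01", "2020/08/29"],
    "Number of Trips": [500, 1200, 700],
})

# Parse the text dates into datetimes so they sort and plot chronologically.
trips["Date"] = pd.to_datetime(trips["Date"], format="%Y/%m/%d")
trips = trips.sort_values("Date").reset_index(drop=True)

# Earliest and latest observation dates in the data.
first_date = trips["Date"].min()
last_date = trips["Date"].max()
```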

Exploratory Visualizations: All Data

Plotting every observation in one graph is cluttered and hard to read, so to make our data more manageable and focused, we're going to narrow our analysis to California counties. Even with the overcrowding in the plot above, the general ups and downs of travel are visible, and the trend matches what we originally expected: more trips in the first half of the graph (2019), a drop across most of 2020, a dramatic spike during the summer (August 2020), and a final drop at the end of the data in the fall. To better explain this trend, we might need to supplement with Covid-19 data.

Filtering Data to California Counties

Previously, there were 34,863 observations for California, so with the updated data we've gained 3,230 more observations.
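
The filter can be sketched as follows. The column names "Level", "State Postal Code", and "County Name" are assumptions about the data set's layout, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows mixing national, state, and county level records.
trips = pd.DataFrame({
    "Level": ["National", "State", "County", "County", "County"],
    "State Postal Code": [None, "CA", "CA", "CA", "NV"],
    "County Name": [None, None, "Alameda County", "Kern County", "Clark County"],
    "Number of Trips": [9000, 4000, 1200, 800, 600],
})

# Keep only county-level rows whose state code is California.
is_ca_county = (trips["Level"] == "County") & (trips["State Postal Code"] == "CA")
ca_trips = trips.loc[is_ca_county].reset_index(drop=True)

# Number of California county observations after filtering.
n_observations = len(ca_trips)
```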

Exploratory Visualizations: California Data

This boxplot looks slightly better than the last plot and conveys the same overall trend, but there are still too many data points to show effectively in a boxplot. Next, we will try a few scatterplots.

The first scatterplot, depicting trips taken in California, follows the expected pattern, while the second, depicting the population staying at home, shows the opposite pattern, which is also expected. It makes sense that trips taken would be negatively correlated with the population staying at home: in general, the fewer trips taken, the more people are staying at home.
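
The negative relationship seen in the scatterplots could also be checked numerically. This is a rough sketch with made-up values and assumed column names, not a result from the actual data.

```python
import pandas as pd

# Hypothetical values: trips fall as the stay-at-home population rises.
ca_trips = pd.DataFrame({
    "Number of Trips": [1200, 1100, 700, 500, 900],
    "Population Staying at Home": [200, 250, 480, 600, 350],
})

# Pearson correlation; a value near -1 indicates a strong negative relationship.
r = ca_trips["Number of Trips"].corr(ca_trips["Population Staying at Home"])
```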

To account for the dramatic change in trips taken and population staying at home beginning in March of 2020, it will be helpful to look at Covid-19 cases in California.

Covid-19 Supplemental Data

Data Source:
https://data.ca.gov/dataset/covid-19-cases/resource/926fd08f-cc91-4828-af38-bd45de97f8c3

Loading Data

Previously, our data included 11,225 rows, but now with updated data we have 13,925.

Cleaning Covid-19 Data

This updated data set includes observations through November 5, 2020.

Plotting Covid-19 Data

From this graph we can confirm that the number of Covid-19 cases started rising in March of 2020, right as the number of trips taken started decreasing and the population staying at home started increasing. With the newly added data, we can also see the recent, even more dramatic rise in cases during our current month of November 2020.

At this point, we'll save our cleaned data as csv files to import into RStudio for modeling.

Saving Cleaned Data
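
A minimal sketch of the save step, assuming the cleaned data is a pandas DataFrame. The file name and temporary directory are hypothetical, and the sample rows are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data; column names are assumptions.
ca_trips = pd.DataFrame({
    "Date": pd.to_datetime(["2020-10-23", "2020-10-24"]),
    "County Name": ["Alameda County", "Kern County"],
    "Number of Trips": [1200, 800],
})

# Hypothetical output location; the notebook's real file names are not shown.
out_path = Path(tempfile.mkdtemp()) / "ca_trips_clean.csv"

# index=False keeps the row index out of the file, so it does not
# appear as an extra unnamed column when the file is read elsewhere.
ca_trips.to_csv(out_path, index=False)

# Read the file back to confirm it round-trips.
reloaded = pd.read_csv(out_path, parse_dates=["Date"])
```

Writing without the index keeps the file tidy for import into RStudio.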