File: Wildlife-Population-Harvest.ipynb
Name: Corinne Medeiros
Date: 11/16/19
Usage: This program reads the wildlife population and harvest data for the Forest Service 2010 RPA Assessment. It analyzes endangered and threatened status across species through visualizations, analytical distribution models, hypothesis tests, and correlation tests.

Wildlife Population and Harvest Data

This dataset comes from the U.S. Department of Agriculture’s website, where Forest Service Research & Development (FS R&D) publishes wildlife population and harvest data. It includes data captured from 1955 until 2010 and extends across a range of assessment areas, including the Pacific Coast, Rocky Mountain, North, and South. For this project I'm focusing on groups of endangered and threatened species, including mammals, birds, reptiles, and amphibians, with some initial exploration of harvest data by region.

Data Source:

Wildlife population and harvest data for Forest Service 2010 RPA Assessment https://doi.org/10.2737/RDS-2014-0009

Hypothesis:

My hypothesis explores the relationship between time and endangered or threatened status, as well as relationships between specific groups of species and all species together.

All Endangered or Threatened Species

Figure 22 from this dataset contains the cumulative number of species listed as threatened or endangered from 1 July 1976 through 27 October 2010 for all taxa, plants, animals, vertebrate groups (amphibians, birds, fish, mammals, reptiles), and invertebrate groups (arachnids, crustaceans, insects, and molluscs).

I will be focusing on animals from the vertebrate groups.

Importing Data

Cleaning Data

Visualizing Data

Endangerment and Threatened Status of All Species over time

Endangerment and Threatened Status of All Mammals over time

Species by Region - Initial Exploration

Although I ultimately decided to focus on the data from Figure 22, I was interested in exploring some of the region-specific files to get a sense of trends and possibilities for analysis alongside the main data file.

Visualizing Pacific Coast Region Data

Scatter plot representing the Red Fox harvest sum in comparison to the harvest sum of all species from the Pacific Coast region.

Visualizing South Region Data

Scatter plot representing the Red Fox harvest sum in comparison to the harvest sum of all species from the South region.

The region-specific data would be difficult to combine with the main dataset because, apart from the year, all of the variables are completely different. I therefore decided to stick with the main dataset from Figure 22.

Endangered or Threatened Species

Histograms of Variables

All Species

All Mammals

All Birds

All Reptiles

All Amphibians

Outliers

All Species

All Mammals

All Birds

All Reptiles

All Amphibians

The largest and smallest values for these variables are reasonable in this context, so there's no need to remove any outliers.
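One simple way to run this kind of check is to list the few smallest and largest values of a variable and judge whether they are plausible. The sketch below uses only the standard library; the counts are hypothetical and are not taken from the dataset, and `extremes` is an illustrative helper name rather than a function from the notebook.

```python
import heapq

def extremes(values, k=3):
    """Return the k smallest and the k largest values."""
    return heapq.nsmallest(k, values), heapq.nlargest(k, values)

# Hypothetical amphibian counts, NOT taken from the dataset.
amphibians = [9, 10, 10, 12, 15, 18, 21, 23, 25, 25]

smallest, largest = extremes(amphibians)
print("smallest:", smallest)
print("largest:", largest)
```

If the extremes sit close to their neighbors, as they do here, there is little reason to treat them as outliers.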

Probability Mass Function (PMF)

I will be using the PMF to get probabilities of the possible values for all mammals threatened or endangered during the early years (1976 – 1980) and compare this to other years.

This PMF compares all mammals threatened or endangered during the early years (1976 – 1980) to other years as both a bar graph and a step function.

There is a much higher probability of seeing values below 40 during the early years (1976 - 1980) versus all other years.
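As a rough illustration of the comparison above, a PMF can be built by dividing the count of each distinct value by the total number of values. This is a minimal sketch in plain Python, not the notebook's own code; the two lists hold hypothetical counts, and `make_pmf` and `prob_below` are illustrative names.

```python
from collections import Counter

def make_pmf(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in sorted(counts.items())}

def prob_below(pmf, threshold):
    """Total probability mass on values strictly below `threshold`."""
    return sum(p for value, p in pmf.items() if value < threshold)

# Hypothetical counts of listed mammals, NOT taken from the dataset.
early_years = [33, 35, 35, 36, 38]
other_years = [39, 45, 52, 60, 60, 68, 75, 80]

early_pmf = make_pmf(early_years)
other_pmf = make_pmf(other_years)

print(prob_below(early_pmf, 40))  # share of early-year values below 40
print(prob_below(other_pmf, 40))  # share of other-year values below 40
```

Because each PMF is normalized by its own sample size, the two groups can be compared even though they contain different numbers of observations.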

Cumulative Distribution Function (CDF)

Over the years, fewer than 10% of the assessments recorded fewer than 10 endangered or threatened reptiles, the most common number was 26, and the highest values, in the mid-30s, fall at or above roughly the 80th percentile of the assessments.

This graph can tell us how a specific reading for reptiles falls within the range of readings for all reptiles.
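To make the idea concrete, the empirical CDF at a value x is the fraction of readings less than or equal to x. The sketch below is one possible implementation using a sorted list and binary search; the reptile counts are hypothetical, and `make_cdf` is an illustrative name, not the notebook's function.

```python
from bisect import bisect_right

def make_cdf(values):
    """Return a function x -> fraction of values less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return cdf

# Hypothetical reptile counts, NOT taken from the dataset.
reptiles = [8, 12, 18, 22, 26, 26, 26, 30, 34, 36]
cdf = make_cdf(reptiles)

print(cdf(9))   # fraction of readings at or below 9
print(cdf(26))  # fraction of readings at or below 26
```

Multiplying the result by 100 gives the percentile rank of a reading within the full set of readings.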

Analytical Distribution

Normal Distribution

The curve for the all birds data deviates from the normal curve of the expected model. The majority of the lower numbers fall between the 10th and 30th percentile ranks, while the most common higher value, 90, falls between the 70th and 90th percentile ranks.

Normal Probability Plot

The Normal Probability Plot confirms a lack of normality: the tails deviate substantially from the model, and overall the points do not follow a straight line.
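The points behind a normal probability plot can be sketched by pairing each sorted observation with the standard normal quantile for its rank. The example below assumes the plotting position (i + 0.5) / n, which is one common convention and not necessarily the one used in the notebook; the bird counts are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_probability_points(values):
    """Pair each sorted value with the standard normal quantile for its rank."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # (i + 0.5) / n is an assumed plotting position; other conventions exist.
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, ordered))

def fitted_line(values):
    """Reference line: where the points would fall if the data were normal."""
    mu, sigma = mean(values), stdev(values)
    return lambda z: mu + sigma * z

# Hypothetical bird counts, NOT taken from the dataset.
birds = [55, 60, 62, 70, 74, 80, 88, 90, 90, 90]
points = normal_probability_points(birds)
line = fitted_line(birds)

for z, observed in points:
    print(f"{z:6.2f}  observed={observed:3d}  expected={line(z):6.1f}")
```

If the data were close to normal, the observed values would track the expected values along the reference line; large gaps in the tails are the kind of deviation described above.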

Scatterplots

All Species Endangered or Threatened over the years

Amphibians vs. Reptiles Endangered or Threatened

The scatter plots suggest strong positive correlation between the status of amphibians and reptiles, and also between dates and all species.

Covariance

The results indicate a positive relationship in both cases, but the units are not standardized, so correlation would be a better option.
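The unit problem can be seen in a small sketch: covariance is the mean product of deviations from each variable's mean, so rescaling one variable rescales the result. The counts below are hypothetical, and the function is a plain-Python illustration rather than the notebook's implementation.

```python
def covariance(xs, ys):
    """Mean of the products of deviations from each variable's mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical counts, NOT taken from the dataset.
amphibians = [9, 10, 12, 15, 18, 21, 25]
reptiles = [12, 18, 22, 26, 30, 34, 36]

cov = covariance(amphibians, reptiles)
# Multiplying one variable by 10 multiplies the covariance by 10,
# even though the strength of the relationship is unchanged.
cov_scaled = covariance([a * 10 for a in amphibians], reptiles)

print(cov, cov_scaled)
```

The sign of the covariance is meaningful, but its magnitude depends on the scale of both variables, which is why a standardized measure is easier to interpret.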

Pearson's Correlation

It appears that there is a strong positive correlation between dates and all species' status, and the status of amphibians and reptiles, but since Pearson's correlation might underestimate the strength of non-linear relationships, I'll try Spearman's correlation as well.

Spearman's Correlation

Spearman's correlation calculations confirm the strong positive relationship in both cases. As an alternative, I'll also try converting the variables to make them closer to linear.
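The difference between the two measures can be illustrated with a relationship that is monotonic but not linear. In this sketch, Spearman's correlation is computed as Pearson's correlation on the ranks, with tied values sharing an average rank; the data are hypothetical and the helper names are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical monotonic, non-linear relationship, NOT from the dataset.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 3 for x in xs]

print(pearson(xs, ys))   # below 1 because the relationship is curved
print(spearman(xs, ys))  # approximately 1 because the ordering is preserved
```

When both measures are high, as reported here, the relationship is strong regardless of whether it is exactly linear.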

Adjusting for Non-Linear Relationships

Even when converting one or both of the variables, there is still a strong correlation in all scenarios.

Hypothesis Testing

Testing Correlation between Birds and Mammals

My null hypothesis is that there is no correlation between the endangered or threatened status of birds and mammals.

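After 1000 iterations per HypothesisTest, the p-value is 0, which tells us that none of the samples simulated under the null hypothesis produced a correlation as strong as the observed one. A p-value of 0 from 1000 iterations does not prove that the true probability is zero; it indicates that a correlation this strong would be very unlikely (less than about 1 in 1000) if birds and mammals were uncorrelated. We can therefore reject the null hypothesis and conclude that the correlation between the endangered or threatened status of birds and mammals is probably not 0.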

In comparing the actual correlation to the highest value from the iterations, we can get an idea of how unexpected the observed value is under the null hypothesis.
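A permutation test of this kind can be sketched as follows: shuffle one variable to break any pairing, recompute the correlation, and count how often the shuffled correlation is at least as large as the observed one. This is a plain-Python approximation of the idea and not the HypothesisTest class used in the notebook; the counts, the seed, and the function names are assumptions made for illustration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def permutation_test(xs, ys, iterations=1000, seed=0):
    """Return (p_value, observed |r|, largest simulated |r|)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson(xs, ys))
    shuffled = list(xs)
    hits = 0
    largest_simulated = 0.0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # simulate the null: no pairing between xs and ys
        simulated = abs(pearson(shuffled, ys))
        largest_simulated = max(largest_simulated, simulated)
        if simulated >= observed:
            hits += 1
    return hits / iterations, observed, largest_simulated

# Hypothetical counts, NOT taken from the dataset.
birds = [55, 60, 62, 70, 74, 80, 85, 88, 90, 92]
mammals = [33, 36, 38, 45, 52, 60, 66, 70, 75, 80]

p_value, observed, largest_simulated = permutation_test(birds, mammals)
print(p_value, observed, largest_simulated)
```

The gap between the observed correlation and the largest simulated one gives a sense of how far the data sit from anything the null hypothesis produced in these iterations.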

Regression Analysis

Overall, the high R^2 values indicate that the regression models fit well, with the status of all species accounting for much of the variation in the status of all mammals. However, there is a problem of multicollinearity: because the explanatory variables are highly correlated with one another, the individual coefficient estimates are less stable, which takes away from the statistical significance of the all species variable.
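For reference, R^2 measures the share of variation in the dependent variable that the fitted model explains. The sketch below fits a single-predictor least-squares line, which is simpler than a model with several explanatory variables and is meant only to show how R^2 is computed; the counts are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """1 minus the ratio of residual variation to total variation."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical counts, NOT taken from the dataset.
all_species = [200, 260, 330, 410, 480, 540, 600]
mammals = [33, 38, 45, 52, 60, 66, 70]

intercept, slope = fit_line(all_species, mammals)
r2 = r_squared(all_species, mammals, intercept, slope)
print(intercept, slope, r2)
```

With one predictor, R^2 equals the square of Pearson's correlation, so a high value here restates the strong correlation rather than adding independent evidence.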