File: Video-Characteristics-Transcoding_EDA.ipynb
Name: Corinne Medeiros
Date: 7/4/21
Desc: Analyzing Video Characteristics and Transcoding Time EDA
Usage: Program imports and cleans data, generates charts, and calculates correlation.

Analyzing Video Characteristics and Transcoding Time

Data Source

Online Video Characteristics and Transcoding Time Dataset
https://archive.ics.uci.edu/ml/datasets/Online+Video+Characteristics+and+Transcoding+Time+Dataset

This dataset from the UCI Machine Learning Repository contains two TSV files. The first file contains 168,286 randomly sampled YouTube videos from 2015 along with their characteristics, including duration, bitrate, height, width, frame rate, codec, category, and URL. The second file contains 68,784 transcoding test instances run on a sample of videos from the first file. A more detailed list of attributes is available in the dataset description linked above.

Loading Videos Data
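
As a minimal sketch of this step, a tab-separated file can be read with pandas by passing `sep="\t"` to `read_csv`. The inline sample below stands in for the real videos file: its rows and values are made up for illustration, and the column names are assumptions based on the attribute list in the dataset description.

```python
import io

import pandas as pd

# Hypothetical stand-in for the videos TSV file; all values are invented.
sample_tsv = (
    "id\tduration\tbitrate\theight\twidth\tframe_rate\tcodec\tcategory\n"
    "vidA\t267\t373\t568\t320\t29.97\th264\tMusic\n"
    "vidA\t267\t512\t480\t360\t29.97\tvp8\tMusic\n"
    "vidB\t31\t382\t320\t240\t25.0\tflv1\tComedy\n"
)

# For the real dataset, pass the path to the TSV file instead of StringIO.
videos = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(videos.shape)   # (rows, columns)
print(videos.dtypes)  # inferred type of each column
```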

It looks like we don't have any missing data, but some video IDs are duplicated. Each video exists in several formats, which explains the duplicate IDs. Let's find out which codecs and categories are present. I'll also look at duration.
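
The checks described above can be sketched with a few pandas calls. The DataFrame here is a small made-up example, and the column names `id`, `duration`, `codec`, and `category` are assumptions about how the videos file is labeled.

```python
import pandas as pd

# Made-up rows for illustration; the same video ID appears once per format.
videos = pd.DataFrame(
    {
        "id": ["vidA", "vidA", "vidB", "vidC", "vidC"],
        "duration": [267, 267, 31, 1200, 1200],
        "codec": ["h264", "vp8", "flv1", "h264", "mpeg4"],
        "category": ["Music", "Music", "Comedy", "News", "News"],
    }
)

# Count missing values per column.
missing_per_column = videos.isnull().sum()

# Count rows whose ID already appeared in an earlier row.
duplicate_ids = int(videos["id"].duplicated().sum())

# Fully identical rows would be true duplicates rather than alternate formats.
identical_rows = int(videos.duplicated().sum())

# Distinct codecs and categories present in the data.
codecs_present = sorted(videos["codec"].unique())
categories_present = sorted(videos["category"].unique())

# Summary statistics for duration (in seconds).
duration_summary = videos["duration"].describe()

print(missing_per_column, duplicate_ids, identical_rows, sep="\n")
print(codecs_present, categories_present, duration_summary, sep="\n")
```

Comparing `duplicate_ids` with `identical_rows` is one way to confirm that repeated IDs come from different formats of the same video rather than from repeated rows.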

Visualizing Variables

It appears that most videos are under roughly 42 minutes (2,500 seconds).
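
To put a number on that impression, the share of videos below a duration threshold can be computed directly, along with the bin counts that sit behind a histogram. This is only a sketch: the durations below are invented, and the 2,500-second bin width is chosen to match the threshold mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up durations in seconds, for illustration only.
durations = pd.Series([30, 95, 240, 610, 1800, 2400, 3100, 7200], name="duration")

threshold_seconds = 2500
threshold_minutes = threshold_seconds / 60  # about 41.7 minutes

# Fraction of videos shorter than the threshold.
share_under = (durations < threshold_seconds).mean()

# Bin counts behind a histogram, using 2,500-second-wide bins.
bin_edges = np.arange(0, durations.max() + threshold_seconds, threshold_seconds)
counts, edges = np.histogram(durations, bins=bin_edges)

print(f"{share_under:.0%} of videos are under {threshold_minutes:.1f} minutes")
print(dict(zip(edges[:-1].tolist(), counts.tolist())))
```

The same bin edges can be passed as `bins` to `Series.plot.hist` to draw the chart itself.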

Loading Transcoding Data

Here we can see that the most common codec in the transcoded observations is H264.
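
A quick way to back up a statement like this is to compute each codec's share of the rows. The sketch below uses invented rows, and the column name `codec` is an assumption about how the transcoding file labels the codec.

```python
import pandas as pd

# Made-up transcoding rows for illustration; the column name is assumed.
transcoding = pd.DataFrame(
    {"codec": ["h264", "h264", "vp8", "mpeg4", "h264", "flv"]}
)

# Share of observations per codec, largest first.
codec_share = transcoding["codec"].value_counts(normalize=True)

# Label of the most frequent codec.
most_common_codec = codec_share.idxmax()

print(codec_share)
print(most_common_codec)
```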

Correlation: Pearson Ranking Charts

Here we can see that memory usage (umem) is highly correlated with transcode time. Bitrate and duration are also positively correlated with size, which makes sense because longer, higher-bitrate videos produce larger files. The remaining relationships are weak.
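
For readers who want the numbers behind a ranking chart, the Pearson correlation matrix can be computed with `DataFrame.corr` and sorted against transcode time. This is a sketch on synthetic data, not a reproduction of the dataset or of the charting code used in the notebook: the relationships are built in by construction, and `utime` is an assumed column name for transcode time.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data for illustration only; it does not reproduce the dataset.
duration = rng.uniform(10, 600, n)
bitrate = rng.uniform(100, 5000, n)
umem = rng.uniform(20_000, 700_000, n)

df = pd.DataFrame(
    {
        "duration": duration,
        "bitrate": bitrate,
        # Size grows with both duration and bitrate, by construction.
        "size": duration * bitrate,
        "umem": umem,
        # Transcode time is tied to memory usage, by construction.
        "utime": umem / 50_000 + rng.normal(0, 1, n),
        # An unrelated variable, expected to show a weak correlation.
        "noise": rng.normal(0, 1, n),
    }
)

# Pairwise Pearson correlation coefficients.
corr = df.corr(method="pearson")

# Rank every other variable by absolute correlation with transcode time.
ranking = corr["utime"].drop("utime").abs().sort_values(ascending=False)

print(corr.round(2))
print(ranking.round(2))
```

Pearson's coefficient only measures linear association, so a low value here does not rule out a nonlinear relationship between two variables.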