Book Insights from Goodreads

## Warning: package 'tidyr' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

Introduction

Using data from the Goodreads API, I’m analyzing which factors have the most impact on the popularity of a book. Booksellers, publishers, and authors could be interested in this analysis for optimizing ratings given known factors of success. This dataset, downloaded from Kaggle, contains titles, authors, ratings, languages, page counts, and ratings counts for over 13,000 books. It was created on May 25, 2019, and has been frequently updated. Redundant information has been removed, with the purpose of creating a clean dataset for booklovers, consisting of 10 variables. The Kaggle dataset can be downloaded here: https://www.kaggle.com/jealousleopard/goodreadsbooks

Preview of the original data:

# Glimpse of original datset
glimpse(books_data)

## Observations: 13,724
## Variables: 10
## $ bookID             <int> 1, 2, 3, 4, 5, 8, 9, 10, 12, 13, 14, 16, 18, …
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harr…
## $ authors            <chr> "J.K. Rowling-Mary GrandPré", "J.K. Rowling-M…
## $ average_rating     <chr> "4.56", "4.49", "4.47", "4.41", "4.55", "4.78…
## $ isbn               <chr> "0439785960", "0439358078", "0439554934", "04…
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554…
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "eng", "en…
## $ X..num_pages       <chr> "652", "870", "320", "352", "435", "2690", "1…
## $ ratings_count      <int> 1944099, 1996446, 5629932, 6267, 2149872, 388…
## $ text_reviews_count <int> 26249, 27613, 70390, 272, 33964, 154, 1, 820,…

Data Cleaning Process

Adjusting column names and variable types to improve readability and future plotting:

# Changing "X..num_pages" column name to "number_of_pages"
colnames(books_data)[colnames(books_data)=="X..num_pages"] <- "number_of_pages"

# Changing number of pages to number type
books_data$number_of_pages <- as.integer(books_data$number_of_pages)

## Warning: NAs introduced by coercion

Checking for and removing missing values:

# Looking at summary of missing values
missing <- is.na(books_data)
summary(missing)

##    bookID          title          authors        average_rating 
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:13724     FALSE:13724     FALSE:13724     FALSE:13724    
##                                                                 
##     isbn           isbn13        language_code   number_of_pages
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:13724     FALSE:13724     FALSE:13724     FALSE:13714    
##                                                  TRUE :10       
##  ratings_count   text_reviews_count
##  Mode :logical   Mode :logical     
##  FALSE:13719     FALSE:13719       
##  TRUE :5         TRUE :5

# Total number of missing values
sum(is.na(books_data))

## [1] 20

# Removing missing values
books_data_no_na <- na.omit(books_data)

# Checking new dataset to confirm missing values have been removed
sum(is.na(books_data_no_na))

## [1] 0

# Replacing all numbers in language code column with empty string
books_data_no_na$language_code <- gsub("[0-9]+", "", books_data_no_na$language_code)

# Removing observations from language code column that are only empty strings
books_data_no_na <- books_data_no_na[books_data_no_na$language_code!="",]

Removing unneccessary columns from dataset:

I’m removing the isbn columns since they contain all unique values that won’t help me in this analysis.

books_data_new = subset(books_data_no_na, select = -c(isbn, isbn13))

# Preview of cleaned dataset
head(books_data_new, n = 10)

bookID	title	authors	average_rating	language_code	number_of_pages	ratings_count	text_reviews_count
1	Harry Potter and the Half-Blood Prince (Harry Potter #6)	J.K. Rowling-Mary GrandPré	4.56	eng	652	1944099	26249
2	Harry Potter and the Order of the Phoenix (Harry Potter #5)	J.K. Rowling-Mary GrandPré	4.49	eng	870	1996446	27613
3	Harry Potter and the Sorcerer’s Stone (Harry Potter #1)	J.K. Rowling-Mary GrandPré	4.47	eng	320	5629932	70390
4	Harry Potter and the Chamber of Secrets (Harry Potter #2)	J.K. Rowling	4.41	eng	352	6267	272
5	Harry Potter and the Prisoner of Azkaban (Harry Potter #3)	J.K. Rowling-Mary GrandPré	4.55	eng	435	2149872	33964
8	Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5)	J.K. Rowling-Mary GrandPré	4.78	eng	2690	38872	154
9	Unauthorized Harry Potter Book Seven News: Half-Blood Prince Analysis and Speculation	W. Frederick Zimmerman	3.69	en-US	152	18	1
10	Harry Potter Collection (Harry Potter #1-6)	J.K. Rowling	4.73	eng	3342	27410	820
12	The Ultimate Hitchhiker’s Guide: Five Complete Novels and One Story (Hitchhiker’s Guide to the Galaxy #1-5)	Douglas Adams	4.38	eng	815	3602	258
13	The Ultimate Hitchhiker’s Guide to the Galaxy	Douglas Adams	4.38	eng	815	240189	3954

Accounting for multiple authors:

# Converting authors column to character type
books_data_new$authors <- as.character(books_data_new$authors)

# Adding new variable to dataset to identify books with multiple authors
books_data_new$multiple_authors <- grepl("-", books_data_new$authors)

Data Visualization:

First, I’m looking at how page count affects average rating. As page numbers increase, do ratings rise or fall?

# Filtering data to page numbers less than 1,000 
books_data_new <- 
  books_data_new %>%  
  filter(number_of_pages < 1000)

# Scatterplot to explore relationship between number of pages and average rating
ggplot(books_data_new, aes(x = number_of_pages, y = as.numeric(average_rating))) +
  geom_point(alpha = 0.5, color = 'darkgreen') + 
  ggtitle("Goodreads Number of Pages vs. Average Rating") +
  xlab("Number of Pages") + ylab("Average Rating")

This graph is the most dense between 100 and 400 pages, which tells us that because the majority of books have that range of page numbers, that’s where most of the ratings fall. As expected, the highest and lowest ratings have very little data points. As page numbers increase, there is a drop in the amount of ratings, therefore there are less data points.

Next, I’m going to look at language codes. What is the distribution like?

# Histogram to show distribution of language codes
ggplot(data = books_data_new) +
  geom_bar(mapping = aes(x = language_code), fill = 'purple') + 
  scale_x_discrete(labels = c("en-US" = "US", "en-GB" = "GB", "en-CA" = "CA")) + 
  ggtitle("Goodreads Language Codes") +
  xlab("Language Code") + ylab("Count")

Since the majority is English, I’ll try filtering out the outliers.

# Filtering data to only English language books
books_data_new.eng <- 
  books_data_new %>%  
  filter(language_code=="eng")

# Scatterplot to explore relationship between number of pages and average rating
ggplot(books_data_new.eng, aes(x = number_of_pages, y = as.numeric(average_rating), color = ratings_count)) +
  geom_point(alpha = 0.5) + 
  ggtitle("Goodreads Number of Pages vs. Average Rating (English Language)") +
  xlab("Number of Pages") + ylab("Average Rating") + labs(color = "Ratings Count") + 
  scale_color_continuous(labels = comma)

This graph is very similar to the previous scatterplot, because the majority of books in this dataset are in fact English language books. The same conclusions apply.

Do books with a higher amount of reviews have higher ratings?

# Filtering data to books with less than 500 reviews
books_data_new.eng.f <-
  books_data_new.eng %>%
  filter(text_reviews_count < 500)

# Scatterplot for number of reviews vs. average rating
ggplot(books_data_new.eng.f, aes(x = text_reviews_count, y = as.numeric(average_rating), color = factor(multiple_authors))) +
  geom_point(alpha = 0.5) + 
  ggtitle("Goodreads Number of Reviews vs. Average Rating (English Language)") +
  xlab("Number of Reviews") + ylab("Average Rating") + labs(color = "Multiple Authors")

# Correlation between reviews and ratings
cor(books_data_new.eng$text_reviews_count, as.numeric(books_data_new.eng$average_rating))

## [1] 0.04413932

The average rating stays the same even as more people review it. The reviews beyond a certain number don’t have any impact on rating. Also, having multiple authors doesn’t seem to have any distinguishable effect on average rating. The correlation analysis of ~.04 confirms there is little to no significant relationship between the number of reviews and average rating.

Multiple Regression Analysis:

# Selecting variables from dataset
books_data_new.eng.1 <- books_data_new.eng[, c("average_rating", "number_of_pages", "ratings_count", "text_reviews_count")]

# Multiple Regression model predicting average rating
books_data_new.eng.1_mod <- lm(as.numeric(average_rating) ~ number_of_pages + ratings_count + text_reviews_count, data = books_data_new.eng)

summary(books_data_new.eng.1_mod)

## 
## Call:
## lm(formula = as.numeric(average_rating) ~ number_of_pages + ratings_count + 
##     text_reviews_count, data = books_data_new.eng)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0525 -0.1517  0.0250  0.1990  1.1616 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.838e+00  7.083e-03 541.848   <2e-16 ***
## number_of_pages    2.586e-04  1.932e-05  13.386   <2e-16 ***
## ratings_count      1.104e-07  5.463e-08   2.020   0.0434 *  
## text_reviews_count 3.356e-07  2.451e-06   0.137   0.8911    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3612 on 10357 degrees of freedom
## Multiple R-squared:  0.01925,    Adjusted R-squared:  0.01896 
## F-statistic: 67.75 on 3 and 10357 DF,  p-value: < 2.2e-16

The results from the multiple regression analysis summary reveal the low significance of the relationship between these variables in terms of predicting average rating.

Implications to the consumer:

Some of the key takeaways for Booksellers, publishers, and authors are: the number of authors doesn’t impact ratings, readers generally tend to gravitate towards books that have between 100 and 400 pages, and the amount of reviews a book has doesn’t impact average rating. The initial round of reviews is critical since they have more weight than reviews that come later for a book.

Limitations and areas of improvement:

Some of the limitations of this analysis include limited variables and limited language codes. This limits us from gaining insights from books of other languages. Also, having more variables like genre and date published would improve the scope and accuracy. Year published could help us understand historical trends and possibly help predict future trends using machine learning algorithms.

Book Insights from Goodreads

Corinne Medeiros

8/8/2019

Introduction

Preview of the original data:

Data Cleaning Process

Adjusting column names and variable types to improve readability and future plotting:

Checking for and removing missing values:

Removing unneccessary columns from dataset:

I’m removing the isbn columns since they contain all unique values that won’t help me in this analysis.

Accounting for multiple authors:

Data Visualization:

First, I’m looking at how page count affects average rating. As page numbers increase, do ratings rise or fall?

Next, I’m going to look at language codes. What is the distribution like?

Since the majority is English, I’ll try filtering out the outliers.

This graph is very similar to the previous scatterplot, because the majority of books in this dataset are in fact English language books. The same conclusions apply.

Do books with a higher amount of reviews have higher ratings?

Multiple Regression Analysis:

The results from the multiple regression analysis summary reveal the low significance of the relationship between these variables in terms of predicting average rating.

Implications to the consumer:

Limitations and areas of improvement: