Complex Data Exercise

Author

Cassia Roth

Published

April 11, 2024

This exercise is designed to give you practice exploring text datasets (a complex data type). We will load and process the data, then carry out some descriptive/exploratory analysis.

The dataset comes from the sherlock R package, which contains the text of the Sherlock Holmes novels and short stories by Sir Arthur Conan Doyle. All 48 texts are in the public domain. Information regarding copyright laws is available here.

I found this dataset through Emil Hvitfeldt’s R-text-data compilation repository. I am also using Paul Vanderlaken’s website as guidance.

First, we will install and load the dataset.

#suppressing log messages
#| message: false
#| warning: false
#| include: false

#installing dataset and other needed packages
devtools::install_github("EmilHvitfeldt/sherlock")

Skipping install of 'sherlock' from a github remote, the SHA1 (38584034) has not changed since last install.
  Use `force = TRUE` to force installation

#loading dataset and other necessary packages
library(sherlock)
library(tidytext)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ ggplot2   3.4.4     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(stringr)
library(tibble)
library(knitr)
library(here)

here() starts at /Users/cassiaroth/Documents/GitHub/MADARoth/cassiaroth-MADA-portfolio

Next, we will explore the dataset. We will start by looking at the structure.

#Exploring the dataset
dplyr::glimpse(holmes)

Rows: 65,958
Columns: 2
$ text <chr> "A STUDY IN SCARLET", "", "Table of contents", "", "Part I", "Mr.…
$ book <chr> "A Study In Scarlet", "A Study In Scarlet", "A Study In Scarlet",…

#Viewing the first few rows (variable names "text" and "book")
head(holmes)

# A tibble: 6 × 2
  text                  book              
  <chr>                 <chr>             
1 "A STUDY IN SCARLET"  A Study In Scarlet
2 ""                    A Study In Scarlet
3 "Table of contents"   A Study In Scarlet
4 ""                    A Study In Scarlet
5 "Part I"              A Study In Scarlet
6 "Mr. Sherlock Holmes" A Study In Scarlet

#Understanding all books in the data
book_titles <- holmes %>% distinct(book)
print(book_titles)

# A tibble: 48 × 1
   book                               
   <chr>                              
 1 A Study In Scarlet                 
 2 The Sign of the Four               
 3 A Scandal in Bohemia               
 4 The Red-Headed League              
 5 A Case of Identity                 
 6 The Boscombe Valley Mystery        
 7 The Five Orange Pips               
 8 The Man with the Twisted Lip       
 9 The Adventure of the Blue Carbuncle
10 The Adventure of the Speckled Band 
# ℹ 38 more rows

#Ordering the entries alphabetically by book title
book_titles <- book_titles[order(book_titles$book), ]

#Creating a table using kable
table1 <- kable(book_titles, caption = "Book Titles")
print(table1)



Table: Book Titles

|book                                        |
|:-------------------------------------------|
|A Case of Identity                          |
|A Scandal in Bohemia                        |
|A Study In Scarlet                          |
|His Last Bow                                |
|Silver Blaze                                |
|The "Gloria Scott"                          |
|The Adventure of Black Peter                |
|The Adventure of Charles Augustus Milverton |
|The Adventure of the Abbey Grange           |
|The Adventure of the Beryl Coronet          |
|The Adventure of the Blue Carbuncle         |
|The Adventure of the Bruce-Partington Plans |
|The Adventure of the Cardboard Box          |
|The Adventure of the Copper Beeches         |
|The Adventure of the Dancing Men            |
|The Adventure of the Devil's Foot           |
|The Adventure of the Dying Detective        |
|The Adventure of the Empty House            |
|The Adventure of the Engineer's Thumb       |
|The Adventure of the Golden Pince-Nez       |
|The Adventure of the Missing Three-Quarter  |
|The Adventure of the Noble Bachelor         |
|The Adventure of the Norwood Builder        |
|The Adventure of the Priory School          |
|The Adventure of the Red Circle             |
|The Adventure of the Second Stain           |
|The Adventure of the Six Napoleons          |
|The Adventure of the Solitary Cyclist       |
|The Adventure of the Speckled Band          |
|The Adventure of the Three Students         |
|The Adventure of Wisteria Lodge             |
|The Boscombe Valley Mystery                 |
|The Crooked Man                             |
|The Disappearance of Lady Frances Carfax    |
|The Final Problem                           |
|The Five Orange Pips                        |
|The Greek Interpreter                       |
|The Hound of the Baskervilles               |
|The Man with the Twisted Lip                |
|The Musgrave Ritual                         |
|The Naval Treaty                            |
|The Red-Headed League                       |
|The Reigate Squires                         |
|The Resident Patient                        |
|The Sign of the Four                        |
|The Stock-Broker's Clerk                    |
|The Valley Of Fear                          |
|The Yellow Face                             |
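Beyond listing the titles, it can help to know how much text each book contributes. The following is a minimal sketch of that idea, assuming the same two-column layout as `holmes` (one row per line of text, plus a `book` column). The `toy` tibble and the column names `n_lines` and `n_nonempty` are made up for illustration; swapping `toy` for `holmes` should give per-book counts for the real data.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `holmes` (hypothetical data): one row per line of text
toy <- tibble(
  text = c("A STUDY IN SCARLET", "", "Part I", "THE SIGN OF THE FOUR", ""),
  book = c(rep("A Study In Scarlet", 3), rep("The Sign of the Four", 2))
)

# Count all lines and non-empty lines for each book
lines_per_book <- toy %>%
  group_by(book) %>%
  summarise(
    n_lines    = n(),             # every row, including blank lines
    n_nonempty = sum(text != ""), # rows that actually contain text
    .groups = "drop"
  )

print(lines_per_book)
```

Counting non-empty lines separately is useful here because the first rows of `holmes` shown above include several empty strings.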

Now let’s see how many times the names of some of the most important characters appear in the texts.

#Searching for specific words
specific_words <- c("sherlock", "holmes", "moriarty", "watson", "john")

#Creating a word frequency table for specific words
word_freq_table <- holmes %>%
  unnest_tokens(word, text) %>%
  filter(word %in% specific_words) %>%
  count(word, sort = TRUE)

word_freq_df <- as.data.frame(word_freq_table)
print(word_freq_df)

      word    n
1   holmes 2403
2   watson  809
3 sherlock  383
4     john  118
5 moriarty   49
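The search terms above are written in lowercase because `unnest_tokens()` lowercases tokens and strips punctuation by default. Here is a small sketch of that behavior on invented sentences (the `toy` tibble is hypothetical and not taken from the books):

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical two-line example with mixed case and punctuation
toy <- tibble(
  text = c("Holmes looked at Watson.", "\"HOLMES!\" cried Watson."),
  book = "Toy Book"
)

# One row per word; tokens come back lowercased with punctuation removed
tokens <- toy %>% unnest_tokens(word, text)

# Lowercase search terms match both "Holmes" and "HOLMES!"
name_counts <- tokens %>%
  filter(word %in% c("holmes", "watson")) %>%
  count(word, sort = TRUE)

# A capitalized search term matches nothing after tokenization
capitalized_hits <- tokens %>% filter(word == "Holmes")

print(name_counts)
print(nrow(capitalized_hits))
```

If the original capitalization matters for an analysis, `unnest_tokens()` accepts `to_lower = FALSE`, in which case the search terms would need to match the original case.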

Many people mistakenly attribute the quote “Elementary, my dear Watson” to the Sherlock Holmes series, so let’s check how many times the phrase appears in the data. For fun, we will also see whether anyone in the texts ever refers to Holmes’s sidekick by his full name, “John Watson.”

#Searching for phrases about Watson
phrases <- c("John Watson", "Elementary, my dear Watson")

holmes_filtered <- holmes %>%
  filter(str_detect(text, fixed(phrases[1])) | 
         str_detect(text, fixed(phrases[2])))

#Viewing the filtered dataset
print(holmes_filtered)

# A tibble: 2 × 2
  text                                                     book              
  <chr>                                                    <chr>             
1 A Continuation Of The Reminiscences Of John Watson, M.D. A Study In Scarlet
2 A Continuation Of The Reminiscences Of John Watson, M.D. A Study In Scarlet

We can see that “John Watson” appears only twice, both times in the same section heading of A Study in Scarlet (“A Continuation Of The Reminiscences Of John Watson, M.D.”). The phrase “Elementary, my dear Watson” appears zero times in this line-by-line search, which supports the view that Sherlock Holmes never says it in the original texts.
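One caveat to the search above: each row of `holmes` appears to hold a single line of text, so a phrase that wraps across two lines would not be detected. A possible cross-check, sketched here on an invented example, is to paste each book’s lines together before searching. The `toy` tibble, its sentence, and the column names `full_text` and `n_hits` are hypothetical and not taken from the books.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical text in which the phrase is split across two lines
toy <- tibble(
  text = c("\"Elementary, my dear", "Watson,\" said a voice.", "Another line."),
  book = "Toy Book"
)

phrase <- "Elementary, my dear Watson"

# Line-by-line search: misses the phrase because it spans two rows
line_hits <- toy %>%
  filter(str_detect(text, fixed(phrase)))

# Collapse each book into one string, then count case-insensitive matches
book_hits <- toy %>%
  group_by(book) %>%
  summarise(full_text = str_c(text, collapse = " "), .groups = "drop") %>%
  mutate(n_hits = str_count(full_text, fixed(phrase, ignore_case = TRUE)))

print(nrow(line_hits))
print(book_hits$n_hits)
```

Running the same collapse-then-count step on `holmes` would show whether the zero-match result holds once line breaks are taken out of the picture.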

Now let’s visualize some of the words in a word cloud using the wordcloud2 package.

# Install and load the wordcloud2 package
# install.packages("wordcloud2")
library(wordcloud2)

# Load stopwords
stop_words <- tidytext::stop_words$word

# Remove stopwords from the text data
holmes_wc <- holmes %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words)

# Create word frequency table
word_freq_table <- holmes_wc %>%
  count(word, sort = TRUE)

# Visualize word cloud
wordcloud2(word_freq_table, size = 1.5)

We can see that “Holmes” is the most frequent word in the dataset, appearing 2,403 times. This is expected, as the dataset is about Sherlock Holmes. “Watson” (809 times) is also quite frequent, as he narrates most of the stories. Other words that stood out to me include “London” (340 times), “police” (283 times), and “Sherlock” (383 times).
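Because the word cloud is interactive, exact counts like the ones quoted above are easiest to read from the frequency table itself. The sketch below shows one way to look up specific words after stop-word removal. It uses an invented `toy` tibble, and the names `toy_freq` and `lookup` are hypothetical; with the real data, the same `filter()` could be applied to `word_freq_table`.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical two-line example
toy <- tibble(
  text = c("Holmes walked in London", "the police of London"),
  book = "Toy Book"
)

# Tokenize, drop stop words, and count, mirroring the steps used for the word cloud
toy_freq <- toy %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% tidytext::stop_words$word) %>%
  count(word, sort = TRUE)

# Look up the counts for the words of interest (lowercase, as tokens are lowercased)
lookup <- toy_freq %>%
  filter(word %in% c("london", "police"))

print(lookup)
```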