This exercise is designed to give you practice exploring text datasets (a complex data type). We will load and process the data, then carry out some descriptive/exploratory analysis.
The data come from the Sherlock package, which includes text from the Sherlock Holmes book series by Sir Arthur Conan Doyle; it is used below as the object holmes. All 48 texts are in the public domain. Information regarding copyright laws is available here.
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
here() starts at /Users/cassiaroth/Documents/GitHub/MADARoth/cassiaroth-MADA-portfolio
Next, we will explore the dataset. We will start by looking at the structure.
# Exploring the dataset
dplyr::glimpse(holmes)
Rows: 65,958
Columns: 2
$ text <chr> "A STUDY IN SCARLET", "", "Table of contents", "", "Part I", "Mr.…
$ book <chr> "A Study In Scarlet", "A Study In Scarlet", "A Study In Scarlet",…
# Viewing the first few rows (variable names "text" and "book")
head(holmes)
# A tibble: 6 × 2
text book
<chr> <chr>
1 "A STUDY IN SCARLET" A Study In Scarlet
2 "" A Study In Scarlet
3 "Table of contents" A Study In Scarlet
4 "" A Study In Scarlet
5 "Part I" A Study In Scarlet
6 "Mr. Sherlock Holmes" A Study In Scarlet
# Understanding all books in the data
book_titles <- holmes %>%
  distinct(book)
print(book_titles)
# A tibble: 48 × 1
book
<chr>
1 A Study In Scarlet
2 The Sign of the Four
3 A Scandal in Bohemia
4 The Red-Headed League
5 A Case of Identity
6 The Boscombe Valley Mystery
7 The Five Orange Pips
8 The Man with the Twisted Lip
9 The Adventure of the Blue Carbuncle
10 The Adventure of the Speckled Band
# ℹ 38 more rows
# Ordering the entries alphabetically by book title
book_titles <- book_titles[order(book_titles$book), ]
# Creating a table using kable
table1 <- kable(book_titles, caption = "Book Titles")
print(table1)
Table: Book Titles
|book |
|:-------------------------------------------|
|A Case of Identity |
|A Scandal in Bohemia |
|A Study In Scarlet |
|His Last Bow |
|Silver Blaze |
|The "Gloria Scott" |
|The Adventure of Black Peter |
|The Adventure of Charles Augustus Milverton |
|The Adventure of the Abbey Grange |
|The Adventure of the Beryl Coronet |
|The Adventure of the Blue Carbuncle |
|The Adventure of the Bruce-Partington Plans |
|The Adventure of the Cardboard Box |
|The Adventure of the Copper Beeches |
|The Adventure of the Dancing Men |
|The Adventure of the Devil's Foot |
|The Adventure of the Dying Detective |
|The Adventure of the Empty House |
|The Adventure of the Engineer's Thumb |
|The Adventure of the Golden Pince-Nez |
|The Adventure of the Missing Three-Quarter |
|The Adventure of the Noble Bachelor |
|The Adventure of the Norwood Builder |
|The Adventure of the Priory School |
|The Adventure of the Red Circle |
|The Adventure of the Second Stain |
|The Adventure of the Six Napoleons |
|The Adventure of the Solitary Cyclist |
|The Adventure of the Speckled Band |
|The Adventure of the Three Students |
|The Adventure of Wisteria Lodge |
|The Boscombe Valley Mystery |
|The Crooked Man |
|The Disappearance of Lady Frances Carfax |
|The Final Problem |
|The Five Orange Pips |
|The Greek Interpreter |
|The Hound of the Baskervilles |
|The Man with the Twisted Lip |
|The Musgrave Ritual |
|The Naval Treaty |
|The Red-Headed League |
|The Reigate Squires |
|The Resident Patient |
|The Sign of the Four |
|The Stock-Broker's Clerk |
|The Valley Of Fear |
|The Yellow Face |
Now let’s see how many times the names of some of the most important characters appear in the texts.
# Searching for specific words
specific_words <- c("sherlock", "holmes", "moriarty", "watson", "john")
# Creating a word frequency table for specific words
word_freq_table <- holmes %>%
  unnest_tokens(word, text) %>%
  filter(word %in% specific_words) %>%
  count(word, sort = TRUE)
word_freq_df <- as.data.frame(word_freq_table)
print(word_freq_df)
word n
1 holmes 2403
2 watson 809
3 sherlock 383
4 john 118
5 moriarty 49
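To make the counting step concrete, here is a minimal sketch on a made-up two-row tibble. The `toy` data and its contents are assumptions for illustration only and are not part of the Sherlock data. It shows that `unnest_tokens()` lowercases tokens and drops punctuation by default, which is why the search terms above are written in lowercase.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for `holmes` (same column names)
toy <- tibble(
  text = c("Holmes smiled. \"Watson,\" said Holmes.", "Sherlock Holmes rose."),
  book = "Toy Book"
)

specific_words <- c("sherlock", "holmes", "watson")

toy_counts <- toy %>%
  unnest_tokens(word, text) %>%          # one lowercase token per row
  filter(word %in% specific_words) %>%   # keep only the names of interest
  count(word, sort = TRUE)               # tally, most frequent first

print(toy_counts)
```

Searching for "Holmes" with a capital H in `specific_words` would return nothing here, because the tokens have already been lowercased.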
Many people mistakenly attribute the quote “Elementary, my dear Watson” to the Sherlock Holmes series, so let’s check how many times the phrase appears in the data. For fun, we will also check whether Watson is ever referred to by his full name, “John Watson.”
# Searching for phrases about Watson
phrases <- c("John Watson", "Elementary, my dear Watson")
holmes_filtered <- holmes %>%
  filter(str_detect(text, fixed(phrases[1])) |
           str_detect(text, fixed(phrases[2])))
# Viewing the filtered dataset
print(holmes_filtered)
# A tibble: 2 × 2
text book
<chr> <chr>
1 A Continuation Of The Reminiscences Of John Watson, M.D. A Study In Scarlet
2 A Continuation Of The Reminiscences Of John Watson, M.D. A Study In Scarlet
We can see that “John Watson” appears in only two rows, both containing the same section title from A Study in Scarlet (“A Continuation Of The Reminiscences Of John Watson, M.D.”). The phrase “Elementary, my dear Watson” appears zero times, which supports the view that it is a myth that Sherlock Holmes ever said this phrase in the original texts.
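One caveat about this kind of search: it is case-sensitive and checks each row (one line of text) separately, so a phrase that wraps across two rows would not be found. The sketch below shows one possible way to guard against that. It uses a made-up tibble (`toy` and its text are hypothetical), collapses each book into a single string, and matches case-insensitively. It is an illustration under those assumptions, not a re-analysis of the Sherlock data.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: the phrase is split across the first two rows
toy <- tibble(
  text = c("\"Elementary, my dear", "Watson,\" he never said.", "Another line."),
  book = "Toy Book"
)

phrase <- "elementary, my dear watson"

# Row-by-row search: misses the phrase because it wraps onto the next row
line_hits <- toy %>%
  filter(str_detect(text, fixed(phrase, ignore_case = TRUE)))

# Collapse each book into one string, then count occurrences of the phrase
book_hits <- toy %>%
  group_by(book) %>%
  summarise(text = str_squish(str_c(text, collapse = " ")), .groups = "drop") %>%
  mutate(n_phrase = str_count(text, fixed(phrase, ignore_case = TRUE)))

nrow(line_hits)
book_hits$n_phrase
```

Applying the same collapse-then-count idea to `holmes` would be a stricter test of whether the phrase really is absent from the texts.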
Now let’s visualize some of the words in a word cloud using the wordcloud2 package.
# Install and load the wordcloud2 package
# install.packages("wordcloud2")
library(wordcloud2)
# Load stopwords
stop_words <- tidytext::stop_words$word
# Remove stopwords from the text data
holmes_wc <- holmes %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words)
# Create word frequency table
word_freq_table <- holmes_wc %>%
  count(word, sort = TRUE)
# Visualize word cloud
wordcloud2(word_freq_table, size = 1.5)
We can see that “Holmes” is the most frequent word in the dataset, appearing 2,403 times. This is expected, since the stories are about Sherlock Holmes. “Watson” (809 times) is also quite frequent, as he is the narrator of most of the stories. Other words that stood out to me include “Sherlock” (383 times), “London” (340 times), and “police” (283 times).
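Exact counts like these are easier to read from the frequency table than from the word cloud itself. The following is a minimal sketch of that lookup on made-up data (the `toy` tibble is hypothetical). It removes stop words with `anti_join()` against `tidytext::stop_words`, a common alternative to the `filter(!word %in% ...)` approach, and then pulls out the counts for a few words of interest.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for `holmes`
toy <- tibble(
  text = c("Holmes went to London.", "The police in London knew Holmes."),
  book = "Toy Book"
)

toy_freq <- toy %>%
  unnest_tokens(word, text) %>%
  anti_join(tidytext::stop_words, by = "word") %>%  # drop common words such as "the", "to"
  count(word, sort = TRUE)

# Look up exact counts for words of interest
toy_freq %>%
  filter(word %in% c("holmes", "london", "police"))
```

The explicit `tidytext::stop_words` reference is deliberate: the code above reassigns `stop_words` to a character vector, and `anti_join()` needs the original data frame with a `word` column.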