Bojack-Tidytext-Analysis

Importing Data

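library(subtools)  # read.subtitles.season(), subDataFrame()
library(tidyverse) # %>%, filter(), rename(), mutate(), str_detect(), write_excel_csv()

s1Raw <- read.subtitles.season("Bojack SRT/Bojack S1 SRT/") %>% subDataFrame()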
## Read: 12 episodes
s2Raw <- read.subtitles.season("Bojack SRT/Bojack S2 SRT/") %>% subDataFrame()
## Read: 12 episodes
s3Raw <- read.subtitles.season("Bojack SRT/Bojack S3 SRT/") %>% subDataFrame()
## Read: 12 episodes
s4Raw <- read.subtitles.season("Bojack SRT/Bojack S4 SRT/") %>% subDataFrame()
## Read: 12 episodes
all_s_Raw <- rbind(s1Raw, s2Raw, s3Raw, s4Raw) %>%
  filter(!str_detect(Text, "7ed")) %>% # Removes the sync information
  filter(!str_detect(Text, "[0-9]x")) %>% # Removes the episode title lines
  rename(Line = ID) %>% # Excel treats a file that begins with "ID" as SYLK and reports the CSV as corrupted
  mutate(Line = str_remove_all(Line, "\\D")) # Keeps only the digits of the line number

all_s_Raw %>% write_excel_csv("ALL_SEASONS_SRT_RAW.csv") # Suitable for transcription of speakers and listeners

This gives us a raw CSV file of the dialogue:

| Line | Timecode.in | Timecode.out | Text | season | season\_num | episode\_num |
|---|---|---|---|---|---|---|
| 1 | 00:00:08.335 | 00:00:12.882 | Horsin’ Around is filmed before a live studio audience. | Bojack S1 SRT | 1 | 1 |
| 2 | 00:00:12.965 | 00:00:15.134 | Mondays. | Bojack S1 SRT | 1 | 1 |
| 3 | 00:00:15.218 | 00:00:18.012 | Well, good morning to you too. | Bojack S1 SRT | 1 | 1 |
| 4 | 00:00:18.096 | 00:00:19.346 | Oh, hey. | Bojack S1 SRT | 1 | 1 |
| 5 | 00:00:19.430 | 00:00:21.182 | Where? I’d love hay. | Bojack S1 SRT | 1 | 1 |
| 6 | 00:00:24.393 | 00:00:27.438 | In 1987, the situation comedy Horsin’ Around | Bojack S1 SRT | 1 | 1 |
| 7 | 00:00:27.521 | 00:00:28.898 | premiered on ABC. | Bojack S1 SRT | 1 | 1 |
| 8 | 00:00:28.981 | 00:00:31.109 | The show, in which a young, bachelor horse | Bojack S1 SRT | 1 | 1 |
| 9 | 00:00:31.192 | 00:00:33.319 | is forced to reevaluate his priorities | Bojack S1 SRT | 1 | 1 |
| 10 | 00:00:33.402 | 00:00:35.738 | when he agrees to raise three human children, | Bojack S1 SRT | 1 | 1 |
| 11 | 00:00:35.821 | 00:00:37.615 | was initially dismissed by critics | Bojack S1 SRT | 1 | 1 |
| 12 | 00:00:37.698 | 00:00:41.202 | as broad and saccharine and not good, | Bojack S1 SRT | 1 | 1 |
| 13 | 00:00:41.286 | 00:00:43.912 | but the family comedy struck a chord with America | Bojack S1 SRT | 1 | 1 |
| 14 | 00:00:43.995 | 00:00:46.164 | and went on to air for nine seasons. | Bojack S1 SRT | 1 | 1 |
| 15 | 00:00:46.248 | 00:00:49.209 | The star of Horsin’ Around, BoJack Horseman, | Bojack S1 SRT | 1 | 1 |
| 16 | 00:00:49.293 | 00:00:50.418 | is our guest tonight. | Bojack S1 SRT | 1 | 1 |
| 17 | 00:00:50.501 | 00:00:51.669 | Welcome, BoJack. | Bojack S1 SRT | 1 | 1 |
| 18 | 00:00:51.753 | 00:00:53.213 | It is good to be here, Charlie. | Bojack S1 SRT | 1 | 1 |
| 19 | 00:00:53.297 | 00:00:54.577 | Sorry I was late. The traffic… | Bojack S1 SRT | 1 | 1 |
| 20 | 00:00:54.631 | 00:00:55.924 | It’s really no problem. | Bojack S1 SRT | 1 | 1 |

At this point, the project diverges into two paths:

The first is a holistic analysis of the raw text, without assigning lines to specific characters. This can reveal general information about the show, for example whether sentiment follows a trend from one episode, or one season, to the next.
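As a rough illustration of this first path, the sketch below scores sentiment per episode with tidytext. It is a minimal sketch under stated assumptions, not the analysis itself: `toy_lines` is a hypothetical stand-in for `all_s_Raw` that reuses a few subtitle lines from the table above (the second episode number is invented so there is something to compare), and the Bing lexicon is one possible choice of sentiment dictionary.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for all_s_Raw, using the column names from the CSV.
# The episode_num = 2 rows are invented purely for illustration.
toy_lines <- tibble(
  Line        = 1:4,
  Text        = c("Well, good morning to you too.",
                  "was initially dismissed by critics",
                  "It is good to be here, Charlie.",
                  "Sorry I was late."),
  season_num  = c(1, 1, 1, 1),
  episode_num = c(1, 1, 2, 2)
)

episode_sentiment <- toy_lines %>%
  unnest_tokens(word, Text) %>%                        # one row per word
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep words found in the lexicon
  group_by(season_num, episode_num) %>%
  summarise(
    net = sum(sentiment == "positive") - sum(sentiment == "negative"),
    .groups = "drop"
  )

episode_sentiment
```

Plotting `net` against episode order would then show whether sentiment drifts across a season.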

The second is to manually transcribe the dialogue, recording the sender and receivers of each line. This will be time-consuming: SRT files do not identify speakers, so there is no simple way to do this unless I can find scripts for the show. From this, I intend to develop a Shiny app that lets a user examine each character’s interactions by episode and season, looking at the frequency and sentiment of interactions with other characters.
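One possible shape for the transcribed data is sketched below. The `Speaker` and `Listener` columns are hypothetical names, the speaker assignments are guesses made for illustration rather than real transcription, and each line is given a single listener for simplicity.

```r
library(dplyr)

# Hypothetical annotated lines: Speaker and Listener would be filled in by hand.
annotated <- tibble(
  Line        = 17:20,
  Text        = c("Welcome, BoJack.",
                  "It is good to be here, Charlie.",
                  "Sorry I was late. The traffic...",
                  "It's really no problem."),
  season_num  = 1,
  episode_num = 1,
  Speaker     = c("Charlie", "BoJack", "BoJack", "Charlie"),
  Listener    = c("BoJack", "Charlie", "Charlie", "BoJack")
)

# Number of lines each speaker directs at each listener, per episode
interactions <- annotated %>%
  count(season_num, episode_num, Speaker, Listener, name = "lines")

interactions
```

A table like `interactions` is the kind of summary a Shiny app could filter by character, episode, or season.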


The next page is a simple frequency analysis of terms in the show, to see whether the most basic application of tidytext can generate useful information from a raw text dataset.
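For orientation, a basic term-frequency count with tidytext might look like the sketch below. It assumes a `Text` column as in the CSV above and uses a hypothetical `toy_lines` data frame in place of the full dataset; the actual analysis may differ.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for all_s_Raw with a few lines from the sample above
toy_lines <- tibble(
  Text = c("Mondays.",
           "Well, good morning to you too.",
           "The show, in which a young, bachelor horse",
           "The star of Horsin' Around, BoJack Horseman,")
)

word_counts <- toy_lines %>%
  unnest_tokens(word, Text) %>%            # lowercase, one row per word
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the"
  count(word, sort = TRUE)                 # most frequent terms first

word_counts
```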