s1Raw <- read.subtitles.season("Bojack SRT/Bojack S1 SRT/") %>% subDataFrame()
## Read: 12 episodes
s2Raw <- read.subtitles.season("Bojack SRT/Bojack S2 SRT/") %>% subDataFrame()
## Read: 12 episodes
s3Raw <- read.subtitles.season("Bojack SRT/Bojack S3 SRT/") %>% subDataFrame()
## Read: 12 episodes
s4Raw <- read.subtitles.season("Bojack SRT/Bojack S4 SRT/") %>% subDataFrame()
## Read: 12 episodes
all_s_Raw <- rbind(s1Raw, s2Raw, s3Raw, s4Raw) %>%
  filter(!str_detect(Text, "7ed")) %>%       # Removes the sync information.
  filter(!str_detect(Text, "[0-9]x")) %>%    # Removes the episode title lines.
  rename(Line = ID) %>%                      # Avoids Excel flagging the CSV as corrupted when column 1 is named "ID".
  mutate(Line = str_remove_all(Line, "\\D")) # Keeps only the digits of each line number.
all_s_Raw %>% write_excel_csv("ALL_SEASONS_SRT_RAW.csv") # Suitable for transcription of speakers and listeners
This gives us a raw CSV file of the dialogue. The first 20 rows look like this:
Line | Timecode.in | Timecode.out | Text | season | season\_num | episode\_num |
---|---|---|---|---|---|---|
1 | 00:00:08.335 | 00:00:12.882 | Horsin’ Around is filmed before a live studio audience. | Bojack S1 SRT | 1 | 1 |
2 | 00:00:12.965 | 00:00:15.134 | Mondays. | Bojack S1 SRT | 1 | 1 |
3 | 00:00:15.218 | 00:00:18.012 | Well, good morning to you too. | Bojack S1 SRT | 1 | 1 |
4 | 00:00:18.096 | 00:00:19.346 | Oh, hey. | Bojack S1 SRT | 1 | 1 |
5 | 00:00:19.430 | 00:00:21.182 | Where? I’d love hay. | Bojack S1 SRT | 1 | 1 |
6 | 00:00:24.393 | 00:00:27.438 | In 1987, the situation comedy Horsin’ Around | Bojack S1 SRT | 1 | 1 |
7 | 00:00:27.521 | 00:00:28.898 | premiered on ABC. | Bojack S1 SRT | 1 | 1 |
8 | 00:00:28.981 | 00:00:31.109 | The show, in which a young, bachelor horse | Bojack S1 SRT | 1 | 1 |
9 | 00:00:31.192 | 00:00:33.319 | is forced to reevaluate his priorities | Bojack S1 SRT | 1 | 1 |
10 | 00:00:33.402 | 00:00:35.738 | when he agrees to raise three human children, | Bojack S1 SRT | 1 | 1 |
11 | 00:00:35.821 | 00:00:37.615 | was initially dismissed by critics | Bojack S1 SRT | 1 | 1 |
12 | 00:00:37.698 | 00:00:41.202 | as broad and saccharine and not good, | Bojack S1 SRT | 1 | 1 |
13 | 00:00:41.286 | 00:00:43.912 | but the family comedy struck a chord with America | Bojack S1 SRT | 1 | 1 |
14 | 00:00:43.995 | 00:00:46.164 | and went on to air for nine seasons. | Bojack S1 SRT | 1 | 1 |
15 | 00:00:46.248 | 00:00:49.209 | The star of Horsin’ Around, BoJack Horseman, | Bojack S1 SRT | 1 | 1 |
16 | 00:00:49.293 | 00:00:50.418 | is our guest tonight. | Bojack S1 SRT | 1 | 1 |
17 | 00:00:50.501 | 00:00:51.669 | Welcome, BoJack. | Bojack S1 SRT | 1 | 1 |
18 | 00:00:51.753 | 00:00:53.213 | It is good to be here, Charlie. | Bojack S1 SRT | 1 | 1 |
19 | 00:00:53.297 | 00:00:54.577 | Sorry I was late. The traffic… | Bojack S1 SRT | 1 | 1 |
20 | 00:00:54.631 | 00:00:55.924 | It’s really no problem. | Bojack S1 SRT | 1 | 1 |
At this point, the project diverges into two paths:
The first will be a holistic analysis of the raw text, without assigning lines to specific characters. This can tell us general things about the show, for example whether sentiment follows a trend from one episode, or one season, to the next.
The second will be to manually transcribe the dialogue, recording the sender and receiver(s) of each line. This will be time-consuming, as there is no simple way to do it unless I can find scripts, rather than SRT files, for the show. From this, I intend to develop a Shiny app that lets a user examine each character’s interactions by episode and season, looking at the frequency and sentiment of interactions with other characters.
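To make the first path more concrete, here is a minimal sketch of how a per-episode sentiment score might be computed. It is an illustration under assumptions, not the analysis itself: the toy data frame stands in for `all_s_Raw`, its rows are invented, and the choice of the Bing lexicon and of a simple "positive minus negative" score are assumptions of mine rather than decisions already made in this project.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for all_s_Raw: same column names, invented rows.
toy_lines <- tibble::tibble(
  Text        = c("It is good to be here.", "Sorry I was late.",
                  "What a wonderful day.", "This is terrible."),
  season_num  = c(1, 1, 1, 1),
  episode_num = c(1, 1, 2, 2)
)

episode_sentiment <- toy_lines %>%
  unnest_tokens(word, Text) %>%                        # one lowercased word per row
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words found in the lexicon
  group_by(season_num, episode_num) %>%
  summarise(
    net = sum(sentiment == "positive") - sum(sentiment == "negative"),
    .groups = "drop"
  )

print(episode_sentiment)
```

Plotting `net` against episode order would then show whether any trend exists. Words missing from the lexicon are dropped silently, so an episode with no matches would simply not appear in the result.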
The next page will be a simple frequency analysis of terms in the show, to see whether the most basic application of tidytext can generate useful information from a raw text dataset.
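As a rough preview of what such a frequency analysis involves, the sketch below counts words in a handful of lines copied from the table above. The object names (`sample_lines`, `word_counts`) are hypothetical, and removing stop words with tidytext's built-in `stop_words` list is an assumption about how the analysis might proceed.

```r
library(dplyr)
library(tidytext)

# A few lines taken from the sample table; the real input would be all_s_Raw.
sample_lines <- tibble::tibble(
  Text = c("Oh, hey.",
           "Where? I'd love hay.",
           "Welcome, BoJack.",
           "It is good to be here, Charlie.")
)

word_counts <- sample_lines %>%
  unnest_tokens(word, Text) %>%             # split into lowercased words
  anti_join(stop_words, by = "word") %>%    # drop common function words
  count(word, sort = TRUE)                  # most frequent terms first

print(word_counts)
```

On the full dataset the same three steps would apply unchanged, with `all_s_Raw` in place of `sample_lines`.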