Even those who have no regular need for it are likely familiar with closed captioning from broadcast or cable television – the text-based visual representation of a program’s audio elements. While typically associated with making programming more accessible for hearing-impaired viewers, closed captioning also helps audiences follow a show’s dialogue and narration in noisy spaces like restaurants, or with the volume turned down in sound-sensitive environments.
As content consumption has increasingly moved online, the need for closed captioning has grown beyond traditional television to include streamed video. Public sector organizations now depend heavily on video streaming as a key component of their transparency initiatives, and closed captioning is essential in reaching the widest range of constituents while complying with the regulatory accessibility requirements of many jurisdictions.
Manual Transcripts – A Thing of the Past
Historically, the text that viewers see in closed captions was created entirely manually for insertion into the video signal or stream. Human transcribers typed the text into a captioning system based on what they heard. Many programs were captioned by a single transcriber, but redundant transcribers were sometimes used to ensure that nothing was missed, particularly for live programming in which the content could not be paused or rewound.
While manual processes are often thought of as error-prone, professional closed captioning transcribers are very proficient at their craft and typically deliver remarkably high accuracy – usually around 95%. However, manual closed captioning can be extremely expensive to produce, particularly for live content. In addition to the labour expense of hiring closed captioning experts or third-party services, robust communications infrastructure is needed to deliver the raw video and audio feeds to the service provider for captioning. Considering that multi-hour council meetings often run significantly longer than most entertainment programming, the costs of captioning can quickly add up.
Enter: Automation and Speech-to-Text
Before delving into the role that Artificial Intelligence (AI) can play in automated closed captioning workflows, it’s useful to consider what today’s “AI” really consists of. While sensationalized representations often imply machines “thinking” for themselves, AI at its core revolves around recognizing and processing patterns.
In 2011, long-running television game show Jeopardy pitted two of its all-time highest-winning champions against the IBM Watson system. Watson didn’t just win the matchup – it decimated the competition, compiling a score (measured in virtual dollars) more than triple that of the contestant who placed second. But Watson didn’t really “understand” the questions conceptually; instead, it matched and evaluated data patterns.
Similarly, basic speech-to-text analysis identifies a pattern based on what a spoken word sounds like, analyzes that pattern against its existing base of “knowledge”, and returns a text result based on what it believes the equivalent word is. This process has a certain degree of accuracy that continues to improve as speech-to-text algorithms evolve. However, with many words – or combinations of words – sounding very similar even to humans, analyzing individual words in isolation is not enough.
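The word-matching step described above can be sketched in a few lines. This is a deliberately simplified illustration, not how any real engine is implemented: actual systems compare acoustic feature vectors (such as MFCCs) against trained models, whereas here the “sound patterns” are small hand-made tuples invented for the example.

```python
# A minimal sketch of isolated-word recognition by pattern matching.
# The feature vectors below are hypothetical values for illustration only.

import math

# "Templates" the system has already learned: word -> acoustic pattern.
KNOWN_WORDS = {
    "council": (0.9, 0.2, 0.4),
    "counsel": (0.9, 0.2, 0.5),
    "meeting": (0.1, 0.8, 0.3),
}

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(sound):
    """Return the known word whose stored pattern is closest to the input."""
    return min(KNOWN_WORDS, key=lambda w: distance(KNOWN_WORDS[w], sound))

# A spoken sound nearly equidistant between "council" and "counsel":
print(recognize((0.9, 0.2, 0.44)))  # → council
```

The near-tie between “council” and “counsel” is exactly the problem the article raises: with soundalike words, analyzing each word in isolation is not enough.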
Context, Context, Context
Contextual analysis takes automated closed captioning to the next level, enabling AI to evaluate multiple possible words and determine which ones make the most sense in context. The system weights each candidate by how likely it is given the surrounding words, then chooses the most probable sequence out of all of the combinations it identified.
Between speech-to-text techniques and contextual analysis, automated closed captioning can rival human capabilities. Of course, the AI may make many of the same mistakes that humans do – particularly when the most accurate result is not the most likely one. For example, AI would probably get the famously misheard Jimi Hendrix lyric wrong just as many people did, since “kiss this guy” may be considered a more common and logical combination of words than “kiss the sky”.
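The Hendrix mishearing can be reproduced with a toy version of this contextual weighting. In the sketch below, each audio slot has candidate words with acoustic scores, and bigram probabilities stand in for the “surrounding words” evidence; all of the probability values are assumed for illustration, not drawn from any real language model.

```python
# Toy contextual disambiguation: score every candidate word sequence by
# acoustic evidence times bigram (word-pair) probability, then pick the best.

from itertools import product

# Candidate words per position with (assumed) acoustic likelihoods.
candidates = [
    {"kiss": 0.9},
    {"the": 0.5, "this": 0.5},
    {"sky": 0.5, "guy": 0.5},
]

# Hypothetical bigram probabilities "learned" from everyday text.
bigram = {
    ("kiss", "the"): 0.4, ("kiss", "this"): 0.6,
    ("the", "sky"): 0.7, ("the", "guy"): 0.3,
    ("this", "sky"): 0.05, ("this", "guy"): 0.95,
}

def score(seq):
    s = 1.0
    for i, word in enumerate(seq):
        s *= candidates[i][word]             # acoustic evidence
        if i > 0:
            s *= bigram[(seq[i - 1], word)]  # contextual evidence
    return s

best = max(product(*(c.keys() for c in candidates)), key=score)
print(" ".join(best))  # → kiss this guy
```

With these assumed numbers, the everyday phrase out-scores the actual lyric – the same “most likely beats most accurate” failure mode described above.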
Machine Learning to the Rescue
Specific names can also be problematic, both for AI and for humans. For example, a place name such as “Oshawa” can be challenging for closed captioning; if you don’t live near there or even know it’s a city, the word doesn’t mean anything to you. AI might not even know that it’s a proper name, thus defeating contextual analysis.
The solution to this problem is to “train” the system to recognize particular words. Just as people can learn names and what they mean, advances in deep machine learning systems have enabled the development of “trainable” closed captioning engines that can master proper names and the particularities of pronunciation. One method of such training is to first feed the system an existing recording (such as a previous council meeting) with an accurate manual transcription of that content; the system can then correlate the recording’s sounds with the transcript, effectively ‘learning’ the words.
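One piece of that training loop – extracting vocabulary from accurate transcripts of past recordings so a name like “Oshawa” becomes a known candidate – can be sketched as follows. This is a minimal illustration of the idea only; real trainable engines also align the transcript against the audio, which is omitted here, and the sample transcripts are invented.

```python
# Building a vocabulary from manually transcribed past meetings, so that
# proper names gain realistic probabilities during later recognition.

from collections import Counter

def build_vocabulary(transcripts):
    """Return word -> relative frequency across the given transcripts."""
    counts = Counter()
    for text in transcripts:
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical transcripts of previous council meetings.
past_meetings = [
    "the oshawa council meeting will now come to order",
    "oshawa residents may address the council",
]

vocab = build_vocabulary(past_meetings)
# "oshawa" is now a known word with a learned probability, so it can
# out-score soundalike guesses instead of defeating contextual analysis.
print(round(vocab["oshawa"], 3))
```

Feeding in more meetings would sharpen these frequencies further, which is why captioning a recurring event like a council meeting benefits so much from training on its own archive.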
Automated closed captio