Most people who have watched broadcast or cable television are familiar with closed captioning – the text-based visual representation of a program’s audio – even if they don’t have a reason to use it regularly. While typically associated with making programming more accessible for hearing-impaired viewers, closed captioning also helps audiences follow a show’s dialogue in noisy spaces like restaurants, or with the volume turned down in sound-sensitive environments.
As content consumption has increasingly moved online, the need for closed captioning has grown beyond traditional television to include streamed video. Public sector organizations now depend heavily on video streaming as a key component of their transparency initiatives, and closed captioning is essential in reaching the widest range of constituents while complying with the regulatory accessibility requirements of many jurisdictions.
Manual Transcripts – A Thing of the Past
Historically, the text that viewers see in closed captions was created entirely manually for insertion into the video signal or stream. Human transcribers typed the text into a captioning system based on what they heard. Many programs were captioned by a single transcriber, but redundant transcribers would sometimes be used to ensure that nothing was missed, particularly for live programming in which the content could not be paused or rewound.
While manual processes are often thought of as error-prone, professional closed captioning transcribers are very proficient at their craft and typically deliver remarkably high accuracy – usually around 95%. However, manual closed captioning can be extremely expensive to produce, particularly for live content. In addition to the labour expense of hiring closed captioning experts or third-party services, robust communications infrastructure is also needed to deliver the raw video and audio feeds to the service provider for captioning. And since multi-hour council meetings often run far longer than most entertainment programming, the costs of captioning can quickly add up.
Enter: Automation and Speech-to-Text
Before delving into the role that Artificial Intelligence (AI) can play in automated closed captioning workflows, it’s useful to consider what today’s “AI” really consists of. While sensationalized representations often imply machines “thinking” for themselves, AI at its core revolves around recognizing and processing patterns.
In 2011, the long-running television game show Jeopardy! pitted two of its all-time highest-winning champions against the IBM Watson system. Watson didn’t just win the matchup – it decimated the competition, compiling a score (measured in virtual dollars) more than triple that of the contestant who placed second. But Watson didn’t really “understand” the questions conceptually; instead, it matched and evaluated data patterns.
Similarly, basic speech-to-text analysis identifies a pattern based on what a spoken word sounds like, analyzes that pattern against its existing base of “knowledge”, and returns a text result based on what it believes the equivalent word is. This process has a certain degree of accuracy that continues to improve as speech-to-text algorithms evolve. However, with many words – or combinations of words – sounding very similar even to humans, analyzing individual words in isolation is not enough.
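As a loose illustration – not how production recognizers actually work – the lookup step above can be sketched as a table mapping sound patterns to candidate spellings. The pronunciation keys and word lists here are invented for the example; real engines derive matches from acoustic features rather than hand-written tables:

```python
# Toy sketch: in isolation, one sound pattern can map to several candidate
# words. The pronunciation keys below are hand-written for illustration only;
# real speech-to-text systems match learned acoustic patterns instead.

sound_to_words = {
    "N-AY-T": ["night", "knight"],  # true homophones
    "T-UW":   ["to", "too", "two"],  # another classic collision
    "S-K-AY": ["sky"],               # unambiguous in this tiny vocabulary
}

def candidates(sound_pattern: str) -> list[str]:
    """Return every word the 'knowledge base' considers a possible match."""
    return sound_to_words.get(sound_pattern, [])

print(candidates("N-AY-T"))  # ['night', 'knight'] – context is needed to pick one
```

The homophone collision is exactly why, as noted above, analyzing individual words in isolation is not enough.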
Context, Context, Context
Contextual analysis takes automated closed captioning to the next level, enabling AI to evaluate multiple possible words to determine which ones make the most sense in context. The system can weight possible results with the probabilities of which ones are most likely based on surrounding words. The AI can then choose the most likely sequence of words out of all of the possible combinations it identified.
Between speech-to-text techniques and contextual analysis, automated closed captioning can rival human capabilities. The AI may, of course, make many of the same mistakes that humans do – particularly when the most accurate result is not the most likely. For example, AI would probably get the famously misheard Jimi Hendrix lyric wrong just as many people did, since “kiss this guy” may be considered a more common and logical combination of words than “kiss the sky”.
Machine Learning to the Rescue
Specific names can also be problematic, both for AI and for humans. For example, a place name such as “Oshawa” can be challenging for closed captioning; if you don’t live near Oshawa or know that it’s a city, the word means nothing to you. AI might not even recognize that it’s a proper name, thus defeating contextual analysis.
The solution to this problem is to “train” the system to recognize particular words. Just as people can learn names and what they mean, advances in deep machine learning systems have enabled the development of “trainable” closed captioning engines that can master proper names and the particularities of pronunciation. One method of such training is to first feed the system an existing recording (such as a previous council meeting) with an accurate manual transcription of that content; the system can then correlate the recording’s sounds with the transcript, effectively ‘learning’ the words.
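One small piece of that training idea can be sketched as follows: scan an accurate transcript of a past meeting for words the captioning vocabulary has never seen, then add them so future audio can match them. This is a simplification – real training also aligns the audio with the transcript so the engine learns each word’s pronunciation, which is omitted here – and the vocabulary and transcript are invented for the example:

```python
# Toy sketch of vocabulary learning from a transcribed recording.
# Real systems also correlate the audio with the transcript to learn
# pronunciation; this only shows the "collect unknown words" step.

known_vocabulary = {"the", "council", "meeting", "will", "come", "to", "order", "in"}

def learn_new_words(transcript: str, vocabulary: set[str]) -> set[str]:
    """Return words in the transcript the engine hasn't seen, and add them."""
    words = {w.strip(".,").lower() for w in transcript.split()}
    new_words = words - vocabulary
    vocabulary |= new_words  # 'learn' them for next time
    return new_words

transcript = "The council meeting in Oshawa will come to order."
print(learn_new_words(transcript, known_vocabulary))  # {'oshawa'}
```

After one pass over a transcribed meeting, “Oshawa” is part of the engine’s vocabulary rather than an unrecognizable sound.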
Automated closed captioning technologies are usually evaluated by scoring their accuracy as a percentage of words correctly chosen compared to the original spoken dialogue. In practice, however, such systems should be judged by the consumer experience – the satisfaction of the end viewer and whether they are able to properly understand the meaning of what was said. 95% accuracy by word count is meaningless if the captioning misses or mangles the one key word the viewer needs to understand what the speaker intended – and that can happen with either human or electronic transcription.
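For concreteness, the word-count score critiqued above is typically computed by aligning the caption against a reference transcript and counting matches. A minimal version using a word-level edit distance (the sample sentences are invented) illustrates how a single meaning-flipping error barely dents the score:

```python
# Sketch of the standard word-accuracy metric: align caption against
# reference with a word-level edit distance (substitutions, insertions,
# deletions) and report the fraction of words that survive.

def word_accuracy(reference: str, caption: str) -> float:
    ref, hyp = reference.lower().split(), caption.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1 - d[len(ref)][len(hyp)] / len(ref)

# One wrong word in ten still scores 90% – even though it inverts the meaning
ref = "council votes to approve the rezoning of waterfront parcel seven"
cap = "council votes to oppose the rezoning of waterfront parcel seven"
print(round(word_accuracy(ref, cap), 2))  # 0.9
```

A 90% score on a caption that says the opposite of what the speaker meant is exactly why the viewer’s experience, not the percentage, is the measure that matters.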
The people for whom closed captioning was developed – such as hearing-impaired individuals – don’t see or care about a “score”; for them, it’s all about the overall viewing experience and getting the information they need. That’s far more important than any quantitative metric, but is totally subjective. And of course, for those who depend on closed captioning, almost any level of accuracy is better than having no closed captions at all.
Automated closed captioning is offered as a fully-integrated option for eSCRIBE’s Webcasting Plus module, effortlessly bolstering your accessibility. And as an added bonus, eSCRIBE’s closed captioning process also automatically generates a transcript that can be used to validate and update manually-entered meeting minutes – particularly valuable in jurisdictions that use the narrative style of minutes. Learn more about these and other important web streaming features in our white paper “Key Considerations for Public Sector Webcasting,” then contact us to see how your organization can start taking advantage of machine learning-enabled, automated closed captioning for your video streams.