- Captions are a method for adding written words to a video.
- Live captions are created in real time from the live audio of a broadcast or event.
- Offline captions are created after the fact rather than in real time, which often makes them more accurate than live captions.
- Closed captions are a form of captions which allow the end viewer to toggle the captions on or off at their discretion, typically through a “CC” button on their television or web player.
- Open captions are a form of captions that are “burned-in” to the video and the viewer cannot turn them off.
- Human transcriptionists are people who, in real time, listen to the audio from a program and type the caption text as they hear it, similar to what a court reporter does.
- AI transcription systems are automated programs that use speech recognition to detect words in an audio stream and convert them to written text. This is similar to how your mobile phone can convert speech into text, although usually much more sophisticated.
- Subtitles look similar to captions, but they display the dialogue in a written language different from the one spoken in the audio.
- Live captions are created in real time, whether by an AI system or a human.
- Real-time live captions are always subject to errors due to the nature of transcribing speech to text live.
- Live captions are still usually around 95% accurate.
- AI-powered systems cannot understand context or recognize when a word could be offensive or problematic. Although they are generally fast and reliable, they are therefore more prone to gaffes that a human wouldn’t make.
- Although human transcriptionists likely won't make errors of insensitivity (such as substituting words that could be offensive in the wrong context), they will still make general typos and errors in the course of transcribing live speech. This will be especially noticeable with homonyms or near-homonyms.
- Since real-time captions are being created and displayed while the program is being broadcast, there is no opportunity to proofread captions or to replay difficult-to-understand audio. As a result, errors do occur, usually in the form of incorrect, though phonetically similar, words (such as ‘row place’ instead of ‘replace,’ ‘dock’ instead of ‘dark,’ etc.).
- When partnering with a human transcriptionist (as opposed to an AI-based system), we try to provide the transcriptionist with any background notes, acronyms, proper nouns, scripts, etc. that will be used during the live event. This is a primer for the transcriptionist, but it will not be fed into the caption stream verbatim. Providing scripts to a transcriptionist ahead of time is only meant to help avoid common misspellings (Jane Smith vs. Smythe), introduce local or proper nouns (such as Puyallup or Snohomish), or familiarize the transcriptionist with proprietary acronyms (such as DOS, CPAP, MCT...). Providing scripts and other background material to a transcriptionist is not required, but it will help increase the overall accuracy.
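Accuracy figures like "around 95% accurate" are typically arrived at by comparing the caption text against a verified transcript. The sketch below is one way to do that, under the assumption that accuracy means one minus the word error rate (word-level edit distance divided by the number of words in the reference). The text above does not say which metric is actually used, and the function names here are hypothetical.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the caption
                curr[j - 1] + 1,     # extra word in the caption
                prev[j - 1] + cost,  # match or wrong word
            ))
        prev = curr
    return prev[-1]


def caption_accuracy(reference_text, caption_text):
    """Assumed metric: 1 - (word errors / words in the reference)."""
    ref = reference_text.lower().split()
    hyp = caption_text.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    errors = word_edit_distance(ref, hyp)
    return max(0.0, 1.0 - errors / len(ref))


reference = "please replace the dark filter before the show"
caption = "please row place the dock filter before the show"
print(caption_accuracy(reference, caption))  # 3 errors over 8 words -> 0.625
```

Note that a single phonetic slip such as ‘row place’ counts as two word errors under this metric (one wrong word plus one extra word), which is why short samples can score far below the accuracy of a full program.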
Why are closed captions so often inaccurate?
- Captions are created either by software or by humans, and both make mistakes. Live captioning (sports, news, live events) is the hardest because there is no time to go back and fix errors. If the background sound is too loud, the person or software can’t always make out the exact words used. Captioning for filmed TV programs and movies allows more time to check for errors.
- A lot of times, captioning is automated - a machine or a computer does it, and while the programs improve occasionally, at this point in time they are nowhere near able to understand speech and sounds 100% accurately in 100% of situations. Background noise, accents, and mumbling can all affect accuracy just as much as the program not being very good. (These are also things that can affect human captioners.)
- There are two main processes to closed captioning: live and taped. On taped shows such as sitcoms and dramas, the captioners are provided with a script beforehand and they have time to enter the captions and to check their work. Live shows, such as the news, do not have the luxury of knowing what will be said during the broadcast, so the captioners are captioning by ear. The captioners are using stenographer-type machines, like those court reporters use, in which the input is entered as phonetic representations of what is said, which a computer translates to written English. It is up to the captioners to monitor what is entered and correct errors as they occur. This is often where you see most of the errors in captioning occur: the garbage and undecipherable captions that I think you are referring to. And as you may have guessed, a lot depends on the skills and knowledge the captioner has.
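The phonetic-to-English translation step described above can be pictured as a dictionary lookup. The following is a toy sketch only: the stroke spellings (`RO`, `PLAES`, and so on), the dictionary contents, and the greedy longest-match rule are all assumptions made for illustration, not the notation or algorithm of any real captioning software.

```python
def translate_strokes(strokes, dictionary, max_entry_len=3):
    """Translate phonetic strokes to words, preferring the longest entry."""
    words = []
    i = 0
    while i < len(strokes):
        # Try the longest run of strokes first, then shorter ones.
        for size in range(min(max_entry_len, len(strokes) - i), 0, -1):
            key = tuple(strokes[i:i + size])
            if key in dictionary:
                words.append(dictionary[key])
                i += size
                break
        else:
            # No entry at all: show the raw stroke, as garbled captions do.
            words.append("[" + strokes[i] + "]")
            i += 1
    return " ".join(words)


# Hypothetical dictionary; the stroke spellings are invented for this example.
single_strokes = {
    ("RO",): "row",
    ("PLAES",): "place",
    ("THE",): "the",
    ("FILTER",): "filter",
}
with_multi_stroke = dict(single_strokes)
with_multi_stroke[("RO", "PLAES")] = "replace"

strokes = ["RO", "PLAES", "THE", "FILTER"]
print(translate_strokes(strokes, with_multi_stroke))  # replace the filter
print(translate_strokes(strokes, single_strokes))     # row place the filter
```

In this toy model, a missing or mismatched multi-stroke entry is enough to turn ‘replace’ into ‘row place’, and a stroke with no entry at all comes out as raw, undecipherable text.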
National Captioning Institute FAQ on captioning errors:
Q. WHY ARE THERE MISTAKES IN THE CAPTIONING?
A. NCI’s standard of accuracy for live captioning exceeds an average of 98%, and it often surpasses 99.3%. Although we are immensely proud of the quality of work we produce, errors are unfortunately part of real-time captioning due to the sheer amount of data involved and the live nature of its transmission. Some reasons for real-time mistakes include:
- Technical problems in transmitting and receiving the captions
- Muffled, hard-to-understand, or otherwise compromised audio can lead to incorrect text. [Note: including accents that skew the sound of words]
- Real-time captioning is displayed immediately, leaving no time to proofread.
- The captioner could hit the wrong keys or mispronounce a word, or the computer could incorrectly interpret the phonetic code.
- Captions may fall behind because there is a limit on how fast the television set can display them.
Article about captions & caption errors:
"Why I Hope Closed Caption Typos Never Go Away Completely"
“Real-time captioners use a computerized system based on the stenographic shorthand used by court reporters. A real-time captioner, someone who has been trained to transcribe speech to text using a steno machine, listens to a program’s dialogue as it is being broadcast and enters the words phonetically in stenographic shorthand code. The steno machine is connected to a computer containing software that translates stenographic shorthand into words using standard spellings and then converts them to a caption format…”
“Since real-time captions are being created and displayed while the program is being broadcast, there is no opportunity to proofread captions or to replay difficult to understand audio. As a result, errors do occur, usually in the form of incorrect, though phonetically similar, words (such as ‘row place’ instead of ‘replace.’) NCI continuously assesses each real-time captioner’s work so that accuracy rates of 98 percent or better can be maintained.”