
Captioning live online video.

Have you tried to caption live video online? It's the Wild West! It's far from plug and play. The solutions are out there, but they require a fair amount of effort to implement. Beyond being a major win for accessibility and inclusion, captions can attract more viewers, result in higher engagement, and ultimately increase the impact of your live content online.


The lack of solutions and standards is not a sign that no one's thinking about captioning, but rather evidence that the industry is moving really fast. As viewers migrate away from over-the-air broadcast to online and OTT video delivery, services like closed captioning could see a major boost in quality as speech-to-text improves. Captioning will inevitably be enhanced by innovations that are sure to come with the digital territory.

Realistically, we're at least a couple of years away from realizing that captioning future. Given the dizzying amount of live video hours that go online every minute, do we stand a chance of captioning anything in the meantime? If you're ready to roll up your sleeves and dig into solutions that are far from plug and play, the answer is a resounding yes.

The Business Case for Live Captioning

Let's start with the business case: Why caption your live video? You may have a host of good reasons already. Here are some of the benefits to consider:

* Captions increase inclusion. Captions make your content more accessible to more people. While it may not be mandated today by the FCC unless you are a broadcaster (more on this later in this article), sending a message of inclusion is likely to have a positive impact on the way your audience views your brand or program. Reading also improves comprehension for some, especially second-language viewers, which means your message will come across more clearly.

* Captioning means more viewers and more engagement. Videos with captions are more consumable by everyone. Facebook leadership has predicted that the platform will be all video in 5 years, and company reps shared that internal tests indicate a 12% increase in view time on captioned video ads. Now is the time to get ahead of solving your captioning challenges so you will be ready for a video-driven future online.

* Captioning is venue-agnostic. Whether your video is playing on a mobile device or on a monitor in a noisy airport terminal, the message is still conveyed. Nowadays, many public lobbies and spaces feature displays. Captioning your video makes it relevant and consumable regardless of the environment.

* Captioning enhances visibility to search engines. Captioning gives you the highly valuable benefit of your content surfacing higher and faster in searches, because search engines can index the text in your video. Live captions go a step further, letting your PR team quickly pull quotes for the press and your marketing team efficiently publish companion ebooks and blog posts. You can also flag inaccurate or inappropriate content for swift removal if needed.

* Captions improve content analysis. Data mining possibilities are infinite when you have the full transcript of your program immediately available. For example, you can easily determine term frequency to see what words are coming up most often in your programs.
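As a rough illustration of the term-frequency idea in the last bullet, here is a minimal Python sketch. The stop-word list and the sample transcript are made up for the example; a real analysis would run on the transcript your captioning provider delivers and would use a fuller stop-word list.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept short for the example.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for", "our", "we"}

def term_frequency(transcript: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Return the top_n most frequent words that are not stop words."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up transcript text, standing in for a real caption file.
sample = "Captions help viewers. Live captions help search, and captions help our team."
print(term_frequency(sample))
```

Because the transcript is available as soon as the program ends, a count like this can run right after the event rather than waiting for a separate transcription pass.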

The final consideration in the case for live captioning is cost. Live captioning services cost about $150-$250 per hour. That fee can include delivery of a corrected transcript and caption file, which you can use to enhance the video-on-demand (VOD) version of your live event. Investments in hardware and software will vary widely depending on your workflow and requirements.

Captioning Basics

While there are numerous benefits that come with captioning, how to go about doing it is not always obvious. Before we dive into solutions and implementation, let's cover some captioning basics.

* Captions vs. subtitles: While both captions and subtitles involve displaying text on screen, there are fundamental differences. Captions are a text rendering of all the audio information on a program: dialogue, music, sound effects, and cues. Subtitles, by comparison, are for comprehension--most often when the language being spoken deviates from the primary language of the program or when a viewer chooses to have a foreign language translation.

* Live vs. on-demand: Captioning live video is very different from adding captions to VOD. The latter is pretty well supported across online platforms with multiple file format options available and thorough documentation on how to do it just a search away. Live captioning presents all sorts of challenges, from accuracy and timing, to the technology and equipment required to make it all work. We came up rather empty when we searched for "captioning live video online," which is why we wrote this article.

* Closed vs. open: Closed captions (Figure 1) are encoded in the video signal and then decoded by the player or device, with the ubiquitous toggle we are all accustomed to. Open captions (Figure 2) are burned into the video signal, visible regardless of the player, and cannot be turned off by the viewer. Open captions degrade at lower resolutions while closed captions do not.



* Scrolling vs. pop-on: Scrolling and pop-on are the two main styles of displaying captions. Scrolling/paint-on captions tend to provide a better experience for live content. Pop-on captions are carefully timed with the action on screen. While there are ways to improve the timing of live captions, this kind of precision is not possible, and so pop-ons often drop from the screen too quickly in a live scenario.

* 608 vs. 708: Do captioning standards still matter? Though it may not be relevant for much longer, the CEA-708 standard is the bridge from broadcast television standards to captioning live video online. The EIA-608 standard was developed many years ago; 708 was added when television went digital. For more on 608/708, see the sidebar, "Captioning Standards: Where We Are and How We Got Here."
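To make the scrolling style described above more concrete, here is a small Python sketch of a scrolling (roll-up) caption window. It is a simulation under our own assumptions, not a caption decoder: words arrive one at a time, as they would from a live captioner, and the window keeps only the most recent rows. The function name and the default width and row count are illustrative choices.

```python
from collections import deque

def roll_up(words, width=32, rows=2):
    """Yield the visible caption rows after each incoming word."""
    finished = deque(maxlen=rows - 1)  # completed rows still on screen
    current = ""                       # row that is still being typed
    for word in words:
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > width:
            finished.append(current)   # row is full: scroll it up
            current = word
        else:
            current = candidate
        yield [*finished, current]

text = "captions with room for improvement are better than no captions at all"
for window in roll_up(text.split(), width=20):
    print(window)
```

Each word appears as soon as it arrives, which is why this style tolerates live timing. A pop-on display would instead have to wait for a complete, timed block of text before showing anything.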

In the U.S., most live captioning is done by typing, using special software that converts the information into captions that can be added to the video signal. Voice-writing, also known as respeaking, is a method popular in other parts of the world that uses speech-to-text to generate the captions.

We took a U.S.-centric approach with this article. Broadcast standards differ around the globe, and multilanguage and character set support for captions are commonplace.

Implementing Captions: Key Questions to Ask

Now that you understand how captioning your live video makes business sense, and you've got some of the basic concepts down, where should you begin with implementation? We recommend getting to know your use case(s) really well to help guide decisions you will need to make along the way. Here are a few questions to get you started:

* What platform(s) will you use to live stream the video? What live captioning solutions do they support?

* If you have a production workflow in place, what support does it include for closed captions? Will the encoders you use pass closed captioning as part of the signal? Does the codec support it?

* Is it imperative that the user be able to toggle captions on or off? Or is it acceptable or preferred to have them "open" or burned into the video?

* Are you captioning a one-time-only live event, or are you looking for a solution to an ongoing need? Does it need to scale?

How Does Captioning Work?

If you plan to do captioning on the production side, things are a little more straightforward (albeit more costly). Most commonly, it involves the use of a hardware caption encoder in conjunction with a broadcast encoder. A caption encoder, such as the EEG HD492 (Figure 3), is a pass-through device that receives the video signal (usually over SDI) and sends an audio feed over the internet to a closed-captions provider.


The captions provider, via transcription or respeaking, converts the speech to text, which is sent back to the caption encoder. The caption encoder optionally delays the video to account for the time the conversion takes, then embeds the data into the SDI. This output is then fed into a real-time broadcast encoder (such as Elemental Live), where it is converted into your broadcast format (HLS, RTMP, etc.).
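The optional video delay is the part of this flow that is easiest to misjudge, so here is a simplified Python simulation of it. This is a sketch of the idea only, not how any particular caption encoder works: the class name, the one-frame-per-second rate, the 4-second delay, and the 3-second captioning latency are all assumptions chosen to keep the example readable.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Frame:
    pts: float                                   # presentation time, in seconds
    captions: list[str] = field(default_factory=list)

class CaptionDelay:
    """Hold frames for `delay` seconds so late-arriving captions can be attached."""

    def __init__(self, delay: float):
        self.delay = delay
        self.buffer: deque[Frame] = deque()

    def push_frame(self, frame: Frame, now: float) -> list[Frame]:
        """Queue a frame and return every frame that is old enough to release."""
        self.buffer.append(frame)
        released = []
        while self.buffer and now - self.buffer[0].pts >= self.delay:
            released.append(self.buffer.popleft())
        return released

    def push_caption(self, text: str, spoken_at: float) -> bool:
        """Attach text to the first buffered frame at or after spoken_at."""
        if not self.buffer or spoken_at < self.buffer[0].pts:
            return False                         # that frame has already been sent
        for frame in self.buffer:
            if frame.pts >= spoken_at:
                frame.captions.append(text)
                return True
        return False                             # frame has not arrived yet

# Simulate 10 seconds of video at one frame per second.
encoder = CaptionDelay(delay=4.0)
output = []
for t in range(10):
    output += encoder.push_frame(Frame(pts=float(t)), now=float(t))
    if t == 5:  # words spoken at t=2 come back from the captioner 3 seconds later
        encoder.push_caption("Hello, everyone.", spoken_at=2.0)

print([(f.pts, f.captions) for f in output if f.captions])
```

As long as the delay is longer than the captioner's latency, the text lands on the frame where the words were spoken. If the delay is shorter, `push_caption` returns `False` and the caption has to go out late, which is what viewers see when captions trail the speaker.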

Figure 4 shows a signal flow example for live captioning. This tends to be a fairly expensive process. Replacing this equipment with a software-only solution, such as EEG Falcon, will likely prove to be more scalable and affordable--assuming it works within your workflow and is integrated into your platform of choice.


How to Get the Best Caption Quality

There are several steps you can take to get the best possible quality when delivering live captions. Here's what we recommend:

* Define your use case. Creating an ideal user experience starts with understanding the use case really well.

* Prepare your content first. Send all the content details you have available to the captioning provider in advance. This may include show description, speaker names, run of show, event link, special terms and acronyms, scripts, and any other helpful context you can provide. Captioners use shortcuts and macros to increase their speed and accuracy. Having this information ahead of time allows them to prepare those and means a significant boost in the quality of your captions.

* Does your show include video playback? If you'll be playing back video during your broadcast, check to see if that content is already captioned. If so, you may be able to pass those existing captions along rather than create live ones, provided, of course, they are better quality than you think you can get from a live captioner. Make sure to let the providers know to expect any videos so they can refrain from captioning those portions. Another option is to send the caption file or transcript for the video to the captioning provider in advance.

* Consider appearance options. Depending on your workflow, you may have a variety of options for screen placement, color, and style of your captions. Experiment to see what works best for your content. For example, captions along the bottom could cover lower-third graphics so an upper placement might be better. While scrolling captions tend to work better for live, you may have a good reason to go with pop-on, so be sure to test that too.

* Test, test, test. Make sure to give yourself plenty of time to test and iterate the captioning solution you want to implement. For example, a test could reveal a mismatch between the caption standard and the broadcast format you are using. Since the technology involved is constantly evolving, you want to create a process that includes frequent testing from end to end to avoid any failures when it is time to go live.
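One end-to-end check that is easy to automate is confirming that your stream's HLS master playlist declares a caption or subtitle rendition at all. The Python sketch below assumes an HLS workflow; the sample playlist is hypothetical, and in practice you would fetch the playlist text from your own stream URL. It checks only what the manifest declares, not whether caption data is actually present in the video.

```python
import re

def caption_renditions(master_playlist: str) -> list[dict[str, str]]:
    """Return the attributes of each caption or subtitle rendition declared."""
    found = []
    for line in master_playlist.splitlines():
        if not line.startswith("#EXT-X-MEDIA:"):
            continue
        pairs = re.findall(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)', line.split(":", 1)[1])
        attrs = {key: value.strip('"') for key, value in pairs}
        if attrs.get("TYPE") in {"CLOSED-CAPTIONS", "SUBTITLES"}:
            found.append(attrs)
    return found

# Hypothetical master playlist declaring one in-band 608 caption channel.
SAMPLE = """#EXTM3U
#EXT-X-MEDIA:TYPE=CLOSED-CAPTIONS,GROUP-ID="cc",NAME="English",LANGUAGE="en",INSTREAM-ID="CC1"
#EXT-X-STREAM-INF:BANDWIDTH=2000000,CLOSED-CAPTIONS="cc"
video.m3u8
"""

renditions = caption_renditions(SAMPLE)
print(renditions if renditions else "No captions declared in the manifest")
```

A check like this fits into a pre-show checklist, but it does not replace watching the stream in the players your audience actually uses.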

Final Words of Advice

Captioning live video online is challenging, and in some cases it is expensive. The payoffs are compelling: more viewers, higher engagement, and increased impact. There are no plug-and-play solutions readily available, so be prepared to do your homework.

If you hit roadblocks, seek advice from people like us who are already doing it. Form alliances with your key vendors to help you get it right. Accept that it may take a while to get there. Finally, captions with room for improvement are better than no captions at all. So just get started with a commitment to improving and enhancing your user experience along the way.

By Heather Hurford and Matt Szatmary

Heather Hurford is Live Video Producer at LinkedIn. Matthew Szatmary is Senior Video Encoding/Playback Engineer at Twitch.

We offer special thanks for contributions from Naomi Black, Alex Barrett, Jamie Baughman, Dan Swiney, and Heather Duthie.

Comments? Email us, or check the masthead for other ways to contact us.

Captioning Standards: Where We Are and How We Got Here

The Television Decoder Circuitry Act of 1990 requires televisions with screens 13" or larger to have support for closed captions. The passage of this bill into law solidified the dominance of the EIA-608 standard already in use by the NTSC. Later, when CEA-708 was developed, it included an EIA-608 backward compatibility mode to carry a 608 payload.

Thus, it came as no surprise that when Apple released HLS in 2009, it chose to support this format. This solution worked brilliantly, as it was a workflow that broadcasters were already accustomed to. In addition, it leveraged much of the existing, and often expensive, caption encoder hardware and services.

The 608/708 format does have some drawbacks, however. First, it is highly U.S.-centric, with extensions for limited support of European character sets. Support for Asian and Arabic encodings is simply nonexistent. It is also a very complicated standard, with very few free and open source tools available. This made it difficult for organizations that did not produce live video, or whose primary consumers were on the internet, to work with the standard.

In 2012, Apple added support for segmented Web Video Text Tracks (WebVTT) in iOS and Safari. Even though the Apple platforms still do not support some advanced WebVTT features such as styling and CSS, the changes to iOS and Safari made dealing with captions, especially for VOD content, much easier. No custom tools were needed--only a basic text editor to make the VTT files and update the M3U8 manifests. WebVTT also added support for the full UTF-8 character set.
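To show how little tooling segmented WebVTT needs, here is a minimal Python sketch that writes the text of one WebVTT segment and a subtitle media playlist. The segment names, the 6-second target duration, and the `X-TIMESTAMP-MAP` values are placeholders for illustration; a real packager has to align that timestamp mapping with the video segments.

```python
def vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def vtt_segment(cues) -> str:
    """Build one WebVTT segment from (start, end, text) tuples."""
    # The X-TIMESTAMP-MAP header ties cue times to the video timeline;
    # the values here are placeholders, not taken from a real stream.
    lines = ["WEBVTT", "X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000", ""]
    for start, end, text in cues:
        lines += [f"{vtt_time(start)} --> {vtt_time(end)}", text, ""]
    return "\n".join(lines)

def subtitle_playlist(segment_names, target_duration=6, first_sequence=0) -> str:
    """Build a live subtitle media playlist (no ENDLIST tag while live)."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target_duration}",
        f"#EXT-X-MEDIA-SEQUENCE:{first_sequence}",
    ]
    for name in segment_names:
        lines += [f"#EXTINF:{target_duration:.3f},", name]
    return "\n".join(lines) + "\n"

print(vtt_segment([(0.0, 2.5, "Good morning."), (2.5, 5.0, "Welcome to the show.")]))
print(subtitle_playlist(["captions0.vtt", "captions1.vtt"]))
```

For a live stream, the playlist is rewritten as each new segment is produced, which is the step a text editor cannot do for you.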

Some features were lost, however, such as paint-on and pop-up modes. This makes live corrections impossible, which is a serious limitation since stenographers and voice writers often make them. Producers also need to know in advance exactly how long to leave the text on the screen before replacing it. In a live scenario, that's impossible to do without delaying the captions.
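The timing problem can be seen in a few lines of Python. In this sketch, which uses made-up arrival times and a hypothetical 4-second cap on display time, a cue's end time can only be filled in once the next piece of text has arrived, so the newest text is always left waiting.

```python
def finalize_cues(events, max_duration=4.0):
    """Turn (arrival_time, text) events into (start, end, text) cues.

    A cue ends when the next text arrives or after max_duration seconds,
    whichever comes first. The newest event cannot be finalized yet.
    """
    cues = []
    for (start, text), (next_start, _) in zip(events, events[1:]):
        end = min(next_start, start + max_duration)
        cues.append((start, end, text))
    return cues

# Made-up arrival times for three pieces of caption text.
events = [(0.0, "Good morning."), (1.5, "Welcome to the show."), (7.0, "Let's begin.")]
print(finalize_cues(events))
```

The third event produces no cue until a fourth arrives, which is the delay the paragraph above describes.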

There are many other standards for closed captioning, including TTML, SMPTE-TT, SRT, and SSA. Each of them has pros and cons. As history demonstrates time after time, in any format war the winner will be not the most advanced technology, but the technology that the greatest number of consumers can access. It ultimately comes down to the question, "What can your users decode and play back?"

The focus on HLS here is not necessarily because it's the best, but because it has the most rigid requirements. If you need captions on iOS, these are the only options that are at least semi-standards-compliant (i.e., the only ones that avoid vendor lock-in). Other platforms are more adaptable. On Android, ExoPlayer supports both Apple's version of live WebVTT and 608. And on the web, HLS.js (from Dailymotion) supports 608, all for free. Many commercial players such as JW Player and THEOplayer offer similar capabilities.

If iOS is not a concern, and/or you have a DASH-based platform, things are trickier. ExoPlayer can support in-band WebVTT and TTML captions in fragmented MP4 (fMP4). For the web, you must either have a player parse the fMP4 and send cues to the HTML5 text-track element, or support something custom similar to Apple's version of WebVTT. Check with your player vendor. (This assumes you can find a live encoder that can produce this format.)

The 608/708 format is most certainly past its prime. It's difficult to work with, limited in capabilities, and, again, U.S.-centric. Unfortunately, there is not a clear path to a replacement. There are several major reasons for this. First, most new proposed formats are developed with VOD in mind. Taking VOD-centric standards and attempting to adapt them to live is generally unworkable. Just as the internet application industry shifted to a mobile-first mentality, those of us in internet video must shift to a live-first mentality. Second, new standards are not focused on the production side. If one of these standards could deliver a full photon-to-photon solution, it would have little competition.
COPYRIGHT 2017 Information Today, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2017 Gale, Cengage Learning. All rights reserved.

Article Details
Title Annotation:how-to's and tutorials
Author:Hurford, Heather; Szatmary, Matt
Publication:Streaming Media
Date:Mar 1, 2017