ZERO-SUM GAME: IN THE AV WORLD, ZERO-FRAME LATENCY ISN'T JUST A PIPE DREAM--IT'S A REQUIREMENT. THE STREAMING INDUSTRY WOULD DO WELL TO PAY ATTENTION.
Many of these are reasonable, centering on network capacity or intermittency, the cost to scale out low-latency solutions, or even the limitation of off-the-shelf processors to handle 4K Ultra HD or high dynamic range (HDR) content in real time.
But the problem runs deeper than any of those issues, going to the codecs themselves and to the packaging and segmentation that have sprung up around scalable streaming video, both of which add inherent latency. A few of us have been ranting about these latencies since the advent of HDS, HLS, and even DASH. The move toward OTT live streaming has brought these latencies--or synchronicities, as one industry colleague referred to the issue of latency at Streaming Media East 2019--to the forefront.
To better address latency for streaming, let's use this article to explore ways to deliver video and audio that absolutely, positively have to be there now (to paraphrase the once-popular Federal Express slogan).
It's not a theoretical exercise, as conversations at trade shows like InfoComm attest. There, corporations and houses of worship are looking to deliver content both locally (through the use of image magnification, or IMAG) with absolutely no latency and remotely (across campus or to distance-learning students). These knowledgeable users, for reasons of both operational complexity and cost, don't want to have to deploy two solutions: a zero-latency one for local delivery and a very-low-latency one for remote users who will expect to interact with the presenter and his or her local audience.
Is the Codec Salvageable?
In the zero-latency local delivery use case, a standard segmentation-packaging streaming approach fails miserably, but the problem starts well before the packaging step, at streaming's core: the encoders.
It's not just the encoders' problem, though, as many of them have been optimized over time to compress with our industry-standard codecs. A major part of the problem lies with the codecs themselves, which have overall deficiencies when it comes to zero-latency encoding and delivery.
Discussions around live-streaming encoding and delivery often include a classic three-legged stool illustration, or what one of our interviewees for this article refers to as the "codec triangle" for decision making. The three "legs," or triangle "sides," must be in balance for a streaming solution to work. These three areas are speed, quality, and bandwidth. Some substitute the term "cost" for "bandwidth," but both emphasize the fact that the higher the bandwidth, the higher the cost of consumption by consumers and corporations alike.
Streaming at scale is premised on the idea of saving bandwidth. As such, for on-demand content, the emphasis is placed on the intersection of speed and quality to preserve bandwidth. To eke out the best quality at the lowest bandwidth, video-on-demand encoders are allowed to spend more time than the length of the asset (e.g., 2 hours to encode a 1-hour video file) to create a final product that looks the best it can at a given bandwidth with a given codec.
To achieve quality over limited bandwidth, the streaming industry makes heavy use of interframe compression, in which a group of pictures (GoP) is aggregated together and compressed across time, with only the differences between adjacent images in the GoP being encoded. These less-than-total-image frames are referred to as P or B frames; the initial frame in every GoP is called a key frame or I-frame.
Almost all interframe compression solutions, including H.264 (AVC) and H.265 (HEVC), use an IPB approach, and the results are impressive when it comes to saving bandwidth. In many cases, using P and B frames, it's possible to see upward of 70% aggregated bandwidth savings across a single GoP of 30-60 frames compared to an I-frame-only approach.
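To see where savings on that order can come from, here is a minimal sketch of the arithmetic. It assumes, purely for illustration, that every P or B frame averages one quarter the size of an I-frame; that ratio is hypothetical and is not a figure from any codec specification or from the sources quoted in this article.

```python
def gop_savings(gop_length, predicted_frame_ratio):
    """Fraction of bandwidth saved versus sending every frame as an I-frame.

    Sizes are relative to one I-frame (I-frame size = 1.0).
    predicted_frame_ratio is the assumed average size of a P or B frame.
    """
    if gop_length < 1:
        raise ValueError("a GoP needs at least one frame")
    # One I-frame followed by (gop_length - 1) smaller predicted frames.
    gop_size = 1.0 + (gop_length - 1) * predicted_frame_ratio
    all_intra_size = float(gop_length)
    return 1.0 - gop_size / all_intra_size


if __name__ == "__main__":
    ASSUMED_RATIO = 0.25  # hypothetical: P/B frames at 25% of an I-frame
    for frames in (30, 60):
        saved = gop_savings(frames, ASSUMED_RATIO)
        print(f"GoP of {frames} frames: about {saved * 100:.0f}% saved")
```

Under that assumed ratio, both a 30-frame and a 60-frame GoP land a little above 70% savings. Real content varies widely with motion and scene complexity.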
Yet for live-streaming delivery, the use of P and B frames has the potential to cause significant disruption. Going back to the three-legged stool, the emphasis shifts to one of timely encoding and delivery. In a live-streaming scenario, speed is paramount, with quality and bandwidth being secondary.
In fact, to achieve true live encoding at zero latency--we'll define this term a bit later--the timing window is incredibly short: Live content shot on cameras at 60 fps (e.g., 1080p60 or 4K60) requires a frame to be both compressed and delivered every 0.0167 seconds, or roughly every 16 milliseconds (ms).
And that's not even the whole story: While a frame must be displayed every 16 ms, the transmission process takes time too, as does the packetizing process, to move the encoded video into Ethernet packets for delivery across an IP network. That means that the encoding of a frame of video typically must take place in half the time for delivery (i.e., around the 8-ms range) if video is going to be delivered at zero latency.
Which brings us back around to the Achilles' heel of interframe streaming video: P and B frames. Since the encoder needs to compare multiple frames within the GoP to save bandwidth, the use of these P or B frames inherently adds additional latency.
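The timing budget described above can be sketched in a few lines. This is an illustration only: the 50% encode share mirrors the "half the time" rule of thumb in the text, and the number of frames an encoder holds back is a hypothetical input, not a property of any particular encoder.

```python
def frame_interval_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def encode_budget_ms(fps, encode_share=0.5):
    """Portion of the frame interval left for encoding.

    encode_share=0.5 follows the rule of thumb that encoding gets
    about half the interval, with the rest going to packetizing
    and transmission.
    """
    return frame_interval_ms(fps) * encode_share


def buffering_delay_ms(frames_held, fps):
    """Delay added when an encoder waits on additional frames
    before it can emit output (frames_held is hypothetical)."""
    return frames_held * frame_interval_ms(fps)


if __name__ == "__main__":
    print(f"Frame interval at 60 fps: {frame_interval_ms(60):.2f} ms")
    print(f"Encode budget at 60 fps:  {encode_budget_ms(60):.2f} ms")
    for held in (1, 3, 12):
        delay = buffering_delay_ms(held, 60)
        print(f"Holding {held} frame(s) adds {delay:.0f} ms")
```

At 60 fps, the interval works out to about 16.7 ms and the encode budget to about 8.3 ms. Every frame the encoder has to wait for adds another full interval on top of that.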
So what can be done to address the balance of speed, quality, and bandwidth (cost)? To think about what might be, let's first examine a typical use case where zero latency might be needed.
In a live-venue setting, any latency is enough to cause visual discomfort. We've probably all experienced this visual discomfort at some point in settings where the presenter might be right in front of the audience in person, as well as being projected onto a big screen in the same room.
If the presenter raises her hand, and the encoder requires even a dozen or more extra frames to encode, the result will be a one-Mississippi, two-Mississippi delay between her movement and what appears on the projection screen.
Worse still, if the presenter is using a computer that's being projected onto a big screen, visual discomfort for the presenter can occur at around three frames of latency if she tries to interact with a big screen while using a computer mouse on that screen.
So if it's disconcerting to the local audience and to the local presenter, why would compression be used at all?
That is the argument made by the audio visual (AV) industry over the past decade as it attempted to reach a point where technology advances allowed video signals to be sent at zero latency across IP. The need for zero latency is also the reason that almost all IMAG solutions installed in large lecture halls, sports arenas, and music venues are still primarily running on non-packetized, point-to-point solutions.
The AV industry and the streaming industry both use the term "latency" to describe delay. But where the streaming industry uses "low latency" or "ultra-low latency" to describe, respectively, up to 5 seconds of delay and up to 1 second of delay, the AV industry started off making a much bolder assertion: zero latency.
In some ways, this "zero latency" reference was born of necessity, as multiple-input, multiple-output video switches--referred to as matrix switches, although somewhat akin to old-school telephone switchboards--were able to deliver a matrix of inputs to one or more outputs, in configurations up to 128 simultaneous outputs, at latencies of less than 1 ms.
Switching the Switches
The way these point-to-point solutions first worked in the 1990s was through the use of five-wire RGBHV cables that individually delivered three colors (red, green, blue) and two types of image synchronization (horizontal and vertical sync). The cabling was expensive (several dollars per foot), and the terminations were clumsy BNC connectors. The back of even a simple 16-input, 16-output (16x16) matrix switch would require 160 BNC connectors, and these units ranged up to 128x128 configurations (that were easily the size of a standard refrigerator) to accommodate more than 1,250 individual BNC connectors.
The benefit of these RGBHV (and subsequent HDMI) matrix switches was that interlaced content could be replicated through the cable at absolutely no latency. In essence, a matrix switch was just a really expensive combination signal booster and distribution amp sitting in the middle of a long video cable that could be used to send the signal up to 100 feet with no signal degradation.
A brief side note here: The switch from RGBHV to HDMI cabling added a bit of a twist, as HDMI content was primarily in a progressive format (where the frame is presented as a single image) rather than interlaced (the image is a series of interlaced odd-even lines). While HDMI could support 1080i and 1080p, RGBHV cabling could only support 1080i. The move to progressive content (e.g., 720p, 1080p, 2160p) meant that the terminology needed to shift from zero latency to zero-frame latency. While some solutions still claim zero latency, any progressive content necessitates transmission of a full frame rather than a portion of a frame.
Once the signal needed to be moved beyond the lecture hall, though, even standard RGBHV or HDMI video cabling didn't work--and in some cases, such as 100-plus-feet HDMI cables, didn't exist--so a new solution was required. A few years ago, the form of delivery from an end point to the matrix transitioned from expensive, purpose-built video cabling to much less costly structured wiring. Typically, this was inexpensive, unshielded four-pair Cat5e or Cat6 cabling (unshielded twisted pair, or UTP) terminated with an RJ-45 Ethernet connector and capable of delivering a baseband video signal up to 100 meters (m) or 330 feet.
This switch to UTP inputs and outputs at the video matrix allowed AV integrators to use existing copper Cat5e and Cat6 wiring in buildings, even though the cabling was not delivering IP signals. Even copper Cat6 wiring is limited to transmission distances of 100 m, but the use of UTP cabling opened up the possibility of gathering video from multiple classrooms to a centralized matrix switch. Yet the basic premise remained the same: point-to-point inputs and outputs into a non-IP video matrix switch.
The move to UTP led to some intentional marketing confusion (names such as AV-over-Cat5 or HDBaseT) as IT professionals, seeing the cabling, might assume that it was standard IP-based video delivery. This confusion also led to a few years of unintentional mishaps, such as the fairly regular occurrence when an AV-over-Cat5e cable--with non-standard power pinouts, compared to traditional Power over Ethernet (PoE) pinouts--was inadvertently plugged into, and ultimately fried, an IT-department Ethernet switch.
"HDBaseT is not a solution to address streaming demands," says Paul Shu, president of Arista, a company that manufactures industrial computing solutions for healthcare, hospitality, and other mission-critical market verticals. "HDBaseT is intended to address the distance challenges that some pro AV applications encountered, a solution to extend the distance beyond what HDMI can reach."
Justin Kennington, president of the Software-Defined Video over Ethernet (SDVoE) Alliance, explains just how exacting the expectations were for sub-frame delivery times that had been delivered by these RGBHV cables and, later, the structured wiring of Cat5e or Cat6: "We couldn't move the industry away from the comfortable, familiar matrix switch until a technology existed that could truly duplicate its performance." Says Kennington, "An HDBaseT matrix switch [delivers video] in dozens of microseconds, far below the threshold of human perception."
The AV industry is now attempting, for the third time in a decade, to replace the matrix switch with the Ethernet switch. According to Kennington, the financials will drive the move--he estimates the cost of a 48-port 10G Ethernet switch at approximately $5,000 versus a 48x48 video matrix switch at around $59,000--as long as the IP-based technology can meet the same zero-frame requirements of UTP or HDMI cabling.
FPGA to the Rescue
One of the solutions the AV industry homed in on, at least 3 years before the streaming industry started considering the benefits, was the use of a field-programmable gate array (FPGA) to provide massive parallel encoding. AptoVision, a company with expertise in packaging FPGA and Ethernet physical components ("phys" in networking and chip manufacturing lingo), developed the encoding technology that's now known in the AV market as SDVoE.
"SDVoE end-to-end latency is around 100 microseconds or 0.1 milliseconds," says Kennington. Noting how SDVoE rivals the speed of HDBaseT while also allowing content to be packetized and delivered as IP across lower-cost Ethernet switches, he adds, "SDVoE is built the way it is because that is what is required to match the video performance of a matrix switch."
Given the advances in FPGA encoding for H.264 (AVC) and H.265 (HEVC), some in the streaming industry might argue that frame-by-frame or I-frame AVC or HEVC might work for these zero-frame latency use cases, but professional AV integrators see standard streaming video codecs as falling short of the use-case requirements.
"The SDVoE compression codec, when enabled, adds five lines of latency," according to Kennington. "At 4K UHD, 60Hz that's 7.5 microseconds, which blows away even I-frame only AVC/HEVC, etc."
Kennington is correct in this regard, since the MPEG codecs inherently have been geared toward delivering bandwidth savings across multiple frames, whereas codecs designed for zero-frame latencies are designed to encode video well under 16 ms (or 16,000 microseconds).
Ryohei Iwasaki, executive director of IDK Corp. (HQ) and CEO of IDK America, a company manufacturing professional AV video gear, further explains why there's room for more than just the standard-based MPEG codecs in the marketplace: "We are not comparing between SDVoE and H.264/265 since IDK's thinking is that usage and purpose of those codecs are different.
"We decided to go with a 10Gbps AV solution since [a] 4K signal has 18GB," Iwasaki continues, referring to the fact that an uncompressed 4K60 8-bit video signal is in the 14Gbps range, but rises to 18Gbps when accounting for the word-bit conversion (8b/10b) that HDMI requires for transmitting a 4K60 signal across an HDMI cable.
"We tested many other codecs' functionalities and scalabilities for the future," Iwasaki says, "and IDK thought that SDVoE is the one to adapt for now as it satisfies most of pro AV customers' requirements."
Ethernet switch speeds are quoted as data rates, without the line-coding overhead--in fact, 1Gbps Ethernet over fiber (1000BASE-X) uses 8b/10b coding and actually signals at 1.25Gbaud, but is marketed as 1Gbps to avoid any confusion--meaning the compression is fairly light (about 1.4:1) for a 4K60 8-bit signal streamed using the SDVoE approach.
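The data rates quoted above can be checked with a short calculation. This is a sketch under stated assumptions: 8 bits per color channel (24 bits per pixel), and a total raster of 4400x2250 pixels including blanking, which is the CTA-861 timing commonly used for 4K60 and is not a figure given in this article.

```python
def video_gbps(width, height, fps, bits_per_pixel=24):
    """Uncompressed video data rate in gigabits per second."""
    return width * height * fps * bits_per_pixel / 1e9


ACTIVE_4K = (3840, 2160)  # visible pixels only
TOTAL_4K = (4400, 2250)   # assumption: CTA-861 4K60 raster with blanking

active_gbps = video_gbps(*ACTIVE_4K, 60)
raster_gbps = video_gbps(*TOTAL_4K, 60)

# 8b/10b coding sends 10 bits on the wire for every 8 data bits.
hdmi_line_gbps = raster_gbps * 10 / 8

# Ratio needed to fit the full raster into a 10Gbps pipe, treating all
# 10Gbps as usable (packet overhead is ignored in this sketch).
ratio_for_10g = raster_gbps / 10.0

if __name__ == "__main__":
    print(f"Active pixels only:  {active_gbps:.2f} Gbps")
    print(f"With blanking:       {raster_gbps:.2f} Gbps")
    print(f"After 8b/10b coding: {hdmi_line_gbps:.2f} Gbps")
    print(f"Ratio to fit 10G:    {ratio_for_10g:.2f}:1")
```

With those assumptions, the raster rate comes to roughly 14Gbps, the 8b/10b line rate to roughly 18Gbps, and the ratio to a little over 1.4:1, in line with the figures in the text.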
Kennington says that SDVoE also considered other codecs as it developed the SDVoE FPGA-10G phys package. "When the groundwork for what became SDVoE was laid, we did investigate the existing codecs [including the] MPEGs and JPEGs and others. What we found is that they all made too many compromises in the name of bandwidth savings."
As Kennington explains, "The JPEG-style codecs try to make the same compromise we do: reduce compression efficiency in exchange for better latency and/or image quality. But we find they simply don't go far enough."
Kennington then puts a stake through the heart of the JPEG-based codec option for these high-resolution, zero-frame latency use cases by pointing out, "The original DCT [discrete cosine transform]-based JPEG suffers from ringing and block artifacts. And," he continues, "wavelet-based JPG2000 has its own problems, especially with high-res computer graphics and certain color transitions, where luma is relatively constant and chroma is changing."
These issues with luminance and chrominance are inherent to certain DCT encoding approaches. In fact, DCT could kindly be considered a long-in-the-tooth approach, since it dates back almost 30 years to the advent of the JPEG still-image compression.
Kennington also notes that, at least from a peak-signal-to-noise ratio (PSNR) quality metric standpoint, the SDVoE solution fares better than JPEG. "Our codec scores for PSNR are often much better than JPEG." He gives this example: "[O]n the yacht club image (go2sm.com/yachtclub) we scored a 57dB, compared to 45.5dB for the highest-quality JPEG example shown."
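For readers less familiar with the metric, PSNR is 10 times the base-10 logarithm of the squared peak pixel value divided by the mean squared error (MSE) between the original and the decoded image. The sketch below shows that standard formula on flat lists of pixel values, assuming 8-bit samples (peak value 255). It is not the SDVoE Alliance's test procedure, which the article does not describe.

```python
import math


def psnr_db(reference, decoded, max_value=255):
    """PSNR in decibels between two equal-length pixel sequences."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no error at all
    return 10 * math.log10(max_value ** 2 / mse)


def mse_ratio(psnr_high_db, psnr_low_db):
    """How many times lower the mean squared error is at the higher PSNR."""
    return 10 ** ((psnr_high_db - psnr_low_db) / 10)


if __name__ == "__main__":
    # The two scores quoted for the yacht club image.
    print(f"MSE ratio, 57dB vs 45.5dB: {mse_ratio(57.0, 45.5):.1f}x")
```

Because the scale is logarithmic, the 11.5dB gap between the two quoted scores corresponds to a mean squared error roughly 14 times lower.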
While H.264 and H.265 don't necessarily suffer the same fate as JPEG, they do share similarities that may make them less than ideal for use as high-resolution I-frame codecs for the AV-over-IP integration market.
"Standardized MPEG codecs can be tuned to reduce latency, but that's coming at the expense of image fidelity, and vice versa," Kennington says.
Bandwidth Is Cheap
While the concept of using a 10Gbps Ethernet switch to live stream 4K60 8-bit or even 10-bit content might sound like overkill, Kennington explains the reasoning for using the codec triangle. "In pro AV, we simply don't require the kind of bandwidth savings that interframe compression is optimized for." He goes on to note that most AV-over-IP solutions run at a full 1Gbps or even 10Gbps versus the standard 2.5Mbps or 6Mbps for a streaming video delivery from Netflix.
Referring to the "fairly light" compression for 4K60 content (essentially a 1.4:1 compression ratio), Kennington also provides an answer to a question I'd had about video at data rates below 10Gbps: "SDVoE's codec doesn't even use compression unless it is required. Since a 1080p60 8-bit stream is only 3 gigabits per second, we transmit that without any loss at native data rate. Same for 4K30 at 6Gbps. We only compress signals above [the] 10Gbps raw data rate, like 4K60. And we only compress by the minimum amount required to fit into the 10G Ethernet pipe."
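The policy Kennington describes, compressing only when the signal exceeds the pipe and then only by the minimum amount, can be sketched as follows. The function name and the treatment of the full 10Gbps as usable payload are assumptions made for illustration. The input rates are the round figures quoted in this article, not measurements.

```python
def plan_transmission(raw_gbps, pipe_gbps=10.0):
    """Return (mode, compression_ratio) for a raw video data rate.

    Signals that fit the pipe are sent uncompressed (ratio 1.0).
    Larger signals get the smallest ratio that makes them fit.
    Hypothetical helper; assumes the whole pipe carries video.
    """
    if raw_gbps <= 0 or pipe_gbps <= 0:
        raise ValueError("rates must be positive")
    ratio = max(1.0, raw_gbps / pipe_gbps)
    mode = "uncompressed" if ratio == 1.0 else "compressed"
    return mode, ratio


if __name__ == "__main__":
    quoted_rates = {"1080p60": 3.0, "4K30": 6.0, "4K60": 14.0}
    for name, gbps in quoted_rates.items():
        mode, ratio = plan_transmission(gbps)
        print(f"{name}: {mode}, {ratio:.1f}:1")
```

Only the 4K60 signal triggers compression here, and the resulting ratio matches the "fairly light" 1.4:1 figure.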
That naturally raises the question of why the pro AV market has settled into the use of a 10G switch for video streaming. After all, a 10G switch is still much more expensive than a 1G switch. Kennington believes it comes down to AV integrators being able to visualize the "trade-off between image quality, latency, and bandwidth."
The cheapest part of the overall equation for an AV integration, at least one that sits within a single physical location, such as a school or college campus, is the bandwidth. This, Kennington explains, is where AV differs from long-distance streaming: "[I]n pro AV, the latency requirements are basically fixed on a per-use-case basis. Image quality demands are going up--higher resolutions, higher frame rates, higher color bit depths--but bandwidth is unique since bandwidth on an Ethernet switch is cheap and getting cheaper. So use it!"
Kennington agrees that other approaches to moving content across an Ethernet network are valid, adding, "Far be it from me to say Netflix isn't successful!" But he notes that these approaches "create latency penalties and compromise image quality in ways that the pro AV market cannot easily accept."
A Middle Ground?
IDK's Iwasaki notes there is a need for a middle ground between the very high data rate of an SDVoE codec and the typical live-streaming requirement for someone sending a stream from one city or continent to another: "Some customers need to stream the video longer distance, for example from Japan to the US. In that case, the customer needs to minimize bandwidth using [another] codec like H.264/265. IDK is also preparing a unit which can bridge SDVoE and H.264 for this purpose."
Iwasaki adds that the bridge unit is still a concept, and that--to avoid concatenation issues and maintain proper color space--the SDVoE video would be decoded back to baseband video and then re-encoded in H.264 for standard streaming delivery.
"At this year's InfoComm," Iwasaki says, "we are going to have a prototype concept encoder which can capture, streaming out, image from our receiver unit and control from our management system. These concepts help people who want to integrate a real-time solution and real out-going streaming signal together. The only current way to do it requires decoding the signal to baseband once [between encoding in H.264 and SDVoE]. Maybe the SDVoE Alliance will provide direct re-encode capability in the future."
Iwasaki also points out that recording a presentation in the SDVoE codec is not yet possible, and the SDVoE Alliance's Kennington confirms that the SDVoE codec is only for use in live-transmission scenarios. That's where a standards-based codec like H.264 or H.265 would come into play.
"If a customer wants to have recording or network streaming capability for these signals," Iwasaki posits, "H.264/265 will be used since it can reduce the bandwidth of the signal using high compression."
Losing latency won't be relevant for recorded content, according to Iwasaki, but the loss of video quality using an MPEG-based video codec will still be apparent for high-resolution content.
A New Stool?
Kennington also suggests what might be a new three-legged stool for the streaming industry to begin measuring itself against for proper balance in encoding and delivery: "Latency, price, and power consumption loom large over this discussion of quality and bandwidth."
To get to zero-frame latencies requires an extraordinary amount of computational power, and Kennington notes that existing standards-based MPEG codecs have price and power consumption issues beyond just the fundamental quality and latency questions.
"The computational complexity of those algorithms is also much, much higher, which has implications on cost and power consumption," Kennington says, "especially in a realtime encoder. The only chip I'm aware of for live HEVC encode is from Socionext, costs over $1,000, and consumes over 35 watts, where our partner manufacturer end points sell for $1,000 to $2,000." While he doesn't want to speak to complete details in this forum, he does say, "[W]e're more than 85% better on price and power than that."
As we close, here's a reminder that the AV and streaming industries are on parallel paths. In many ways, the two industries are separated only by a slightly different language and differing approaches around specific live use cases.
The AV industry has a bit of a flawed understanding of what typical streaming--especially the classic video-on-demand premium asset that's encoded into thousands of 2-10 second HLS segments--entails in terms of latencies. Walk the show floor at an AV industry event like InfoComm, and you'll often hear a zero-latency proponent talk about on-demand encoding that requires racks of servers and up to a week of encoding time to "get it right" for quality encoding.
Yet the AV industry fairly questions the efficacy of H.264 and H.265 because both are based on DCT and therefore introduce a number of problems that find the codecs stepping on their own feet when trying to compete in the zero-frame latency dance-off.
Is it time for a new approach to codecs, with a single codec handling zero-frame latency for local delivery as well as scalability for very-low-latency remote delivery? The answer is a definitive, "Yes," and we, as the streaming industry, would do well to step up our game in driving down latency, price, and power consumption in this new era of IP video delivery.
BY TIM SIGLIN
Tim Siglin (firstname.lastname@example.org) is a streaming industry veteran and longtime contributing editor to Streaming Media magazine.
Comments? Email us at email@example.com, or check the masthead for other ways to contact us.
Date: Jul 1, 2019