What About the Hardware?
Hardware-assisted encoding is changing the live transcoding game. Here's a look at how the players from NVIDIA, Intel, and NGCodec perform.
This article analyzes several hardware-based transcoding solutions. For H.264, we compare the NVIDIA H.264 and Intel Quick Sync codecs, analyzing performance and output quality for live transcoding applications. For perspective, we also included FFmpeg's x264 codec using both the medium (default) and veryfast settings. For HEVC, we evaluated Intel's SVT (Scalable Video Technology)-HEVC, a software-based codec that purports to deliver hardware-like performance; NGCodec's FPGA-based HEVC encoder; and x265 using the medium and veryfast presets.
In both cases, we measured performance using the encoding ladder shown in Table 1 (on the next page) with 1080p 60 fps source clips. That is, on each tested computer, we tested whether the codec could produce the entire ladder, and if so, how many simultaneous instances of the ladder it could produce. For software encoders, we allowed frame rates to drop to 55 fps, while for hardware-based encoders, we allowed no dropped frames.
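As a rough illustration of what producing "the entire ladder" involves, the sketch below builds one x264 FFmpeg command per rung of Table 1. It is not the script used in our tests: the one-command-per-rung layout, the scale filter, the 2-second key frame interval on the 30 fps rungs, and the output file names are all assumptions made for the example.

```python
# Rungs from Table 1: (height, frames per second, target bitrate in kbps).
LADDER = [
    (1080, 60, 6000),
    (1080, 30, 4000),
    (720, 30, 2500),
    (540, 30, 1200),
    (360, 30, 800),
]


def ladder_commands(source, preset="veryfast", gop_seconds=2):
    """Return one ffmpeg argument list per rung of the ladder."""
    commands = []
    for height, fps, kbps in LADDER:
        rate = f"{kbps}k"
        commands.append([
            "ffmpeg", "-y", "-re", "-i", source,
            "-c:v", "libx264", "-preset", preset,
            "-vf", f"scale=-2:{height}",    # keep aspect ratio, force an even width
            "-r", str(fps),
            "-b:v", rate, "-maxrate", rate, "-bufsize", rate,  # 1x buffer
            "-g", str(fps * gop_seconds),   # assumed 2-second key frame interval
            f"out_{height}p{fps}.mp4",      # hypothetical output name
        ])
    return commands


for cmd in ladder_commands("input.mp4"):
    print(" ".join(cmd))
```

Running several copies of this set at once, and watching whether the reported frame rate stays at or above the threshold, approximates the "simultaneous ladders" measurement described above.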
In a perfect world, we could have tested all codecs on a single computer to arrive at a uniform cost-per-stream hour. However, the platforms used for hardware-assisted encoding were almost always suboptimal for software-only encoding, which frustrated these efforts. In addition, you'll get different measures of comparative performance based on machine type, and finding the optimal configuration for software and three hardware-based codecs could easily be the topic of a totally separate article. Long story short, we include pricing information for all test encodes, but you'll likely have to do a lot more work to identify the most economically effective instance types for your production transcodes.
After assessing performance, we measured quality via standard rate-distortion curves with BD-Rate analysis. We also measured the subjective quality of the 3Mbps streams produced in each category with web service Subjectify.us. The encoding parameters used for each set of tests are identified below.
We measured objective metrics--Video Multimethod Assessment Fusion (VMAF) and peak signal-to-noise ratio (PSNR)--with four 2-minute test clips. These included segments from Netflix's Meridian and Harmonic's Football test clips, plus the GTAV test clip (2x the 1-minute clip) and a 2-minute compilation of Netflix clips from Xiph.org, including Dinner-Scene, Narrator, Square And Time Lapse, and BarScene (go2sm.com/xiphtest).
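For readers who want to reproduce this kind of measurement without a dedicated tool, FFmpeg itself can compute both metrics through its libvmaf and psnr filters. The helper below is a minimal sketch, not the workflow used for this article; it assumes an FFmpeg build with libvmaf enabled, and that the encoded and source files share the same resolution and frame rate (lower ladder rungs would need to be scaled first).

```python
def metric_command(distorted, reference, metric="libvmaf"):
    """Build an ffmpeg command that scores `distorted` against `reference`.

    The first input is the encoded file and the second is the source,
    which is the input order the libvmaf and psnr filters expect.
    """
    if metric not in ("libvmaf", "psnr"):
        raise ValueError("metric must be 'libvmaf' or 'psnr'")
    return [
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", metric,
        "-f", "null", "-",   # decode and score only; write no output file
    ]


print(" ".join(metric_command("encoded.mp4", "source.mp4")))
print(" ".join(metric_command("encoded.mp4", "source.mp4", metric="psnr")))
```

The resulting list can be passed to subprocess.run(); the score appears in FFmpeg's log output. The file names here are placeholders.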
NGCodec suggested subjective testing later in the process after the first round of encodings and objective testing was completed. Ordinarily, you would perform objective and subjective tests using the same clips. However, Subjectify.us recommends test clips no more than 20 seconds long, so for these tests, we excerpted the first 20 seconds of each clip and tested those.
Multiple NVIDIA GPUs contain one or more hardware-based encoders or decoders, which are separate from the CUDA cores, freeing both the graphics engine and CPU for other tasks (go2sm.com/nvidiasdk). We tested the NVIDIA H.264 encoder using a G3.4xlarge AWS workstation set up for us by engineers at Softvelum, who have significant experience with NVIDIA-based transcoding to support video producers using its Nimble Streamer cloud transcoder. AWS G3 instances include NVIDIA Tesla M60 GPUs that are used during the hardware encode. As with all AWS instances, pricing varies widely based on commitment level, with the Linux spot price at $1.14 per hour when we tested.
I derived the NVIDIA script from a white paper titled, "Using FFmpeg With NVIDIA GPU HW Acceleration" (go2sm.com/nvidiaffmpeg; membership required), ultimately using the following script:
ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:v h264_nvenc -preset medium -b:v 5M -bufsize 5M -maxrate 5M -qmin 0 -g 120 -bf 2 -temporal-aq 1 -rc-lookahead 20 -i_qfactor 0.75 -b_qfactor 1.1 output.mp4
This differed from the NVIDIA recommendations in two meaningful ways. First, we substituted the medium preset for the recommended slow preset to improve performance; second, we limited the buffer to 1x the data rate to minimize bitrate variability. We also changed the key frame interval from 250 frames to 120. We ran test encodes with both the original script and the final script, and the VMAF rating of the video produced by the final script was actually a bit higher, 82.19 to 81.82.
Using the final script, we were able to produce two simultaneous encoding ladders on the G3.4xlarge instance for a cost per ladder of about 57 cents per hour. In discussing our findings with NVIDIA, we learned that the company offers much more powerful hardware that provides much greater encoding density, which obviously will impact the cost per ladder.
We used the following command script for the medium and veryfast x264 encodes, obviously changing the preset as needed:
ffmpeg -y -re -i input.mp4 -c:v libx264 -preset medium -b:v 5M -bufsize 5M -maxrate 5M -g 120 output.mp4
The NVIDIA-optimized G3.4xlarge computer couldn't produce a single encoding ladder with the x264 codec, even using the veryfast preset. So, we switched to a compute-intensive C5.18xlarge instance, which cost $0.9438 per hour (spot pricing) and produced four simultaneous encodes of 55 fps or higher using the veryfast preset, for a cost per ladder of about 24 cents per hour. Using the medium preset, the system eked out two simultaneous encodes for a cost per ladder of about 47 cents per hour.
We ran two separate sets of tests with Intel Quick Sync, both times using scripts recommended by Intel. The first set, the results of which we presented at Streaming Media East, revealed significant transient quality drops in the Football clip. For the second set of tests, Intel added the two lookahead switches (-look_ahead and -look_ahead_depth) that appear at the end of the script below, which eliminated this problem:
ffmpeg -re -hwaccel qsv -c:v h264_qsv -y -i input.mp4 -filter_scale_threads 4 -c:v h264_qsv -vf hwupload=extra_hw_frames=64,format=qsv -preset 4 -b:v 5M -maxrate 5M -bufsize 5M -g 120 -idr_interval 2 -async_depth 5 -look_ahead 1 -look_ahead_depth 30 output.mp4
You can see the difference in Figure 1, which shows the VMAF scores of the Intel Quick Sync clips produced with the lookahead (in red) and without the lookahead (in green) as displayed in the Moscow State University Video Quality Measurement Tool. The green downward spikes each represented very noticeable transient quality drops that the updated encoding parameters with the lookahead obviously eliminated.
We encoded with Intel Quick Sync using preset 4. To choose this, we measured the encoding speed and VMAF quality of each preset by encoding the most challenging clip in our test suite (Football) to 1080p at 3Mbps, yielding the data shown in Figure 2 (on the next page). As you can see, presets 3 and 4 present a good balance between speed and quality, although producers seeking to eke out the last bit of encoding speed could justify preset 6 as delivering about 9% better performance with only a minimal quality drop.
Intel created the test station on a cloud system hosted by phoenixNAP that was driven by a single-socket Intel Xeon CPU E3-1585L v5 running at 3.00 GHz, with 4 cores and integrated Intel Iris Pro Graphics that includes Intel Quick Sync encoding and decoding. phoenixNAP doesn't rent by the hour, but the machine cost was $250 per month, including 15TB of egress data transfer. Best case, if you ran the system 24/7 for a 30-day month, this would translate to about 35 cents per hour.
Figure 2. Choosing the preset for Intel Quick Sync
Intel Quick Sync - H264 Performance vs. Quality
Preset     FPS   VMAF
Preset 1   128   73.75
Preset 2   202   73.64
Preset 3   239   73.29
Preset 4   239   73.29
Preset 5   247   73.25
Preset 6   260   73.11
Preset 7   275   69.82
Note: Table made from line graph.
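The speed-versus-quality trade-off in Figure 2 is easy to quantify. The short sketch below uses the FPS and VMAF values from the figure to compare any preset against a baseline; the function name and data layout are ours, chosen only for illustration.

```python
# FPS and VMAF per preset, copied from Figure 2 (Football clip, 1080p at 3Mbps).
QSV_PRESETS = {
    1: (128, 73.75),
    2: (202, 73.64),
    3: (239, 73.29),
    4: (239, 73.29),
    5: (247, 73.25),
    6: (260, 73.11),
    7: (275, 69.82),
}


def compare_presets(candidate, baseline, data=QSV_PRESETS):
    """Return (speed gain in percent, VMAF change) of candidate vs. baseline."""
    cand_fps, cand_vmaf = data[candidate]
    base_fps, base_vmaf = data[baseline]
    speed_gain = (cand_fps / base_fps - 1.0) * 100.0
    return speed_gain, cand_vmaf - base_vmaf


gain, vmaf_delta = compare_presets(6, 4)
print(f"preset 6 vs. 4: {gain:+.1f}% fps, {vmaf_delta:+.2f} VMAF")
```

For preset 6 against preset 4, this works out to roughly 9% more frames per second for a VMAF drop of under 0.2 points, which is the trade-off described in the text.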
Interestingly, without the lookahead parameters, the test system could sustain two simultaneous encoding ladders for a cost per ladder of about $0.175 per hour. With the lookahead parameters in the command string, the system could sustain only one encoding ladder at full frame rate, for a cost per ladder of about 35 cents per hour. I realize that comparing monthly pricing versus spot pricing isn't fair, but that's the data I have.
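All of the H.264 cost-per-ladder figures come from the same arithmetic: the hourly machine price divided by the number of ladders the machine sustained. The sketch below reruns that calculation with the prices and ladder counts reported above; the helper names are ours, and the monthly-to-hourly conversion assumes the best-case 30-day, 24/7 usage described for the phoenixNAP system.

```python
HOURS_PER_30_DAY_MONTH = 30 * 24


def monthly_to_hourly(monthly_cost, hours=HOURS_PER_30_DAY_MONTH):
    """Best-case hourly cost of a machine rented by the month and run 24/7."""
    return monthly_cost / hours


def cost_per_ladder(hourly_cost, ladders):
    """Hourly machine cost divided by the simultaneous ladders it sustains."""
    if ladders < 1:
        raise ValueError("the machine must sustain at least one ladder")
    return hourly_cost / ladders


quick_sync_hourly = monthly_to_hourly(250.0)

H264_COSTS = {
    "NVIDIA (G3.4xlarge)": cost_per_ladder(1.14, 2),
    "x264 veryfast (C5.18xlarge)": cost_per_ladder(0.9438, 4),
    "x264 medium (C5.18xlarge)": cost_per_ladder(0.9438, 2),
    "Quick Sync, no lookahead": cost_per_ladder(quick_sync_hourly, 2),
    "Quick Sync, with lookahead": cost_per_ladder(quick_sync_hourly, 1),
}

for name, cost in H264_COSTS.items():
    print(f"{name}: ${cost:.2f} per ladder per hour")
```

The Quick Sync figure without lookahead comes out slightly under the $0.175 quoted in the text because the text halves the rounded 35-cent hourly rate rather than the unrounded one.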
Evaluating the Output
High-volume publishers care about multiple aspects of the output stream, including quality and data rate variability. As we learned from our 2019 NAB Show interview with Twitch's Yueshi Shen (go2sm.com/shen), when you're pushing hundreds of thousands of streams, even slight variations in data rate can cause delivery issues. Figure 3 shows data rate graphs of the four 3Mbps streams from the Football clip from the Hybrik Cloud platform's Media Analyzer feature. You see that the top two streams from Intel and NVIDIA show much less variation than the two x264 streams and are also much closer to the targeted data rate.
Table 2 shows various datapoints regarding the 3Mbps Football stream produced by all technologies. They demonstrate that the hardware codecs were more accurate and less variable, with Intel Quick Sync having a slight advantage over NVIDIA with a lower standard deviation and lower max data rate. If data rate variability is an issue for your live events, you should strongly consider a hardware codec.
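If you want to produce the same three datapoints for your own streams, the calculation is simple once you have a series of bitrate samples. The sketch below assumes per-second samples in kbps exported from an analysis tool and uses the population standard deviation; we don't know which form of standard deviation the analyzer used for Table 2, so treat that choice as an assumption. The sample values are made up for the example.

```python
import statistics


def bitrate_summary(samples_kbps):
    """Summarize a series of bitrate samples (kbps)."""
    return {
        "mean": statistics.fmean(samples_kbps),
        "stdev": statistics.pstdev(samples_kbps),  # population form (assumed)
        "max": max(samples_kbps),
    }


# Hypothetical per-second samples for a stream targeting 3Mbps.
summary = bitrate_summary([2900, 3000, 3100])
print(summary)
```

A lower standard deviation and a maximum closer to the target both indicate a stream that is easier to deliver at scale.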
Figure 4 is the overall VMAF rate-distortion curve for the four measured technologies, showing NVIDIA with a very slight lead over Intel Quick Sync and x264 medium, with x264 veryfast noticeably behind. There were some variations among the individual test clips, with Intel Quick Sync producing the highest quality in the GTAV and Meridian clips and NVIDIA substantially ahead of Intel Quick Sync in the Football clip.
Table 3 (on the next page) contains the overall BD-Rate computation for PSNR (not VMAF), which shows that by this metric, Intel Quick Sync enjoyed a slight advantage over NVIDIA, again with x264 medium very close and x264 veryfast trailing significantly. For those who are not familiar with BD-Rate computation, reading the top line horizontally shows that Intel Quick Sync can produce the same quality as NVIDIA, x264 medium, and x264 veryfast, with a data rate reduction of .46%, 2.12%, and 16.07%, respectively. Positive numbers indicate that higher data rates will be needed to produce the same quality.
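To make the BD-Rate idea concrete, here is a simplified sketch of the calculation. The standard method fits a cubic polynomial to each rate-distortion curve; this version substitutes piecewise-linear interpolation of log bitrate against quality so that it needs no external libraries, which means its results will differ somewhat from the figures in Table 3. The data points in the example are hypothetical, and each curve is assumed to have strictly increasing quality values.

```python
import math


def _avg_log_rate(points, q_lo, q_hi):
    """Average log(bitrate) over [q_lo, q_hi] for one rate-distortion curve.

    `points` is a list of (bitrate_kbps, quality) pairs.
    """
    pts = sorted((q, math.log(r)) for r, q in points)

    def log_rate_at(q):
        for (q0, l0), (q1, l1) in zip(pts, pts[1:]):
            if q0 <= q <= q1:
                return l0 + (q - q0) / (q1 - q0) * (l1 - l0)
        raise ValueError("quality outside the measured range")

    # The curve is linear between knots, so the trapezoid rule is exact.
    knots = [q_lo] + [q for q, _ in pts if q_lo < q < q_hi] + [q_hi]
    area = sum(
        0.5 * (log_rate_at(a) + log_rate_at(b)) * (b - a)
        for a, b in zip(knots, knots[1:])
    )
    return area / (q_hi - q_lo)


def bd_rate_percent(reference, test):
    """Percent bitrate change of `test` vs. `reference` at equal quality.

    Negative values mean `test` needs a lower data rate for the same quality.
    """
    q_lo = max(min(q for _, q in reference), min(q for _, q in test))
    q_hi = min(max(q for _, q in reference), max(q for _, q in test))
    if q_hi <= q_lo:
        raise ValueError("the two curves share no quality range")
    diff = _avg_log_rate(test, q_lo, q_hi) - _avg_log_rate(reference, q_lo, q_hi)
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical curves: the test codec reaches each PSNR at 10% less bitrate.
reference = [(1000, 30.0), (2000, 34.0), (4000, 38.0), (8000, 42.0)]
test = [(rate * 0.9, psnr) for rate, psnr in reference]
print(f"{bd_rate_percent(reference, test):.2f}%")
```

The sign convention matches the table: a negative result means the codec in the row reaches the same quality as the codec in the column at a lower data rate.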
Figure 5 shows the overall subjective ratings gathered by Subjectify.us, with 218 participants choosing the higher quality of 3Mbps versions of the four test clips in round-robin comparisons. These results confirm the overall objective findings that show the two hardware codecs slightly ahead of x264 medium and significantly ahead of x264 veryfast.
Note that there was a great deal of variation in the subjective results on a clip-by-clip basis. For example, NVIDIA enjoyed a significant advantage over Intel Quick Sync in the Football and Meridian test clips, which Intel Quick Sync reversed with a substantial lead in the GTAV clip, where both hardware codecs ranked behind the x264 medium clip. These roughly followed the objective scores, but not completely. If your content is gaming or animation, you should definitely run your own tests with your own videos.
Note that for these H.264 clips, we did not use any tuning mechanisms for the objective benchmarks, because according to Intel, there was no way to disable adaptive quantization or otherwise tune the Intel Quick Sync clips for VMAF or PSNR. As you'll read, we did tune for the HEVC objective benchmarks and did not tune for the HEVC clips tested by Subjectify.us.
Overall, the quality difference among Intel Quick Sync, NVIDIA, and x264 medium wasn't much of a differentiator. For publishers pushing huge stream counts, the data rate stability of the hardware codecs will prove very attractive. But if this doesn't matter to you, it really comes down to the cheapest option.
Again, for HEVC, we tested Intel's SVT-HEVC codec, NGCodec's FPGA-based codec, and x265 using two presets: medium and veryfast. Although not technically a hardware-based codec, Intel's SVT line of codecs has been designed to run extremely efficiently on Intel Xeon Scalable processors and Intel Xeon D processors. The HEVC codec has 10 presets, which delivered the performance and quality shown in Figure 6 (on page 34) for the Football clip encoded at 3Mbps. Intel recommended that we test using preset 6, so we did.
The SVT-HEVC codec has three tuning modes--0 to optimize for visual quality, 1 to optimize for PSNR/SSIM, and 2 to optimize for VMAF. Note that the default is 1, so if you don't specify tune 0 for your encode, you won't get optimal visual quality. We used tune 0 for the subjective tests and tune 2 for both VMAF and PSNR. This yielded the following command line, which Intel provided (showing tune 0). For all HEVC encodes, we boosted the buffer size to twice the target data rate to provide a bit more wiggle room for encoding quality.
ffmpeg -SVTnew -i input.mp4 -c:v libsvt_hevc -tune 0 -rc 1 -preset 6 -b:v 5M -maxrate 5M -bufsize 10M -g 120 output.mp4
We tested the Intel SVT-HEVC encoder on a C5.9xlarge system equipped with an Intel Xeon Platinum 8000 series (Skylake-SP) processor that produced two simultaneous encodes of the full encoding ladder using preset 6 tune 0. Spot pricing on the system was $0.3466 per hour, yielding a cost per ladder of around $0.1733 per hour. On the same system, encoding with the x265 veryfast preset failed to produce a single encoding ladder at the requisite 55 fps. There are certainly larger computers that could produce our encoding ladder using the veryfast preset in real time, but they would definitely be pricey.
We used the script shown below for the x265 encodes, switching between the veryfast and medium presets and tuning for PSNR for the objective testing and not tuning for files produced for the subjective trials:
ffmpeg -re -i input.mp4 -c:v libx265 -preset veryfast -x265-params keyint=120:bitrate=5000:vbv-maxrate=5000k:vbv-bufsize=12000 -tune psnr -pix_fmt yuv420p output.mp4
NGCodec provided the script below for our testing. Its HEVC codec doesn't have presets and automatically produces constant bitrate (CBR) streams. We used the -aq-mode 0 switch to disable adaptive quantization for our objective tests and removed the switch to prepare the 3Mbps files for subjective testing:
ffmpeg -y -re -i input.mp4 -c:v NGC265 -b:v 5M -g 0 -idr-period 120 -aq-mode 0 output.mp4
We tested on an FPGA-based cloud computer (AS-f1.2fx8c) hosted by Altered Silicon, which featured two FPGA cards and cost $2.21 per hour, including the NGCodec software. We were able to create one stream for the entire card, but NGCodec claims that by the time you read this article, you should be able to produce up to two complete ladders per FPGA, for a cost of about 54 cents per ladder per hour. If you consider the NGCodec system, you should verify this performance up front.
Figure 7 (on the next page) shows the data rate variability of the four encodes, with NGCodec noticeably tighter than Intel and the two x265 encodes.
Table 4 (on the next page) presents the numbers supporting these figures, with NGCodec showing a much lower standard deviation than any of the other three technologies, confirming the tighter pattern. At least in terms of the tightness of the data rate pattern, SVT-HEVC performs more like a software codec than a hardware codec.
Figure 6. SVT-HEVC's quality and performance by preset
SVT-HEVC Preset Quality and FPS Output (Football Clip @ 3 Mbps)
Preset      FPS     VMAF
Preset 1    26      82.46
Preset 2    26.23   81.35
Preset 3    64      81.35
Preset 4    118     80.10
Preset 5    151     79.46
Preset 6    190     78.26
Preset 7    205     77.64
Preset 8    262     76.80
Preset 9    279     75.88
Preset 10   318     73.80
Note: Table made from line graph.
Figure 8 shows the overall rate-distortion curve for the four clips using the VMAF metric, which has x265 medium in first place, followed by NGCodec, x265 veryfast, and SVT-HEVC in that order. Again, since our fastest test system couldn't even produce a single x265 encoding ladder using the veryfast preset, the x265 medium stream isn't a viable choice for most producers.
Table 5 shows PSNR (not VMAF) BD-Rate computations for the four HEVC codecs in the same order. SVT-HEVC might have performed slightly better had we encoded with tune 1 (PSNR/SSIM) rather than tune 2 (VMAF), but the difference would most likely not have changed the distribution order. If you prefer PSNR over VMAF, you should run your test encodes using tune 1.
Figure 9 shows the average subjective results from Subjectify.us for all of the tested clips, which rated NGCodec the highest by a substantial margin, followed by x265 medium and then SVT-HEVC. Again, the results varied widely by the clip. For example, SVT-HEVC actually rated highest in the Meridian clip, followed very closely by NGCodec and Intel, with x265 medium tied for the lead with NGCodec in the Dinner Scene clip.
In the HEVC trials, NGCodec delivered better quality than SVT-HEVC and a tighter distribution pattern, with a reasonable cost per ladder per hour, assuming NGCodec's performance claims stand up. Intel's SVT-HEVC technology is relatively new, so it will likely improve over time and is definitely worth checking out for video-on-demand testing since its performance is so tunable.
Overall, while subjective tests produced similar results to our objective benchmarks in the H.264 trials, they varied significantly for HEVC. Although I personally trust objective metrics for intra-codec configuration decisions like choosing a preset or key frame interval, I'm less confident in the accuracy of objective metrics when comparing different encoders or codecs. Our Subjectify.us costs were well under $500 and well worth the expense. Note that Intel and NGCodec split this cost, which is greatly appreciated.
This series of tests represents our first extensive venture into hardware-based transcoding. We appreciate the assistance from all codec vendors, as well as Softvelum and Subjectify.us, and couldn't have produced this article without it. However, given the sheer number of technologies, configurations, and datapoints measured, it's likely (if not certain) that some errors exist, for which the author takes sole responsibility. Please check the online version of this article for comments and (sigh) corrections before making any technology decisions or starting your own test series.
By Jan Ozer
Jan Ozer is a contributing editor to Streaming Media magazine and the author of Learn to Produce Videos With FFmpeg in Thirty Minutes or Less, which is available on Amazon. See streaminglearningcenter.com/learnffmpeg for more information.
Table 1. The standard encoding ladder
Resolution   Data Rate
1080p60      6Mbps
1080p30      4Mbps
720p30       2.5Mbps
540p30       1.2Mbps
360p30       0.8Mbps

Table 2. Stream variability of the H.264-encoded streams
Codec              Data Rate (kbps)   Standard Deviation   Max Data Rate (kbps)
Intel Quick Sync   2969               139                  3486
NVIDIA             2965               160                  3669
x264 medium        2885               295                  3497
x264 veryfast      2818               327                  3514

Table 3. H.264 BD-Rate computations for PSNR (not VMAF)
PSNR               Intel Quick Sync   NVIDIA   x264 Medium   x264 Veryfast
Intel Quick Sync   X                  -0.46    -2.12         -16.09
NVIDIA             0.46               X        -1.81         -15.85
x264 medium        2.17               1.85     X             -14.03
x264 veryfast      19.18              18.84    16.32         X

Table 4. Stream variability of the HEVC-encoded streams
Codec            Data Rate (kbps)   Standard Deviation   Max Data Rate (kbps)
Intel SVT-HEVC   3013               253                  3897
NGCodec          3076               149                  3548
x265 medium      2990               253                  3661
x265 veryfast    2989               240                  3652

Table 5. HEVC BD-Rate computations for PSNR (not VMAF)
PSNR            NGCODEC   SVT-HEVC-P6   x265 Medium   x265 Veryfast
NGCODEC         X         -17.64        5.86          -8.24
SVT-HEVC-P6     21.41     X             28.57         11.11
x265 medium     -5.53     -22.22        X             -13.33
x265 veryfast   8.98      -10.00        15.38         X
|Date:||Jul 1, 2019|