EETOP 创芯网论坛 (原名:电子顶级开发网)
Views: 6017 | Replies: 2

[Repost] Combining Programmable DSPs and FPGAs for Efficient H.264/AVC Video Coding

Posted on 2006-1-20 03:53:07

Last edited by cjsb37 on 2013-4-29 09:19

By Jerry Banks and Wilson Chung

The convergence of broadband communications and consumer applications is enabling a new set of video applications in the consumer market. Currently, nearly every electronic device manufacturer is considering incorporating some type of video processing solution. This trend is driving the demand for advanced video codecs that can provide high quality video at lower bit rates than traditional video codecs. This article describes how combining the power of DSPs and FPGAs to implement a high-performance H.264/AVC video coding standard can increase efficiency and meet the demands of these emerging consumer video applications.
Market research has predicted enormous growth opportunities for devices such as video cell phones, set-top boxes, next-generation DVD players, and home gateways. See Figure 1 for a graph indicating projected advanced codec shipments in millions.

Historically, MPEG-2 has been the primary video coding standard used for consumer-oriented applications. Recent efforts by the ISO/IEC and ITU-T organizations to develop the H.264/AVC video codec (MPEG-4 Part 10, not Part 2), together with Microsoft's Windows Media Video 9 (WMV9), have pushed these two new codecs to the forefront. It appears both will play a significant role in fueling emerging applications. The DVD Forum and the European DVB Consortium have selected the H.264/AVC format as one of the next-generation video codec standards. These announcements, plus endorsements from Hollywood studios, content distributors, and broadcast infrastructure providers, have confirmed the importance of this new coding standard.
H.264/AVC promises improved coding efficiency over previous video coding standards (high quality at roughly half the bit rate) at the cost of increased algorithmic complexity, presenting tremendous engineering challenges to system architects, DSP engineers, and hardware designers. The H.264/AVC standard has ushered in the most significant changes and algorithmic discontinuities in the evolution of video coding standards since the introduction of H.261 and MPEG-1 in the early ’90s.
The issue at hand is that the promised high quality and low bit rate of H.264/AVC requires extremely high performance signal processing to encode the video stream in real-time environments such as broadcast and video conferencing systems. A typical high-performance, multimedia-programmable DSP processor, such as the Texas Instruments TMS320 DM642, is capable of handling these performance requirements in real time at reduced resolutions, for a number of channels, and/or for tandem coding (simultaneous encode and decode). Adding high-performance Xilinx FPGA coprocessor(s) to perform some of the more computationally intensive algorithmic tasks allows these processors to extend their scope of real-time encode and decode support of any advanced video codecs to include Standard Definition (SD) and High Definition (HD) resolutions, and higher channel density.
System level and algorithmic complexity considerations
The algorithmic computational complexity, data locality, and algorithm and data parallelism required to implement the H.264/AVC coding standard often directly influence the overall architectural decision at the system level. In turn, this determines the ultimate cost of developing any commercially viable H.264/AVC system solution in the broadcasting, video editing, teleconferencing, and consumer electronics fields.
To achieve a real-time H.264/AVC SD or HD resolution encoding solution, system architects often employ multiple FPGAs and programmable DSPs. To understand the computational complexity and memory access requirements, we profiled the H.264/AVC encoder software model provided by the Joint Video Team (JVT), comprising experts from ITU-T’s Video Coding Experts Group (VCEG) and ISO/IEC’s Moving Picture Experts Group (MPEG). As primary candidates, we identified the motion estimation, macro-block/block processing (including mode decision), and motion compensation modules. As the secondary candidates for FPGA hardware acceleration, we identified block transform coding, de-block filtering, and entropy coding modules.
FPGAs as high performance coprocessors
Figure 2 illustrates where conventional programmable DSP processors fit today in terms of video codec complexity, the ability to handle multiple video streams, and/or higher video resolutions. By combining the extremely high performance capabilities of Xilinx FPGAs, such as the Virtex-4 and Spartan-3 family devices, with the Texas Instruments TMS320DM64x family of DSP processors, we can increase the scope of system design solutions. This gives system designers flexibility, programmability, and hardware computational acceleration in the hardware/software functional partitioning. By utilizing FPGAs as coprocessors, we can support more complex video coding standards such as H.264/AVC at higher profiles and levels, more video channels, and tandem encoding and decoding.

What to accelerate?
In system design, computational complexity alone does not determine whether a functional module should be mapped to hardware or remain in software. To evaluate the viability of software/hardware partitioning of an H.264/AVC implementation on a platform comprising a mixture of FPGAs, programmable DSPs, or general-purpose host processors, it is essential to consider a number of architectural issues that influence the overall design decision.
Data locality
In a synchronous design, the ability to access memory in a particular order and granularity while minimizing the clock cycles lost to latency, bus contention, alignment, DMA transfer rate, and the types of memory (such as SRAM and SDRAM) is very important. The physical interfaces between the data unit and the arithmetic unit, or the processing engine, primarily dictate the data locality issue.
Data parallelism
Most signal processing algorithms operate on data that exhibits a high degree of parallelism (such as FIR filtering). Single Instruction Multiple Data (SIMD) and vector processors are particularly efficient when the data can be formulated in a parallel or vector format (long data widths).
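The idea can be sketched in plain C: packing four 8-bit pixels into one 32-bit word and averaging them in a single pass is the same lane-wise trick that SIMD units and wide FPGA datapaths apply across much larger words. This is an illustrative sketch, not code from the article.

```c
#include <stdint.h>

/* SIMD-style data parallelism in plain C: four 8-bit pixels packed
 * into one 32-bit word are averaged per byte lane in one pass,
 * using masking so carries never cross lane boundaries:
 *   avg = (a & b) + (((a ^ b) >> 1) & 0x7F7F7F7F)   (floor average) */
uint32_t avg4_packed(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) >> 1) & 0x7F7F7F7FU);
}
```

An FPGA datapath generalizes this directly: the same per-lane adder is simply instantiated as many times as the memory port width allows.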
The FPGA fabric exploits this by providing a large amount of block RAM to support very high aggregate memory bandwidth requirements. In the new Xilinx Virtex-4 SX device family, the amount of block RAM matches closely with the number of XtremeDSP slices:
SX25: 128 block RAM, 128 DSP slices
SX35: 192 block RAM, 192 DSP slices
SX55: 320 block RAM, 512 DSP slices
Signal processing algorithm parallelism
In a typical programmable DSP or a general-purpose processor, engineers often refer to signal processing algorithm parallelism as Instruction Level Parallelism (ILP). A Very Long Instruction Word (VLIW) processor is an example of such a machine that exploits ILP by grouping multiple instructions (ADD, MULT, and BRA) for execution in a single cycle. A heavily pipelined execution unit in the processor is also an excellent example of hardware that exploits the parallelism. Modern programmable DSPs have adopted this architecture, including the Texas Instruments TMS320C64x.
Nonetheless, not all algorithms can exploit such parallelism. Recursive or inherently serial algorithms, such as IIR filtering, Variable Length Coding (VLC) in MPEG-1/2/4, Context-Adaptive Variable Length Coding (CAVLC), and Context-Adaptive Binary Arithmetic Coding (CABAC) in H.264/AVC, are particularly sub-optimal and inefficient when mapped onto these programmable DSPs, because data recursion prevents effective use of ILP. Instead, engineers can efficiently build dedicated hardware engines for them in the FPGA fabric.
Computational complexity
A programmable DSP is bounded in computational throughput by the clock rate of the processor. Signal processing algorithms implemented in the FPGA fabric are typically computationally intensive; examples include the Sum of Absolute Differences (SAD) engine in motion estimation and video scaling. By mapping these modules onto the FPGA fabric, the host processor or programmable DSP gains extra cycles for other algorithms. Furthermore, FPGAs can have multiple clock domains in the fabric, so selected hardware blocks can run at separate clock speeds based on their computational requirements.
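The SAD kernel mentioned above is a good example of why: in software it serializes into 256 iterations, while in fabric all 256 absolute differences are independent and can feed an adder tree in parallel. A minimal C reference version looks like this:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of Absolute Differences over a 16x16 block.
 * cur/ref point to the top-left pixel of the current and
 * reference blocks; stride is the row pitch of both frames.
 * In FPGA fabric the 256 |a-b| terms are computed in parallel
 * and summed with an adder tree; in C they serialize into
 * 256 loop iterations. */
uint32_t sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    uint32_t sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += (uint32_t)abs(cur[x] - ref[x]);
        cur += stride;
        ref += stride;
    }
    return sad;
}
```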
Theoretic optimality in quality
Any theoretic optimal solution based on the rate-distortion curve can be achieved if and only if the complexity is unbounded. In a programmable DSP or general-purpose processor, the computational complexity is always bounded by the clock cycles available. FPGAs, on the other hand, offer much more flexibility by exploiting data and algorithm parallelism by means of multiple instantiations of the hardware engines, or increased use of block RAM and register banks in the fabric.
The number of instruction issues per cycle, the level of pipeline in the execution unit, or the maximum data width to fully feed the execution units often limit a programmable DSP or general-purpose processor. Video quality is often compromised from the limited cycles available per task in a programmable DSP, whereas hardware resources are fully allocated in FPGA fabric (three-step vs. full-search motion estimation).
Implementing functional modules onto FPGAs
Figure 3 shows the overall H.264/AVC macroblock-level encoder with major functional blocks and data flows defined. One of the primary successes of the H.264/AVC standard is its ability to predict the values of the content of a picture to be encoded by exploiting pixel redundancy in ways and directions not exploited in previous standards. Unfortunately, compared to previous standards, this increases the complexity and memory access bandwidth approximately four-fold.

Improved prediction methods
The following paragraphs highlight some of the H.264/AVC video coding standard’s main features that enable enhanced coding efficiency, and present an evaluation of these functional modules based on the design criteria discussed in the previous section.
Quarter-pixel-accurate motion compensation
Older standards use half-pixel motion vector accuracy. The new design improves this feature by providing quarter-pixel motion vector accuracy. The prediction values at half-pixel positions are calculated by applying a one-dimensional, six-tap FIR filter [1, -5, 20, 20, -5, 1]/32 horizontally and vertically. Prediction values at quarter-pixel positions are generated by averaging samples at the full- and half-pixel positions. These sub-sampling interpolation operations can be efficiently implemented in hardware inside the FPGA fabric.
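A one-dimensional sketch of this interpolation in C follows, using the filter taps stated above; the rounding and clipping shown here are the usual conventions for this filter, and the function names are illustrative.

```c
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Half-pel interpolation with the six-tap filter
 * [1, -5, 20, 20, -5, 1] / 32, applied in one dimension.
 * p points at the third of six consecutive full-pel samples
 * (p[-2]..p[3]); the result is the half-pel value between
 * p[0] and p[1], rounded and clipped to 8 bits. */
uint8_t halfpel_6tap(const uint8_t *p)
{
    int sum = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
    return clip255((sum + 16) >> 5);
}

/* Quarter-pel samples are the rounded average of the
 * neighbouring full- and half-pel values. */
uint8_t quarterpel_avg(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}
```

In hardware, the six multiplies reduce to shifts and adds (20x = 16x + 4x, 5x = 4x + x), which is why this stage maps so cheaply onto FPGA fabric.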
Variable block-sized motion compensation with small block size
The standard provides more flexibility for the tiling structure in a macroblock size of 16 x 16 pixels. It enables the use of 16 x 16, 16 x 8, 8 x 16, 8 x 8, 8 x 4, 4 x 8, and 4 x 4 sub-macroblock sizes. Because of the increasing combinations of tiling geometry with a given 16 x 16 macroblock, finding a rate distortion optimal tiling solution is extremely computationally intensive. This additional feature places an enormous burden on the computational engines used in motion estimation, refinement, and the mode decision process.
In-the-loop adaptive deblocking filtering
The deblocking filter has been successfully applied in H.263+ and MPEG-4 part 2 implementations as a post-processing filter. In H.264/AVC, the deblocking filter is moved inside the motion-compensated loop to filter block edges resulting from the prediction and residual difference coding stages of the decoding process. The filtering is applied on both 4 x 4 block and 16 x 16 macroblock boundaries in which two pixels on either side of the boundaries can be updated using a three-tap filter. A content adaptive, non-linear filtering scheme governs the filter coefficients, or strength.
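As a non-normative illustration of the basic operation, the sketch below nudges the two pixels adjacent to a block boundary toward each other; the real H.264/AVC filter additionally gates this on boundary strength and quantizer-derived thresholds before touching any pixel.

```c
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Simplified edge filter sketch: samples p1 p0 | q0 q1 straddle a
 * block boundary. A delta derived from the step across the boundary
 * is added to p0 and subtracted from q0, smoothing the blocking
 * artifact. The normative filter also thresholds the delta and
 * skips filtering entirely when the edge looks like real content. */
void deblock_weak(uint8_t *p0, uint8_t *q0, uint8_t p1, uint8_t q1)
{
    int delta = ((*q0 - *p0) * 4 + (p1 - q1) + 4) >> 3;
    uint8_t np = clip_u8(*p0 + delta);
    uint8_t nq = clip_u8(*q0 - delta);
    *p0 = np;
    *q0 = nq;
}
```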
Directional spatial prediction for intra coding
In cases where motion estimation cannot be exploited, intra-directional spatial prediction is used to eliminate spatial redundancies. This technique attempts to predict the current block by extrapolating the neighboring pixels from adjacent blocks in a defined set of directions. The difference between the predicted block and the actual block is then coded.
This approach is particularly useful in flat backgrounds where spatial redundancies exist. There are nine prediction directions for Intra_4x4 prediction and four for Intra_16x16 prediction. Note that data causality demands fast memory access to the 13 neighboring pixel values above and to the left of the current block for Intra_4x4; for Intra_16x16, 16 neighboring pixels on each side predict the 16 x 16 block.
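Two of the nine Intra_4x4 modes are simple enough to sketch directly: vertical prediction extrapolates the four pixels above the block straight down, and DC prediction fills the block with the rounded mean of the above and left neighbors. This is an illustrative sketch with hypothetical function names.

```c
#include <stdint.h>

/* Vertical prediction: copy the four reconstructed pixels above
 * the block straight down into all four rows. */
void intra4x4_vertical(const uint8_t above[4], uint8_t pred[4][4])
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[y][x] = above[x];
}

/* DC prediction: fill the block with the rounded mean of the
 * four above and four left neighbouring pixels. */
void intra4x4_dc(const uint8_t above[4], const uint8_t left[4],
                 uint8_t pred[4][4])
{
    int sum = 4;                     /* rounding term */
    for (int i = 0; i < 4; i++)
        sum += above[i] + left[i];
    uint8_t dc = (uint8_t)(sum >> 3);
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[y][x] = dc;
}
```

The causality constraint noted above shows up here as a data dependency: `above` and `left` must already be reconstructed before the current block can be predicted, which serializes intra coding within a macroblock.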
Multiple reference picture motion compensation
The H.264/AVC standard offers the option for multiple reference frames in the inter-frame coding. Unless the number of referenced pictures is one, the index at which the reference picture is located inside the multi-picture buffer has to be signaled. The multi-picture buffer size determines the memory usage in the encoder and decoder. These reference frame buffers must be addressed correspondingly during the motion estimation and compensation stages in the encoder.
Weighted prediction
The JVT recognized that when encoding video scenes that involve fades, weighted motion-compensated prediction dramatically improves coding efficiency.
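The mechanism can be sketched for a single reference: each motion-compensated sample is scaled by a weight with a log2 denominator and shifted by an offset, letting the encoder model a fade directly. This is a simplified illustration; the exact clipping, defaults, and bi-predictive form are defined by the standard.

```c
#include <stdint.h>

/* Weighted prediction sketch (single reference, explicit weights):
 * sample is scaled by weight w with log2 denominator logWD and
 * shifted by offset o, then rounded and clipped to 8 bits.
 * With w = 1 << logWD and o = 0 this is an identity, so ordinary
 * (unweighted) prediction is the special case. */
uint8_t weighted_pred(uint8_t sample, int w, int logWD, int o)
{
    int v;
    if (logWD >= 1)
        v = ((sample * w + (1 << (logWD - 1))) >> logWD) + o;
    else
        v = sample * w + o;
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}
```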
Improving coding efficiency
In addition to improved prediction methods, other parts of the standard design were enhanced for improved coding efficiency. Two additional features described below are most likely to affect the overall system architecture based on our design criteria for software and hardware partitioning.
Small block size, hierarchical, exact match inverse, and short-word-length transform
The H.264/AVC, like other standards, applies transform coding to the motion-compensated prediction residual. However, unlike previous standards that use an 8 x 8 Discrete Cosine Transform (DCT), this transform is applied to 4 x 4 blocks and is exactly invertible in a 16-bit integer format. The small block helps reduce blocking and ringing artifacts, while the precise integer specification eliminates any mismatch issues between the encoder and decoder in the inverse transform.
Furthermore, an additional transform based on the Hadamard matrix is also used to exploit the redundancy of 16 DC coefficients of the already transformed blocks. Compared to a DCT, all applied integer transforms have only integer numbers ranging from –2 to 2 in the transform matrix. This enables computation of the transform and the inverse transform in 16-bit arithmetic using only low-complexity shifters and adders.
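Because the transform matrix contains only the integers 1 and 2 (with signs), the forward 4x4 core transform reduces to butterflies of adds, subtracts, and doublings, all within 16-bit arithmetic. The sketch below shows that structure; the standard's folded scaling/quantization stage is omitted here.

```c
#include <stdint.h>

/* Forward 4x4 core transform of H.264/AVC applied to a block of
 * prediction residuals: out = C * in * C^T with
 * C = [ 1  1  1  1; 2  1 -1 -2; 1 -1 -1  1; 1 -2  2 -1 ].
 * Every multiply is by +/-1 or +/-2, so the transform needs only
 * adds, subtracts, and one doubling per butterfly. */
void forward_transform_4x4(const int16_t in[4][4], int16_t out[4][4])
{
    int16_t tmp[4][4];

    /* Horizontal (row) stage. */
    for (int i = 0; i < 4; i++) {
        int s03 = in[i][0] + in[i][3], d03 = in[i][0] - in[i][3];
        int s12 = in[i][1] + in[i][2], d12 = in[i][1] - in[i][2];
        tmp[i][0] = (int16_t)(s03 + s12);
        tmp[i][1] = (int16_t)(2 * d03 + d12);
        tmp[i][2] = (int16_t)(s03 - s12);
        tmp[i][3] = (int16_t)(d03 - 2 * d12);
    }
    /* Vertical (column) stage. */
    for (int j = 0; j < 4; j++) {
        int s03 = tmp[0][j] + tmp[3][j], d03 = tmp[0][j] - tmp[3][j];
        int s12 = tmp[1][j] + tmp[2][j], d12 = tmp[1][j] - tmp[2][j];
        out[0][j] = (int16_t)(s03 + s12);
        out[1][j] = (int16_t)(2 * d03 + d12);
        out[2][j] = (int16_t)(s03 - s12);
        out[3][j] = (int16_t)(d03 - 2 * d12);
    }
}
```

The absence of true multipliers is exactly why this block is cheap in both DSP instruction slots and FPGA fabric.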
Arithmetic and context-adaptive entropy coding
Two methods of entropy coding exist:
A low-complexity technique based on the use of CAVLC
The computationally more demanding algorithm of CABAC
CAVLC is the baseline entropy coding method of H.264/AVC. Its basic coding tool consists of a single VLC of structured Exp-Golomb codes, which by means of individually customized mappings are applied to all syntax elements except those related to the quantized transform coefficients. For the CABAC, a more sophisticated coding scheme is applied. The transform coefficients are first mapped into a 1-D array based on a pre-defined scan pattern. After quantization, a block contains only a few significant non-zero coefficients.
Based on this statistical behavior, five data elements are used to convey information about the quantized transform coefficients of a luminance 4 x 4 block. The efficiency of entropy coding can be improved further by using CABAC.
CABAC consists of two parts: the arithmetic coding core engine and its associated probability estimation, both specified as multiplication-free, low-complexity methods using only shifts and table look-ups. The use of adaptive codes enables adaptation to non-stationary symbol statistics. By using context modeling based on switching between conditional probability models estimated from previously coded syntax elements, CABAC achieves a bit-rate reduction of 5 to 15 percent compared to CAVLC.
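The structured Exp-Golomb codes underlying CAVLC are simple to compute: codeNum k is written as M leading zeros, a 1, then M more bits, where M = floor(log2(k+1)). A small sketch (with an illustrative packed-word return convention) follows:

```c
#include <stdint.h>

/* Unsigned Exp-Golomb code ue(v): codeNum k is coded as
 * M leading zero bits followed by the (M+1)-bit value k+1,
 * where M = floor(log2(k+1)). The code word is returned as
 * the integer k+1 (its leading zeros are implicit) and the
 * total code length in bits is written to *len. */
uint32_t exp_golomb_ue(uint32_t k, int *len)
{
    uint32_t code_word = k + 1;
    int m = 0;                       /* m = floor(log2(k+1)) */
    while ((code_word >> (m + 1)) != 0)
        m++;
    *len = 2 * m + 1;                /* m zeros + (m+1) value bits */
    return code_word;
}
```

Note the serial dependence the article describes: each emitted code changes the bit position (and, in CAVLC, the context) of the next one, which is what resists ILP on a VLIW DSP and motivates a dedicated hardware engine.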
Figure 4 depicts a typical system-level functional block partition of the H.264/AVC SD video codec. The solution is implemented on the Spectrum Digital EVM DM642 evaluation module for the Texas Instruments TMS320DM642 DSP, together with the Xilinx XEVM642-2VP20 Virtex-II Pro or XEVM642-4VSX25 Virtex-4 daughtercard.

Increased coding efficiency equals success
When used in an optimized fashion, the coding tools of the H.264/AVC standard increase coding efficiency by about 50 percent compared to previous video coding standards, such as MPEG-4 part 2 and MPEG-2, for a wide range of bit rates and resolutions. Currently, it is the most likely successor to the widely used MPEG-2. In addition, it will work itself into many communication and consumer electronic devices that employ video coding technology in the next few years.









Posted on 2006-1-24 13:49:11


Could this be translated into Chinese?
Posted on 2007-11-18 15:20:31
Thanks to the OP for sharing.