This post was last edited by cjsb37 on 2013-4-29 09:23
Traditionally, these sensor systems have been supported using a variety of conventional processing technologies such as RISC and DSP processors, or expensive and inflexible custom solutions such as ASICs. As the aggregate throughput requirements for image processing systems approach the GOPS range, these processing technologies become unsuitable, since the required performance can only be met by concatenating processing blocks in a pipeline architecture. This incremental approach to boosting system performance has limitations, and is often an unacceptable solution, particularly for airborne applications that are constrained by factors such as size, weight, power and environmental conditions. Even assuming that these problems are overcome, or at least manageable, applications demanding real time operation cannot function with system latencies on the order of seconds. Compromises are possible through reductions in frame rate and image resolution; however, these sacrifices only degrade overall system performance and do not address the underlying problem. There is, therefore, a requirement to utilise new types of processing platforms featuring flexible, low latency, high performance devices such as field programmable gate arrays (FPGAs) in order to keep pace with the increasing rates at which data is sensed at the front end of the system.
The majority of image processing algorithms involve mathematical functions executing repetitively upon sample data. This data can consist of individual pixels, groups of pixels, or entire image frames supplied to a processing device as part of a data stream.
Until recently, the choice of processing technology for such an application has been limited to a range of microprocessors, with the computational capabilities of each new generation of microprocessor increasing steadily in accordance with Moore’s Law. However, even with the latest generation of microprocessor, the underlying performance limitations of this technology remain. Take, for example, the implementation of a typical image processing operation such as convolution. This involves multiplying each pixel in a window by the corresponding kernel coefficient and summing the results, so a modest 5x5 window requires 25 multiplications per output pixel. On a conventional DSP processor, a latency of several clock cycles per pixel is incurred, since the overall performance of the processor is limited by the number of multiplications that can be performed in parallel. On-chip processor memory is usually too small to buffer a full image frame, and so external memory reads and writes are required to complete a single calculation. This performance bottleneck becomes a problem when larger windows are implemented and higher frame rates and resolutions are used to achieve better imaging accuracy.
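To make the multiplication count concrete, here is a minimal Python sketch of direct 2D convolution that also counts the multiplications it performs. It only illustrates the arithmetic load; it is not the implementation used in any system described here. The function names and the 1024x1024, 30 frames/s figures are assumptions chosen for the example.

```python
def convolve2d_valid(image, kernel):
    """Direct 2D convolution without padding; returns (output, multiplication count)."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    # Flip the kernel in both axes so this is convolution rather than correlation.
    flipped = [row[::-1] for row in kernel[::-1]]
    out, mults = [], 0
    for r in range(ih - kh + 1):
        out_row = []
        for c in range(iw - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * flipped[i][j]
                    mults += 1  # one multiplication per kernel coefficient
            out_row.append(acc)
        out.append(out_row)
    return out, mults


def multiplications_per_second(width, height, fps, window):
    """Multiplications per second if every pixel gets a window x window kernel."""
    return width * height * fps * window * window


# A 6x6 image with a 5x5 kernel yields a 2x2 output: 4 pixels x 25 multiplications.
image = [[1] * 6 for _ in range(6)]
kernel = [[1] * 5 for _ in range(5)]
result, count = convolve2d_valid(image, kernel)
print(count)  # 100

# Hypothetical stream: 1024x1024 pixels at 30 frames/s with a 5x5 window.
print(multiplications_per_second(1024, 1024, 30, 5))  # 786432000
```

A processor that can issue only a few multiplications per cycle has to serialise this work, whereas an FPGA can in principle instantiate all 25 multipliers of the window side by side.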
Take, for example, the TigerSHARC DSP processor from Analog Devices. This device features an I/O bandwidth of 1800 Mbytes/s (including memory) and a peak floating point performance of 1500 MFlops. A Xilinx Virtex-II Pro X FPGA, by comparison, is capable of an I/O bandwidth of 37500 Mbytes/s and a peak floating point performance of 25000 MFlops.
Unsurprisingly, processing technologies such as FPGAs are an attractive solution for many of the computing challenges associated with high performance image processing systems. The reconfigurable array of logic blocks, memories and multipliers provided within FPGAs by vendors such as Altera and Xilinx offers a high performance hardware architecture ideal for building processing pipelines operating at hundreds of MHz. Take, for example, the new Virtex-4 range of FPGAs from Xilinx. Manufactured using the latest 90nm processing technology, they provide users with 500 MHz XtremeDSP slices delivering an aggregate DSP performance of 256 GigaMACs per second. High accuracy Digital Clock Managers, reconfigurable synchronous dual-port static BRAM and FIFOs provide the clock management and memory resources required to implement high performance algorithms. Continuing the trend set by the previous generation of Virtex-II Pro FPGAs, the Virtex-4 features 32-bit RISC PowerPC processors delivering in excess of 1300 Dhrystone MIPS.
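The aggregate figure can be unpacked with a line of arithmetic. The sketch below assumes that each XtremeDSP slice completes one multiply-accumulate per clock cycle; that assumption and the function name are not stated in the text above.

```python
def macs_per_cycle(aggregate_macs_per_second, clock_hz):
    """Number of multiply-accumulates that must complete in every clock cycle."""
    return aggregate_macs_per_second / clock_hz


# 256 GigaMACs per second at a 500 MHz slice clock.
parallel_macs = macs_per_cycle(256e9, 500e6)
print(parallel_macs)  # 512.0
```

Under that assumption, the quoted performance corresponds to 512 multiply-accumulate units working in parallel on every cycle, which is the kind of parallelism a window operation like convolution can exploit.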
As a result, FPGAs are now being used as DSP engines. Although today’s DSP processors boast high levels of performance, they can’t compete against FPGAs for specialised computing. FPGAs can be configured with a custom hardware design, implementing control logic in the hardware, saving precious clock cycles per calculation.
Innovation and state of the art silicon processing techniques, such as those used for the new Virtex-4 range of devices, have dramatically improved the functionality and capability of FPGAs over the past six years, allowing them to be used in a wide variety of applications typically dominated by microprocessors or expensive and inflexible ASICs.
One of the difficulties for engineers and scientists wanting to use FPGA technology to improve the performance of their applications has been the lack of flexible, scalable COTS products supporting the latest FPGAs and design tools.
A number of modular standards used over the last few years have supported new generations of processing technologies, including FPGAs; however, they have limitations when used in real time processing applications. Firstly, there are those based around specific microprocessors, for example the TIM-40 from Texas Instruments and SHARCPAC from Analog Devices. The main difficulty with this category is that the system engineer has to constrain the capability of supported FPGAs in order to emulate a microprocessor interface, which restricts the superior I/O bandwidth of the FPGA. Secondly, there are the microprocessor-neutral module standards. One of the most popular is the PCI Mezzanine Card (PMC). Unfortunately, this is still principally designed with microprocessor based systems in mind. It is also, perhaps more seriously, based around a non-deterministic bus communications system with variable latency. This again implies constraining the FPGA to a less than optimum solution. In addition, significant parts of the FPGA real estate must be dedicated to handling the non-determinism. Within a real-time system it is critical that bandwidth and latency can be guaranteed. In practice, this cannot be done using this type of module.
In order to address these problems, and present a processing platform that truly exploits the strengths of FPGA technology, Nallatech has developed a range of COTS plug and play motherboards and modules supporting the latest Virtex-II and Virtex-II Pro FPGAs from Xilinx. The high-performance ‘DIME-II’ architecture is an open standard incorporating system level intelligence features such as temperature, voltage and current monitoring, and guarantees a module to motherboard bandwidth of up to 8 GBytes/sec (over 15 times the theoretical maximum performance of 64-bit/66 MHz PMC).
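The "over 15 times" comparison can be checked with a short calculation. This sketch assumes the PMC peak is simply bus width times clock rate with one transfer per cycle, and that 8 GBytes/sec means 8 x 10^9 bytes per second; the names are illustrative only.

```python
def parallel_bus_peak_bytes_per_s(width_bits, clock_hz):
    """Theoretical peak of a parallel bus moving one word per clock cycle."""
    return (width_bits // 8) * clock_hz


pmc_peak = parallel_bus_peak_bytes_per_s(64, 66_000_000)  # 64-bit bus at 66 MHz
dime2_peak = 8_000_000_000                                # 8 GBytes/sec, decimal giga assumed

print(pmc_peak)               # 528000000
print(dime2_peak / pmc_peak)  # roughly 15.15
```

The PMC figure is a theoretical ceiling on a shared, arbitrated bus, so the sustained gap in a real system would normally be wider than this ratio suggests.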
Turbulence monitoring
Nallatech recently undertook a project to design a complex multi-board, real time image processing system with a mass storage interface using FPGA technology. The application called for the system to be deployed on a commercial aircraft operating at high altitude, with a high-resolution camera being used to capture the effects of atmospheric turbulence. This raw data was to be processed, formatted and stored for later analysis.
The intention was to upgrade the system at a later date and use the high resolution video data to drive a decision engine that would control the aircraft's avionic systems. This would allow for a smoother flight and better fuel efficiency.
The size, weight and power constraints imposed by the operating environment immediately ruled out the use of certain types of form factors and technologies. The computing power required to process the high-resolution data in real time would have translated into multiple server racks of conventional CPUs – an impractical solution in this case.
The decision to use an FPGA-based processing system was taken early in the project. FPGAs were the only available technology offering the performance, speed and density for the task in hand. An ASIC solution was considered too expensive with lengthy development timescales and no flexibility, should user requirements change. The Nallatech BenNUEY-PC104+ DIME-II carrier card was selected as the main processing platform for the system. The PC104plus form factor satisfied the physical and mechanical constraints of the application, while the scalability and flexibility of the high bandwidth DIME-II architecture allowed the system to be tailored through the support of plug-and-play DIME-II COTS modules.
The optical interface to the aircraft’s high-resolution onboard camera was handled by a DIME-II module called the ‘BenHOTLINK’, which consisted of a Cypress Hotlink transceiver chip closely coupled to a Xilinx Virtex-II 6000 FPGA. The proximity of the FPGA to the front end of the system provided a reconfigurable, low latency processing block that was able to perform massively parallel DSP calculations. The embedded 18-bit multipliers and dual port BRAM of the Virtex-II device were an ideal resource for the image processing algorithms being used. The same functionality implemented using DSP processors would have resulted in an I/O bottleneck, and latencies preventing real time processing.
The Xilinx 2v6000 FPGA situated on the BenNUEY-PC104+ carrier card was used to format the processed image data from the BenHOTLINK FPGA. Eight Mbytes of fast access ZBT SRAM attached directly to the BenNUEY’s FPGA were used to buffer the data while it was serialised and transmitted over high speed LVDS links to a bank of four SCSI hard drives, providing a total storage capacity of one terabyte. The data capture and format section of the system operated at 80 MHz, with the serial links running at 200 MHz.
The majority of the design was written in VHDL in order to achieve the high levels of system performance required for real time operation. Samples of RTL code were simulated with testbenches using Aldec before being synthesised and implemented onto the hardware. Xilinx’s ChipScope ILA tool was used to capture and debug the more complicated, timing sensitive parts of the design.
Using existing COTS hardware to build the front end of the system and the secondary processing/data formatting section helped significantly reduce development timescales; however, a custom module was required to allow the direct interface of the SCSI hard drives.
A Xilinx 2v1000 FPGA was situated at the end of each of the high-speed LVDS links, with a 32-bit Xilinx MicroBlaze embedded processor programmed to deal with the asynchronous data transfer to and from the disk, as well as the packet handling and interpretation. The hardware/software partitioning allowed the SCSI interface to be implemented at low speed using asynchronous data transfers. Once this was working successfully for the specific data read and write packets, the actual writing and reading of the data from the disk was carried out using dedicated FPGA fabric to allow support for much faster synchronous data transfer modes. This is a perfect example of the capability and flexibility of FPGAs in embedded systems. This approach to hardware and software partitioning allowed the system to be implemented and tested on the target hardware far earlier than normal. The flexibility of the FPGAs allowed sections of the design to be optimised without physically altering the hardware, while the availability of the spare DIME-II module slots on the BenNUEY-PC104+ offered the customer the option of scaling the system to support additional SCSI disks.
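The partitioning described above can be pictured as a dispatcher that sends the bulk read and write traffic down a fast path and leaves everything else to a general-purpose handler. The Python sketch below is only a software model of that idea: in the real system the slow path was software running on the MicroBlaze and the fast path was dedicated FPGA logic. The opcode names, the `Packet` type and the handler functions are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical opcodes for the packets that carry bulk data to and from the disk.
FAST_PATH_OPCODES = {"READ", "WRITE"}


@dataclass
class Packet:
    opcode: str
    payload: bytes = b""


def handle_in_software(packet):
    """Stand-in for the embedded processor: general packet handling and interpretation."""
    return f"software:{packet.opcode}"


def handle_in_fabric(packet):
    """Stand-in for dedicated FPGA logic: high-rate synchronous data transfer."""
    return f"fabric:{packet.opcode}"


def dispatch(packet):
    """Route bulk data packets to the fast path and everything else to software."""
    if packet.opcode in FAST_PATH_OPCODES:
        return handle_in_fabric(packet)
    return handle_in_software(packet)


print(dispatch(Packet("WRITE", b"\x00" * 512)))  # fabric:WRITE
print(dispatch(Packet("INQUIRY")))               # software:INQUIRY
```

One appeal of this split is that the whole interface can first be brought up on the slow path alone, with opcodes moved into the fast-path set once the corresponding hardware is working.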
In the longer term, system performance can be improved by utilising the embedded IBM PowerPCs and Multi-Gigabit Transceivers featured in the Virtex-II Pro and Virtex-4 FX FPGAs from Xilinx. Instead of using a ‘softcore’ processor such as MicroBlaze (which uses FPGA logic resources such as BRAM), the embedded PowerPCs operating at 300 MHz would have provided a higher performance, fixed silicon solution. Furthermore, the LVDS links could be upgraded to SATA connections to each of the SCSI drives. The same principles of partitioning would have applied, with general packets handled in software on the PPC core and specific packets handled by dedicated FPGA logic written in an HDL.