Joe Mallett, Senior Product Line Manager, Xilinx Corporation
Akos Zarandy, Co-CTO and Vice President, Eutecus
The processing bandwidth requirements of a wide range of security analytics applications are forcing companies to rethink their approach to system hardware design. A single video and image DSP can no longer perform some computationally intensive analysis operations at acceptable data rates, and no such device offers a robust, reliable way to handle high-definition (HD) video at full frame rate. This forces system engineers to consider either multi-chip systems or alternative single-chip systems. Both approaches have advantages and disadvantages.
A multi-chip system built from several DSPs generally gives designers a more familiar design flow, but it increases PCB cost, takes up board and system-level space, and may introduce system performance issues. Single-chip solutions, on the other hand, appear to have advantages in cost, packaging, and power consumption, but they can carry hidden costs: a steeper learning curve for designers, greater project complexity and engineering cost, and potentially delayed product releases.
That’s the challenge that Berkeley, California-based video analytics company Eutecus encountered when developing its next-generation security analytics product, the Multicore Video Analytics Engine (MVE™).
Our first-generation products are based on Texas Instruments’ (TI) DaVinci digital media system-on-chip (SoC) platform. For the second generation, we needed more processing power and a higher level of system integration. We quickly discovered that a solution with multiple DSP devices was not effective in terms of cost or at the system level. We needed a single-chip solution to which we could easily port the previous-generation design and that would provide more features for our second-generation MVE.
After some research, we found the Spartan®-3A DSP 3400A from Xilinx. The device offers 126 dedicated XtremeDSP® DSP48A slices, which provide enough performance to meet our system requirements at an attractive price.
Our concerns about design porting quickly disappeared when we learned that the Xilinx Embedded Development Kit (EDK) supports the Spartan-3A DSP. With the EDK, we could implement a dual-processor hardware architecture based on Xilinx’s MicroBlaze® embedded processor, similar to the dual-processor architecture of TI’s DaVinci platform.
Once the device was selected, we began porting the existing DaVinci-based code to the Xilinx dual-processor embedded system to create a single-chip video security analytics design. We then created as many accelerator modules in the FPGA fabric as were needed to meet the performance requirements, which include processing high-resolution video at full frame rate. The result is the second generation of the MVE system, which has now been sold successfully into the aerospace/defense, machine vision and surveillance markets.
Video Analytics Product Brief
The Multicore Video Analytics Engine (MVE) is based on InstantVision Embedded® software and a dedicated C-MVA® coprocessor, and it provides many advanced features.
The latest version of MVE/C-MVA is capable of processing high-resolution video at full frame rate. It consumes less than 1 watt and can execute multiple event detection and classification algorithms fully in parallel. Figure 1 shows an example of video analytics output in a traffic monitoring application: vehicle type, traffic direction, lane changes, and illegal lane changes are all classified concurrently and marked in different colors.

The C-MVA coprocessor is designed so that its computational capacity can be scaled to support analysis functions in dense object spaces, where overlapping objects and the processing of incomplete objects/events are particularly challenging. Dedicated DSPs offer poor support for this kind of analysis and limited computational scalability. FPGAs offer more flexibility in both respects.
The 126 XtremeDSP DSP48A slices in the Spartan-3A DSP 3400A FPGA can provide up to 30 GMACs of DSP performance, which is enough to meet the demanding cost and performance requirements of video analytics applications. Xilinx FPGAs also allow us to add more video analytics capabilities and related event detection cases based on customer needs, as summarized in Table 1.
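As a rough sanity check on the figure of about 30 GMACs, peak multiply-accumulate throughput can be estimated as the number of DSP slices multiplied by the clock rate, assuming each slice completes one MAC per cycle. The 250 MHz clock below is an assumption that is not stated in this article, and the function name is purely illustrative.

```python
def peak_gmacs(dsp_slices: int, clock_mhz: float) -> float:
    """Peak throughput in GMAC/s, assuming one MAC per slice per clock cycle."""
    return dsp_slices * clock_mhz / 1000.0


# Assumption: DSP48A slices clocked at 250 MHz (not given in the article).
ASSUMED_CLOCK_MHZ = 250.0

estimate = peak_gmacs(126, ASSUMED_CLOCK_MHZ)
print(f"{estimate:.1f} GMAC/s")
```

Under that assumption the estimate lands slightly above 30 GMAC/s. This is a theoretical peak; sustained throughput depends on how well the algorithm keeps every slice busy.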

Additionally, Xilinx FPGAs and the ISE® Design Suite tools give video analytics design groups greater flexibility in customizing solutions for end customers. By rapidly prototyping standard- and high-resolution video processing, we can quickly customize video analytics engines and system-on-chip (SoC) solutions. Depending on customer needs, this allows us to make more efficient use of the resources available in the Spartan-3A DSP 3400A or in the lower cost Spartan-3A DSP 1800A FPGA.
Another benefit of FPGA solutions is that many different derivatives can be created from the same hardware platform. Because we designed the various analysis accelerator engines in VHDL, these specialized cores can be integrated into the C-MVA coprocessor. This approach allows engineers to reuse the dual-MicroBlaze embedded system and create different FPGA programming files, resulting in a highly scalable solution that can be easily adapted to a wide range of video analytics applications.
Migrating from DaVinci to Xilinx FPGAs
Our previous generation of video analytics products was based on the TI DaVinci digital media SoC, the TMS320DM6446. The chip includes an ARM9x processor and a C64x+ DSP coprocessor. In that design, we used the ARM9x for communication and control, and the C64x+ for the DSP processing of the analysis algorithms. However, the combination of the two still fell short of the processing performance required by our second-generation product. We therefore turned to the Spartan-3A DSP FPGA family.
By creating a Xilinx embedded system with two independently running MicroBlaze v7 soft-core processors, we simplified the task of design migration.
This architecture allows us to port the ARM and DSP processor code separately, which greatly simplifies the design porting process. Figure 2 shows a block diagram of the Eutecus hardware system, as well as an MVE-based reference SoC design.

Our MVE engine includes the InstantVision embedded software running on one MicroBlaze (MB0), the system control and communication sections running on the other MicroBlaze (MB1), and the C-MVA coprocessor. The C-MVA coprocessor is a chain of hardware accelerator IP core modules running on the FPGA fabric.
Using the ISE Design Suite and the MicroBlaze soft core, porting our ARM and DSP code was fairly straightforward. A notable advantage is that the InstantVision cross-platform environment is written in high-level standard C/C++ and required only minimal modifications.
Once the code was ported, we verified its functional correctness and identified performance bottlenecks. Optimizing and accelerating the C/C++ code developed for the original TI processors turned out to be a significant challenge, because several modules for the DaVinci C64x+ coprocessor had been optimized at the assembly level during the development of that platform. We converted them in two steps: first, we rewrote the modules as high-level C functions; then, we replaced most of their functionality with functionally equivalent accelerator modules running on the FPGA fabric.
From a functional point of view, the MVE solution consists of three layers that receive a standard- or high-resolution video stream as input and generate event detection metadata. The generated metadata supports object/event tracking and classification, and some image streams are also output as analysis results for debugging purposes. Our functional modules are implemented either as embedded software running on the MicroBlaze processors or as dedicated IP cores. We place these dedicated hardware accelerators in the FPGA fabric, and the accelerator chain they form constitutes the C-MVA analytics coprocessor.
As shown in Figure 3, the three algorithm layers of the MVE video analytics engine include several main functional modules. These functional blocks can be greatly accelerated using dedicated IP cores that are dynamically configured using the resources available in the FPGA. The design of the C-MVA coprocessor is based on these IP cores, as is the acceleration of the front-end and mid-layer of the entire analysis algorithm (see Figure 4). The modular design approach supported by the Xilinx ISE Design Suite lets us scale the system in both performance and power consumption.


Supercharging with FPGA Accelerator Blocks
To realize the full potential of an FPGA-based video analytics system, we needed to integrate video acceleration engines into the embedded system. We foresaw several performance bottlenecks, so the design team started developing a set of accelerators in VHDL early on. The code profiler, part of the Xilinx ISE Design Suite and Embedded Development Kit (EDK), helped us identify further performance bottlenecks and develop all the accelerator blocks needed for the design. Table 2 provides a comprehensive list of the IP core family.

Like many other development groups, ours consists of both hardware and software developers. Retaining enough abstraction between these two design domains is critical to maintaining developer productivity and to the success of the project. The Create IP Wizard in Xilinx Platform Studio eases this task by generating RTL modules and software driver files for the hardware acceleration modules.
These modules include the interface logic required to access registers, DMA logic and FIFOs in the embedded system. Once the wizard has created the RTL, we place the module in the embedded IP catalog, where the designer can modify it further as needed.
Our IP core development flow includes a common, standard peripheral development flow for the PLB46-MPMC-OPB-based backbone. These peripherals include single- and multiple-input/output prototypes (SIMO, MIMO and MISO models), allowing us to flexibly create multi-threaded coprocessor pipelines for demanding image stream processing algorithms. We do this by combining and configuring the IP cores in nearly arbitrary order during the design and customization of different analytics engines.
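The idea of chaining blocks with different numbers of input and output streams can be sketched as a small software model. This is only an illustration of the composition principle, not the actual C-MVA hardware or its IP cores; the `Stage` class, `run_chain` function and the three example stages are hypothetical names chosen for this sketch.

```python
from typing import Callable, List, Sequence, Tuple

Stream = List[int]  # a flat list of pixel values stands in for an image stream


class Stage:
    """A processing block with a fixed number of input and output streams."""

    def __init__(self, name: str, n_in: int, n_out: int,
                 fn: Callable[..., Tuple[Stream, ...]]):
        self.name, self.n_in, self.n_out, self.fn = name, n_in, n_out, fn

    def __call__(self, inputs: Sequence[Stream]) -> List[Stream]:
        if len(inputs) != self.n_in:
            raise ValueError(f"{self.name}: expected {self.n_in} input stream(s)")
        outputs = self.fn(*inputs)
        if len(outputs) != self.n_out:
            raise ValueError(f"{self.name}: expected {self.n_out} output stream(s)")
        return list(outputs)


def run_chain(stages: Sequence[Stage], inputs: Sequence[Stream]) -> List[Stream]:
    """Feed the outputs of each stage into the next one, in order."""
    streams = list(inputs)
    for stage in stages:
        streams = stage(streams)
    return streams


# Hypothetical example stages: SIMO (1 in, 2 out), MIMO (2 in, 2 out), MISO (2 in, 1 out).
split = Stage("split", 1, 2, lambda f: (f, [1 if p > 128 else 0 for p in f]))
scale = Stage("scale", 2, 2, lambda f, m: ([p // 2 for p in f], m))
mask = Stage("mask", 2, 1, lambda f, m: ([p * b for p, b in zip(f, m)],))

result = run_chain([split, scale, mask], [[10, 200, 130, 90]])
print(result)
```

Because each stage declares its stream counts, a mismatched ordering (for example, placing `mask` directly after itself) is rejected when the chain runs, which mirrors the constraint that adjacent cores must agree on their interfaces.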
The MVE analytics engine consists of InstantVision embedded software modules and the hardware accelerators that make up the C-MVA analytics coprocessor. We prototyped the MVE in a Xilinx Spartan-3A DSP 3400A FPGA and created an SoC reference design that includes all the I/O functions required for communication and data flow (refer to Figure 2 for a complete hardware-firmware block diagram). This complete SoC reference design uses 91% of the slice resources, 81% of the block RAM and 32% of the DSP slices, and it includes not only the MVE analytics engine but also all supporting I/O blocks.
The MVE analytics engine alone (excluding the MPMC-PLB backbone and dedicated I/O components) uses only 46% of the logic slices, 44% of the block RAM and 23% of the DSP slices, so it can be ported to the lower cost Spartan-3A DSP 1800A FPGA.
Every IP core in the C-MVA coprocessor completes all of its related processing in a single clock cycle. This capability, combined with the asynchronous FSL interface, enables system integrators to drive the C-MVA coprocessor from a different clock domain than the rest of the system. The C-MVA can then operate at the lower pixel clock frequency while the higher-frequency internal system clock drives the backbone, which greatly reduces power consumption while still meeting system performance requirements.
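The claim that the engine fits in the smaller device can be checked with simple proportional arithmetic. The sketch below rescales the 3400A utilization percentages quoted above to the 1800A. The device capacities are assumptions recalled from the Spartan-3A DSP family documentation and should be verified against the datasheet; the names used are illustrative.

```python
# Assumed device capacities (logic slices, block RAMs, DSP48A slices).
# These numbers are not given in the article; verify them against the datasheet.
CAPACITY = {
    "3400A": {"slices": 23872, "bram": 126, "dsp": 126},
    "1800A": {"slices": 16640, "bram": 84, "dsp": 84},
}

# Utilization of the MVE engine alone on the 3400A, in percent (from the text).
MVE_ON_3400A = {"slices": 46.0, "bram": 44.0, "dsp": 23.0}


def rescale_utilization(util_pct, source, target):
    """Convert percent utilization on `source` into percent utilization on `target`."""
    return {
        res: pct * CAPACITY[source][res] / CAPACITY[target][res]
        for res, pct in util_pct.items()
    }


on_1800a = rescale_utilization(MVE_ON_3400A, "3400A", "1800A")
fits = all(pct < 100.0 for pct in on_1800a.values())

for res, pct in on_1800a.items():
    print(f"{res}: {pct:.1f}%")
print("fits:", fits)
```

With these assumed capacities, the engine would occupy roughly two thirds of the slices and block RAM and about one third of the DSP slices of the 1800A. A percentage check like this ignores routing congestion and timing closure, so it indicates feasibility rather than guaranteeing a successful fit.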
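To see why a core that handles one pixel per clock can run from a comparatively slow clock, it helps to estimate the pixel rate of the incoming stream. The sketch below assumes a 1920x1080 stream at 30 frames per second and counts active pixels only; neither the resolution nor the frame rate is specified in this article, and the function name is illustrative.

```python
def min_pixel_clock_mhz(width: int, height: int, fps: float,
                        pixels_per_clock: int = 1) -> float:
    """Lowest clock (MHz) that keeps up with the active pixels of a video stream."""
    return width * height * fps / pixels_per_clock / 1e6


# Assumption: 1920x1080 at 30 frames per second, blanking intervals ignored.
hd_clock = min_pixel_clock_mhz(1920, 1080, 30)
print(f"{hd_clock:.1f} MHz")
```

Under these assumptions the coprocessor needs a clock of only about 62 MHz to keep pace with the stream. A real video interface also clocks the blanking intervals, so the actual pixel clock is somewhat higher than this lower bound.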
Customization, Packaging and System Integration
To validate and further develop this system, we created a security/monitoring application that includes all software layers, allowing users to quickly integrate our products at different levels of the system. The complete SoC design combines hardware IP cores, firmware, and software in a single reference design (see Figure 5).

We can flexibly customize the hardware, firmware and software components at different levels to suit the system integration. Server-level customization includes the customizable SoC design in the FPGA, while at the client (configuration) level, modifications can be made at the WIN32 or .NET API layer. This architecture allows us and our customers to quickly prototype different configurations and test interfaces. Users can implement client-server (C/S) communication over UART or TCP/IP, which provides flexible configuration management, performance fine-tuning, status monitoring and firmware upgrades.
Although the second generation has only just been completed, we are already considering the requirements for the third. Based on the experience gained in this project, we will focus on Xilinx FPGA devices in the next generation of products, especially as Xilinx is committed to using the most advanced process technology to introduce more reliable and more advanced devices and DSP functions.