FPGA for DSP: A JPEG Encoder Case Study

Scott Thibault, Green Mountain Computing Systems, Inc.

Introduction

This case study investigates the design of a JPEG encoder [1] on an FPGA target and compares that design to an encoder on a DSP target. It builds on a previous JPEG encoder study performed at UCLA [2]. In that study, several student teams were charged with designing a JPEG encoder, with each team using a different target platform or design tool. They present results for the following tools/platforms:

For this study, I started with an example dataflow implementation in C developed at UCLA and targeted an FPGA using the Impulse C design software from Impulse Accelerated Technologies [3]. Impulse C is a C-based hardware synthesis tool built on the Stage Master(TM) synthesis engine licensed from Green Mountain Computing Systems, Inc. The remaining sections present the Impulse C design flow, the JPEG implementation, performance results, and a comparison to the UCLA results on a DSP.

Impulse C Design Flow

Impulse C is an FPGA design tool that enables the user to develop high-level dataflow designs in C, and automatically synthesize the designs to VHDL for implementation on an FPGA. A design is described as a set of one or more processes, each implemented by a standard C procedure. The following diagram illustrates the Impulse C design flow when starting from legacy C code, as in this study.

First, the legacy code is partitioned into one or more processing blocks that will communicate with each other via streams, signals, or shared memory. Each processing block is represented by a C procedure. The blocks communicate using any of the stream, signal, or shared memory functions provided by the Impulse C library. A C configuration function is then written that creates the processes along with the streams, signals, and shared memory blocks that connect them together. Again, this is accomplished with functions provided by the Impulse C library. All of the processes, communication and configuration are described using standard C code, and the user can use any standard development environment to develop the code.
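The structure described above can be sketched in plain C. This is only an illustration of the dataflow style and is not Impulse C code: the stream_t type and the stream_init, stream_write, stream_read, producer, scaler, and configure_and_run names are hypothetical stand-ins for the library's stream functions, two processes, and a configuration function. For simplicity, the processes here run one after another rather than concurrently.

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_CAPACITY 64

/* Hypothetical stand-in for a library stream: a bounded FIFO of ints. */
typedef struct {
    int data[STREAM_CAPACITY];
    size_t head, tail, count;
} stream_t;

void stream_init(stream_t *s) { s->head = s->tail = s->count = 0; }

/* Returns 1 on success, 0 if the stream is full. */
int stream_write(stream_t *s, int value) {
    if (s->count == STREAM_CAPACITY) return 0;
    s->data[s->tail] = value;
    s->tail = (s->tail + 1) % STREAM_CAPACITY;
    s->count++;
    return 1;
}

/* Returns 1 and stores the oldest value, or 0 if the stream is empty. */
int stream_read(stream_t *s, int *value) {
    if (s->count == 0) return 0;
    *value = s->data[s->head];
    s->head = (s->head + 1) % STREAM_CAPACITY;
    s->count--;
    return 1;
}

/* Process 1: writes n input samples to its output stream. */
void producer(stream_t *out, const int *samples, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int ok = stream_write(out, samples[i]);
        assert(ok);
        (void)ok;
    }
}

/* Process 2: reads until its input is empty, doubles each value,
   and writes the result to its output stream. */
void scaler(stream_t *in, stream_t *out) {
    int v;
    while (stream_read(in, &v)) {
        int ok = stream_write(out, 2 * v);
        assert(ok);
        (void)ok;
    }
}

/* Configuration function: creates the streams, connects the two
   processes, runs them, and drains the final stream into results.
   Returns the number of values produced. */
size_t configure_and_run(const int *samples, size_t n, int *results) {
    stream_t a, b;
    size_t produced = 0;
    int v;
    assert(n <= STREAM_CAPACITY);
    stream_init(&a);
    stream_init(&b);
    producer(&a, samples, n);
    scaler(&a, &b);
    while (stream_read(&b, &v)) results[produced++] = v;
    return produced;
}
```

Calling configure_and_run with the samples 1, 2, 3, 4 fills the result buffer with 2, 4, 6, 8. Each process touches only its own streams, which is what makes it possible to retarget an individual process without changing the others.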

The next step is to compile and run the application to verify its behavior. Impulse C provides a desktop implementation of the Impulse C library, based on threads, that allows the application to be compiled and run using standard C/C++ development tools on the desktop.

The final step is to synthesize. In this step, each process is configured to target either software or hardware. The Impulse C software will synthesize the C code representing each hardware process into VHDL, as well as the necessary hardware/software interfaces that allow the hardware and software processes to communicate.

Implementation

For this study, I started with a reference design of the JPEG encoder written in C using a dataflow style. The reference design consisted of about 1000 lines of C code that was divided up into 11 processing blocks. As described in the previous section, the first step then was to modify the existing code to represent Impulse C processes, use the Impulse C library for communication, and write a configuration function. The algorithm was already partitioned into individual processing blocks that were applied in sequence to each block of the image, so the translation was not that difficult. In this case, step one involved the following:
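As a rough sketch of what "processing blocks applied in sequence to each block of the image" means, the following plain C code walks an image in 8x8 tiles and runs a chain of stages on each tile. The names are hypothetical and the stages are placeholders, not the 11 blocks of the reference design: level_shift centers 8-bit samples around zero, and halve simply stands in for the later stages. The sketch assumes the image width and height are multiples of 8.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_DIM 8
#define BLOCK_SIZE (BLOCK_DIM * BLOCK_DIM)

/* A processing stage transforms one 8x8 block in place. */
typedef void (*stage_fn)(int block[BLOCK_SIZE]);

/* Centers 8-bit samples around zero (0..255 becomes -128..127). */
void level_shift(int block[BLOCK_SIZE]) {
    for (int i = 0; i < BLOCK_SIZE; i++) block[i] -= 128;
}

/* Placeholder standing in for later stages: halves each value. */
void halve(int block[BLOCK_SIZE]) {
    for (int i = 0; i < BLOCK_SIZE; i++) block[i] /= 2;
}

/* Copies each 8x8 tile out of the image, runs the stages on it in
   order, and stores the result in out (same layout as the image).
   Returns the number of 8x8 blocks processed. */
size_t encode_blocks(const unsigned char *image, size_t width, size_t height,
                     const stage_fn *stages, size_t n_stages, int *out) {
    size_t blocks = 0;
    assert(width % BLOCK_DIM == 0 && height % BLOCK_DIM == 0);
    for (size_t by = 0; by < height; by += BLOCK_DIM) {
        for (size_t bx = 0; bx < width; bx += BLOCK_DIM) {
            int block[BLOCK_SIZE];
            for (size_t y = 0; y < BLOCK_DIM; y++)
                for (size_t x = 0; x < BLOCK_DIM; x++)
                    block[y * BLOCK_DIM + x] = image[(by + y) * width + bx + x];
            for (size_t s = 0; s < n_stages; s++)
                stages[s](block);
            for (size_t y = 0; y < BLOCK_DIM; y++)
                for (size_t x = 0; x < BLOCK_DIM; x++)
                    out[(by + y) * width + bx + x] = block[y * BLOCK_DIM + x];
            blocks++;
        }
    }
    return blocks;
}
```

Because every stage has the same signature and works on one block at a time, each one maps naturally onto a separate process connected to its neighbors by streams.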

Step 1 took less than a day. The next step was to select a set of blocks that would be targeted to hardware. In this case, I decided to implement 8 of the 11 processes in hardware. Only the blocks that read the source image from the file and prepare it for processing, and the block that writes the JPEG image to a file were not synthesized to hardware.

Not all C constructs can be synthesized, so some minor modifications were required to synthesize the 8 blocks as follows:

The resulting C code for the hardware processes consisted of about 100 lines of code. Step 2 took about two hours and I was able to verify the results using cycle simulation. The cycle simulation feature of Impulse C allows the user to validate the hardware synthesis by converting the VHDL result back into C and compiling it into the original application for execution on a desktop.

The final step was to compile the generated VHDL code using the Xilinx synthesis tools to obtain size and performance results. This step took about an hour and involved the following:

After completing the design, I was not satisfied with the size of the design and recognized that three of the processing blocks could be merged into a single block performing two passes over the data. It took about another 1.5 hours to modify the code, debug/verify with cycle simulation and re-run the Xilinx tools.

Results

The previous section describes in some detail how I created the FPGA JPEG encoder using Impulse C. The final result had 15 hardware blocks running in parallel and connected via 24 data streams. The design was synthesized using Xilinx EDK 6.1 for the XC2VP20FF1152-6 FPGA. The design took 7725 of the 9280 available slices on the device and had a maximum operating frequency of 50MHz. I also performed some VHDL simulation to determine that the design requires about 78,000 cycles to encode the test image used in the UCLA study. Therefore, the average performance is about 41,000 blocks (8x8 pixels) per second.
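The throughput and utilization figures above follow from simple arithmetic, sketched here in C. The function names are hypothetical. The number of 8x8 blocks in the test image is not stated in this document; the value of 64 used in the usage note is an assumption, chosen because it reproduces the reported throughput from the quoted clock frequency and the approximate cycle count.

```c
#include <assert.h>

/* Blocks encoded per second, given the clock frequency in Hz, the
   cycles needed per image, and the number of 8x8 blocks per image. */
double blocks_per_second(double clock_hz, double cycles_per_image,
                         double blocks_per_image) {
    assert(cycles_per_image > 0.0);
    return clock_hz / cycles_per_image * blocks_per_image;
}

/* Share of a device resource in use, as a percentage. */
double utilization_percent(double used, double available) {
    assert(available > 0.0);
    return 100.0 * used / available;
}
```

With a 50 MHz clock, 78,000 cycles per image, and an assumed 64 blocks per image, blocks_per_second gives roughly 41,000 blocks per second, in line with the figure above. Likewise, utilization_percent(7725, 9280) shows that the design occupies about 83% of the slices on the device.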

The following table compares the results of this design with the results reported by the UCLA students and with extrapolated results.

Platform              JPEG Performance (blocks/second)
FPGA (Impulse C)      41025
Blackfin (750MHz)**   31505
Blackfin (300MHz)*    12602
TI C5410 (160MHz)*    6835
FPGA (DK1)*           1357
FPGA (A|RT)*          N/A

* Performance reported in [2].
** Performance estimated from performance reported in [2] for the highest frequency available today.

The extrapolated results are obtained by taking the cycle counts reported in [2] and scaling them linearly to today's DSP clock frequencies. Of course, actual performance does not necessarily scale linearly with clock frequency. Newer processors may also have other features, such as more parallel execution units, that might increase performance.
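The linear scaling can be written out as a short C sketch. The function names are hypothetical, and the calculation assumes, as the extrapolation does, that the cycle count per block stays the same at the higher clock frequency.

```c
#include <assert.h>

/* Scales a measured throughput linearly from the clock frequency it
   was measured at to a target clock frequency (same units for both). */
double scale_by_clock(double blocks_per_s, double measured_mhz,
                      double target_mhz) {
    assert(measured_mhz > 0.0);
    return blocks_per_s * (target_mhz / measured_mhz);
}

/* Ratio of one throughput to another. */
double speedup(double throughput, double baseline) {
    assert(baseline > 0.0);
    return throughput / baseline;
}
```

Using the table's figures, scale_by_clock(12602, 300, 750) gives 31,505 blocks per second, which matches the estimated 750 MHz Blackfin row. By the same figures, the FPGA result is about 3.3 times the measured 300 MHz Blackfin result and about 1.3 times the 750 MHz estimate.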

Conclusion

In this study, I took existing legacy C code implementing a JPEG encoder and created an FPGA design using the Impulse C synthesis tool. The entire project took less than two days. The resulting design is roughly equivalent in terms of the number of lines of code, and its performance is more than 3x the fastest performance obtained with a DSP in [2]. The results show that, with little effort, one can obtain reasonable performance from an FPGA and even outperform some of the fastest DSPs available. These results were obtained with almost no optimization. Possible optimizations are currently under study and will be reported here when available.

References

[1] ITU-T Recommendation T.81, The JPEG Standard.

[2] K. Sakiyama, P. Schaumont, I. Verbauwhede, "Finding the best system design flow for a high-speed JPEG encoder," Asia and South Pacific Design Automation Conference (ASP-DAC 2003), pp. 577-578, Kitakyushu, Japan, January 2003.

[3] Impulse C, Impulse Accelerated Technologies.