Hardware Implementation of Truncated Multipliers Using Spartan-3AN, Virtex-4 and Virtex-5 FPGA Devices

Problem statement: Because development costs for Application Specific Integrated Circuits (ASICs) are high, algorithms should be verified and optimized before implementation. Digital Signal Processing (DSP), image processing and multimedia require extensive use of multiplication. Truncated multipliers can easily be implemented using Field Programmable Gate Array (FPGA) devices. Approach: This research presents a comparative study of Spartan-3AN, Virtex-4 and Virtex-5 FPGA devices. The standard and truncated multipliers were implemented using Very high speed integrated circuit Hardware Description Language (VHDL). Results: A substantial reduction in FPGA resources, delay and power was achieved by using truncated multipliers instead of standard parallel multipliers when the full precision of the standard multiplier is not required. All three devices showed significant improvement for truncated multipliers compared with standard multipliers. The anomaly in the Spartan-3AN average connection delay and maximum pin delay was efficiently reduced in the Virtex-4 and Virtex-5 devices. Conclusion: The Virtex-5 FPGA device showed better performance than the Spartan-3AN and Virtex-4 FPGA devices.


INTRODUCTION
Multiplication is a core operation in many algorithms used in scientific computation, such as Digital Signal Processing (DSP). Over the years, the computational complexity of the algorithms used in Digital Signal Processors (DSPs) has gradually increased. DSP therefore requires fast and efficient parallel multipliers for general-purpose as well as application-specific architectures.
In many cases, implementing a DSP algorithm demands the use of Application Specific Integrated Circuits (ASICs). In particular, if the processing has to be performed under real-time conditions, such algorithms have to deal with high throughput rates. This is especially true of image processing applications. Since development costs for ASICs are high, algorithms should be verified and optimized before implementation.
Nevertheless, Very Large Scale Integration (VLSI) technology has matured to the stage where a hardware implementation has become a desirable alternative. Significant speedup in computation time can be achieved by assigning computation-intensive tasks to hardware and by exploiting the parallelism in algorithms. Recently, Field Programmable Gate Arrays (FPGAs) have emerged as a platform of choice for efficient hardware implementation of computation-intensive algorithms (Maxfield, 2004). FPGAs combine the speed of hardware with the flexibility of software. They enable a high degree of parallelism and can achieve orders-of-magnitude speedup over General Purpose Processors (GPPs), a result of the increasing embedded resources available on FPGAs. The three main factors that play an important role in FPGA-based design are the targeted FPGA architecture, the Electronic Design Automation (EDA) tools and the design techniques employed at the algorithmic level using hardware description languages. In FPGAs, the choice of the optimum multiplier involves three key factors: area, propagation delay and reconfiguration time. FPGAs have therefore become a viable technology and an attractive alternative to ASICs (Maxfield, 2004; Todman et al., 2005).
Applications such as DSP, image processing and multimedia require extensive use of multiplication and squaring functions (Sheu and Lin, 2002; Walters et al., 2003). A full-width digital n×n multiplier computes the 2n-bit output as a weighted sum of partial products (Baugh and Wooley, 1973). If the product is truncated to n bits, the least-significant columns of the product matrix contribute little to the final result. To take advantage of this, truncated multipliers and squarers do not form all of the least-significant columns in the partial-product matrix (Stine and Duverne, 2003; Swartzlander, 1999). As more columns are eliminated, the area and power consumption of the arithmetic unit are significantly reduced and in many cases the delay also decreases. However, truncating the multiplier matrix introduces additional error into the computation.
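The column-dropping idea can be illustrated with a small bit-level model. The following Python sketch is only an arithmetic illustration, not the VHDL design used in this study; the function names and the parameter k (the number of least-significant columns that are not formed) are assumptions introduced here.

```python
def partial_products(x, y, n):
    """Yield (column, bit) for every partial product x_i AND y_j of two n-bit inputs."""
    for i in range(n):
        for j in range(n):
            yield i + j, ((x >> i) & 1) & ((y >> j) & 1)


def standard_multiply(x, y, n):
    """Full-width product: weighted sum of all n*n partial products."""
    return sum(bit << col for col, bit in partial_products(x, y, n))


def truncated_multiply(x, y, n, k):
    """Sum only the partial products in columns >= k.

    The k least-significant columns are never formed, which models the
    hardware saving at the cost of a truncation error.
    """
    return sum(bit << col for col, bit in partial_products(x, y, n) if col >= k)


def max_truncation_error(n, k):
    """Exhaustively find the largest error caused by dropping k columns."""
    return max(
        standard_multiply(x, y, n) - truncated_multiply(x, y, n, k)
        for x in range(1 << n)
        for y in range(1 << n)
    )
```

For k ≤ n the worst case occurs when both inputs are all ones, because every dropped partial product is then 1; column c contains c + 1 partial products, so the maximum error is the sum of (c + 1)·2^c over the dropped columns.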
Cryptography applications also require not only a significant number of multiplication and squaring operations but also large integers (Stallings, 2006). An efficient realization of the multiplication can have a significant impact on such applications in terms of speed, power dissipation and area. Many research efforts in the literature have sought hardware-efficient implementations of truncated multipliers. The basic idea of these techniques is to discard some of the less significant partial products and to introduce a compensation circuit that partly compensates for the dropped terms, thereby reducing the approximation error (Jou et al., 1999; Van et al., 2000; Kidambi et al., 1996; Strollo et al., 2005; Lim, 1992; Kuang and Wang, 2006). Garofalo et al. (2008) presented a truncated multiplier with minimum square error for every input bit width. Rais (2009a; 2009b) presented the design and implementation of fixed-width standard and truncated multipliers using FPGA devices.
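The effect of a compensation term can be sketched with the simplest possible scheme, a fixed constant equal to the rounded mean value of the dropped columns. This is an assumed illustration in Python and is not claimed to be the compensation circuit of any of the works cited above; it is an arithmetic model only, and a hardware design would quantize the constant to the retained columns. All names are hypothetical.

```python
from fractions import Fraction


def truncated_multiply(x, y, n, k, correction=0):
    """Sum partial products x_i*y_j in columns i+j >= k, plus a fixed constant."""
    total = correction
    for i in range(n):
        for j in range(n):
            if i + j >= k:
                total += (((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
    return total


def expected_dropped_value(n, k):
    """Mean value of the k dropped columns (k <= n) for uniformly random inputs.

    Column c holds c + 1 partial products, each equal to 1 for a quarter of
    all input pairs, so the column contributes (c + 1) / 4 * 2**c on average.
    """
    if k > n:
        raise ValueError("this sketch only handles k <= n")
    return sum(Fraction(c + 1, 4) * 2**c for c in range(k))


def mean_error(n, k, correction=0):
    """Exact mean of (true product - approximation) over all input pairs."""
    size = 1 << n
    total = sum(
        x * y - truncated_multiply(x, y, n, k, correction)
        for x in range(size)
        for y in range(size)
    )
    return Fraction(total, size * size)
```

For a 6×6-bit multiplier with four dropped columns, `expected_dropped_value(6, 4)` is 49/4; using `correction = 12` reduces the mean error from 49/4 to 1/4 in units of the least-significant bit of the full product.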
Truncated multiplication provides an efficient method for reducing the power dissipation and area of a rounded parallel multiplier. The high-speed multiplication desired in DSP is normally achieved by parallel processing and pipelining, and truncation can increase the gain further. The objective of this study is to present a comparative study of truncated and standard multipliers using Spartan-3AN, Virtex-4 and Virtex-5 FPGA devices.

Mathematical basis of truncated multipliers:
Considering the multiplication of two n-bit inputs X and Y, a standard multiplier performs the following operation to obtain the 2n-bit product P:

$$P = \sum_{i=0}^{2n-1} P_i 2^i = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} x_i y_j 2^{i+j} \qquad (1)$$

where $x_i$, $y_i$ and $P_i$ represent the ith bit of X, Y and P, respectively. Figure 1 shows the standard architecture of a 6×6-bit parallel multiplier, where HA and FA are the half and full adders, respectively. Equation 1 can be expressed as the sum of two segments, the most-significant part MP and the least-significant part LP:

$$P = MP + LP \qquad (2)$$

The standard 6×6-bit parallel multiplier can also be divided into three subsets: the most-significant part MP, the input correction IC and the least-significant part LP. Equation 2 can be rewritten accordingly (Eq. 3).

Fig. 1: The architecture of a standard 6×6-bit parallel multiplier (cell types: HA, FA)

The fixed-width multiplier can be obtained directly by removing the LP region and introducing the IC region to obtain the MP' region, which is the truncated multiplier shown in Fig. 2 and given by Eq. 4.

Architecture platform:
Owing to their parallel nature, high frequency and high density, modern FPGAs make an ideal platform for the implementation of computationally intensive and massively parallel architectures. A brief introduction to state-of-the-art FPGAs from Xilinx is presented below.
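Returning briefly to the MP/LP decomposition of the mathematical basis above, the split can be checked numerically for the 6×6-bit case. The Python sketch below assumes that LP collects the partial products in columns i + j < n and MP the remaining ones; this boundary, the omission of the IC region and the function names are assumptions made for illustration only.

```python
N = 6  # operand width of the 6x6-bit example


def split_product(x, y, n=N):
    """Return (MP, LP): partial products in columns >= n and in columns < n."""
    mp = lp = 0
    for i in range(n):
        for j in range(n):
            term = (((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
            if i + j >= n:
                mp += term
            else:
                lp += term
    return mp, lp


def fixed_width_product(x, y, n=N):
    """n-bit output built from MP alone: LP removed, no input correction added."""
    mp, _ = split_product(x, y, n)
    return mp >> n


def decomposition_holds(n=N):
    """Check P = MP + LP for every pair of n-bit inputs."""
    return all(
        sum(split_product(x, y, n)) == x * y
        for x in range(1 << n)
        for y in range(1 << n)
    )
```

For x = y = 63 the uncorrected fixed-width output is 57, whereas the upper six bits of the exact product 3969 are 62, which illustrates why a correction region is introduced when LP is removed.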

Spartan-3 FPGAs:
The Spartan-3 FPGA belongs to the fifth-generation Xilinx family. It is specifically designed to meet the needs of high-volume, low-unit-cost electronic systems. The family consists of eight members offering densities ranging from 50,000 to five million system gates (Xilinx, 2008a). The Spartan-3 FPGA consists of five fundamental programmable functional elements: Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Block RAMs, dedicated 18×18 multipliers and Digital Clock Managers (DCMs). The Spartan-3 family includes the Spartan-3L, Spartan-3E, Spartan-3A, Spartan-3A DSP, Spartan-3AN and the extended Spartan-3A FPGAs. The Spartan-3AN is used as a target technology in this study. It combines all the features of the Spartan-3A FPGA family with in-system flash memory for configuration and nonvolatile data storage.

Virtex-4 FPGAs:
Virtex-4 FPGAs are produced on a state-of-the-art 90 nm copper process using 300 mm wafer technology (Xilinx, 2007). The family consists of three platform families: LX, SX and FX. Virtex-4 hard-IP core blocks include the IBM PowerPC (PPC) 405 32-bit Reduced Instruction Set Computer (RISC) processor, tri-mode Ethernet Media Access Controllers (MACs), 622 Mbps to 6.5 Gbps serial transceivers, dedicated DSP slices and high-speed clock management circuitry.
Virtex-4 devices consume approximately 50% of the power of the corresponding Virtex-II Pro devices, owing to static and dynamic power reductions enabled by triple-oxide technology and by reduced core voltage and capacitance, respectively. The Virtex-4 FPGA family comprises CLBs, Block RAMs, XtremeDSP slices and DCMs.

Virtex-5 FPGAs:
The Virtex-5 devices, built on a state-of-the-art 65 nm copper process technology, are a programmable alternative to custom ASIC technology. The Virtex-5 LX platform also contains many hard-IP system-level blocks, including Block RAM/First In First Out (FIFO), second-generation 25×18 DSP slices, SelectIO technology with built-in digitally controlled impedance, ChipSync source-synchronous interface blocks, enhanced clock management tiles with integrated DCM and Phase Locked Loop (PLL) clock generators, and advanced configuration options.
In addition to the regular programmable functional elements, the Virtex-5 family provides power-optimized high-speed serial transceiver blocks for enhanced serial connectivity, tri-mode Ethernet MACs and high-performance PPC 440 microprocessor embedded blocks. Virtex-5 devices also use triple-oxide technology to reduce static power consumption. Their 1.0 V core voltage and 65 nm implementation process also reduce dynamic power consumption compared with Virtex-4 devices.
Advanced DSP48E slices are available in Virtex-5 FPGAs to help accelerate computation-intensive DSP and image processing algorithms. These slices can operate at a maximum frequency of 550 MHz while drawing only 1.38 mW of power at 100 MHz (Xilinx, 2008b).

RESULTS
The numbers of four-input Look-Up Tables (LUTs) used by the standard and truncated multipliers on the Spartan-3AN, Virtex-4 and Virtex-5 FPGA devices are shown in Fig. 4. The same trend is observed here as for the number of occupied slices on all the devices. Figure 5 illustrates the total equivalent gate count obtained for standard and truncated multipliers on the Spartan-3AN, Virtex-4 and Virtex-5 FPGA devices.

DISCUSSION
The Virtex-5 FPGA device shows better performance than the Virtex-4 and Spartan-3AN FPGA devices. For the Virtex-5, the percentage ratio of occupied slices for standard to truncated multipliers increases from 40% to 73.86%, whereas for the Spartan-3AN and Virtex-4 FPGA devices it decreases from 68.75% to 58.78%. Another important feature visible in Fig. 3 is that the Virtex-5 FPGA device uses almost the same resources for the standard multiplier as the Spartan-3AN and Virtex-4 FPGA devices use for the truncated multiplier.
The abnormality in average connection delay and maximum pin delay for the Spartan-3AN device (12×12-bit standard and truncated multipliers) has been significantly reduced in the Virtex-4 and Virtex-5 FPGA devices, as shown in Fig. 6 and 7.

CONCLUSION
In this study we have presented the hardware design and implementation of an FPGA-based parallel architecture for standard and truncated multipliers using VHDL. The design was implemented on Xilinx Spartan-3AN XC3S700AN, Virtex-4 XC4VLX40 and Virtex-5 XC5VLX40 FPGA devices using the ISE 9.2i design tool. The objective was to present a comparative study of the standard and truncated multipliers. The truncated multiplier shows a much lower device utilization than the standard multiplier. For both the standard and truncated multipliers, the average connection delay and maximum pin delay are significantly reduced in the Virtex-4 and Virtex-5 FPGA devices. The Virtex-5 FPGA device achieves better results than the Spartan-3AN and Virtex-4 FPGA devices and is a viable FPGA device for image processing, multimedia and digital signal processing.