Design and Field Programmable Gate Array Implementation of Basic Building Blocks for Power-Efficient Baugh-Wooley Multipliers

Problem statement: With growing demand for portable computing and communication systems, power-efficient multipliers play an important role. In these multipliers, basic multiplication follows the Baugh-Wooley algorithm, which can readily be implemented on Field Programmable Gate Array (FPGA) devices. Because the development cost of Application Specific Integrated Circuits (ASICs) is high, such algorithms should be verified and optimized before implementation. Approach: This study presents the design and implementation of the basic building blocks used in power-efficient Baugh-Wooley multipliers on a Spartan-3AN FPGA device. The building blocks are described in Very High Speed Integrated Circuit Hardware Description Language (VHDL). Results: The basic building blocks of power-efficient Baugh-Wooley multipliers showed reasonable FPGA resource utilization, which indicates that the rest of the available resources could be used for other embedded functions. Conclusion: The Spartan-3AN FPGA device is a reasonable target for implementing these basic building blocks.


INTRODUCTION
The power-efficient multiplier plays an important role in Very Large-Scale Integration (VLSI) systems as demand for portable computing and communication systems grows. These multipliers often follow the Baugh-Wooley algorithm (Baugh and Wooley, 1973). Multiplication is an essential operation in many algorithms used in Digital Signal Processing (DSP). Over the years, the computational complexity of the algorithms run on Digital Signal Processors (DSPs) has gradually increased. Therefore, DSP requires fast and efficient parallel multipliers for general-purpose as well as application-specific architectures.
In many situations, multipliers lie directly in the critical path, and the processing speed is ultimately limited by the multiplication speed. The growth of portable multimedia systems requires low-power operators in order to maintain reliability and provide longer operating time. Multipliers are among the main contributors to total power consumption, in particular because performance requirements often necessitate the use of high-speed parallel multipliers.
Power-efficient design of such parallel multipliers is very important.
In many cases, implementing a DSP algorithm demands the use of Application Specific Integrated Circuits (ASICs). In particular, if the processing has to be performed in real time, such algorithms have to sustain high throughput rates. This is especially true of image processing applications. Since development costs for ASICs are high, algorithms should be verified and optimized before implementation.
Nevertheless, VLSI technology has matured to the point where a hardware implementation has become a desirable alternative. Significant speedup in computation time can be achieved by assigning computation-intensive tasks to hardware and by exploiting the parallelism in algorithms. Recently, Field Programmable Gate Arrays (FPGAs) have emerged as a platform of choice for efficient hardware implementation of computation-intensive algorithms. FPGAs have therefore become a viable technology and an attractive alternative to ASICs (Maxfield, 2004; Todman et al., 2005). Applications such as DSP, image processing and multimedia require extensive use of multiplication and squaring functions (Sheu and Lin, 2002; Walters et al., 2003). In many DSP algorithms, such as digital filters, the Discrete Cosine Transform (DCT) and the wavelet transform, it is desirable to provide both full-precision multiplication and fixed-width multiplication (Lim, 1992; Schulte and Swartzlander, 1993; Kidambi et al., 1996; Swartzlander, 1999; Van et al., 2000; Van and Yang, 2005), the latter producing an n-bit product from an n-bit multiplier and an n-bit multiplicand with low error. If the product is truncated to n bits, the least-significant columns of the product matrix contribute little to the final result. To take advantage of this, truncated multipliers and squarers do not form all of the least-significant columns in the partial-product matrix (Stine and Duverne, 2003). As more columns are eliminated, the area and power consumption of the arithmetic unit are significantly reduced, and in many cases the delay also decreases. The fixed-width multipliers derived from the Baugh and Wooley (1973) multiplier produce an n-bit product from an n-bit multiplier and an n-bit multiplicand. Area can be saved in a fixed-width multiplier by directly truncating the n least-significant columns and preserving the n most-significant columns. However, truncating the multiplier matrix introduces additional error into the computation.
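The trade-off described above can be checked numerically. The following Python sketch is a simplified model, not the design used in this study: it assumes unsigned operands (the Baugh-Wooley multipliers considered here are signed), and the function names are hypothetical. It forms the partial-product matrix column by column, omits a chosen number of least-significant columns and reports the worst-case error introduced.

```python
def partial_product_columns(x, y, n):
    """Group the bits x_i AND y_j of an unsigned n x n multiplier by column i + j."""
    columns = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    return columns


def truncated_product(x, y, n, dropped):
    """Sum the partial-product matrix, omitting the `dropped` least-significant columns."""
    columns = partial_product_columns(x, y, n)
    return sum(sum(col) << k for k, col in enumerate(columns) if k >= dropped)


def max_truncation_error(n, dropped):
    """Worst-case difference between the exact and the truncated product."""
    return max(
        x * y - truncated_product(x, y, n, dropped)
        for x in range(1 << n)
        for y in range(1 << n)
    )


if __name__ == "__main__":
    n = 4
    for dropped in range(n + 1):
        print(dropped, max_truncation_error(n, dropped))
```

With no columns dropped the model reproduces the exact product. For n = 4, dropping the four least-significant columns gives a worst-case error of 49, reached when every dropped partial-product bit is 1.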
Cryptography applications also require not only a significant number of multiplication and squaring operations but also large integers (Stallings, 2010). An efficient realization of multiplication can have a significant impact on such applications in terms of speed, power dissipation and area. Many research efforts have been presented in the literature to achieve hardware-efficient implementation of truncated multipliers. The basic idea of these techniques is to discard some of the less significant partial products and to introduce a compensation circuit that partly compensates for the dropped terms, thereby reducing the approximation error (Jou et al., 1999; Kuang and Wang, 2006; Strollo et al., 2005). Garofalo et al. (2008) presented a truncated multiplier with minimum square error for every input bit width. Rais (2009a; 2009b) presented the design and implementation of fixed-width standard and truncated multipliers using FPGA devices. The objective of this study is to present the design and implementation of the basic building blocks used in power-efficient Baugh-Wooley multipliers (Kuang and Wang, 2007; Tu and Van, 2009) on a Spartan-3AN FPGA device.
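To illustrate the compensation idea in its simplest form, the Python sketch below uses a constant correction: a fixed value close to the average weight of the dropped partial products is added back. This scheme is chosen here only for illustration and is not the compensation circuit of any of the works cited above; it also assumes unsigned operands and uniformly distributed inputs, and all function names are hypothetical.

```python
from fractions import Fraction


def dropped_sum(x, y, n):
    """Weight of the partial-product bits x_i AND y_j lying in columns i + j < n."""
    return sum(
        (((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
        for i in range(n)
        for j in range(n)
        if i + j < n
    )


def mean_dropped_sum(n):
    """Average dropped weight over all pairs of unsigned n-bit inputs."""
    total = sum(dropped_sum(x, y, n) for x in range(1 << n) for y in range(1 << n))
    return Fraction(total, 1 << (2 * n))


def error_range(n, correction):
    """Smallest and largest residual error once `correction` is added back."""
    errors = [
        dropped_sum(x, y, n) - correction
        for x in range(1 << n)
        for y in range(1 << n)
    ]
    return min(errors), max(errors)


if __name__ == "__main__":
    print(mean_dropped_sum(4))   # average weight of the dropped columns
    print(error_range(4, 0))     # no compensation
    print(error_range(4, 12))    # constant compensation near the mean
```

For n = 4 the mean dropped weight is 49/4 = 12.25, and a constant correction of 12 narrows the error range from [0, 49] to [-12, 37]. The schemes in the literature adapt the correction to the operands in order to reduce the error further.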

Mathematical basis of Baugh-Wooley multipliers:
Consider the multiplication of two n-bit 2's-complement inputs X and Y, which can be represented by:

$$X = -x_{n-1}2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i, \qquad Y = -y_{n-1}2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i \tag{1}$$

where $x_i, y_i \in \{0,1\}$. The 2n-bit full-precision product $P_{FP}$ can be written as:

$$P_{FP} = X \cdot Y = x_{n-1}y_{n-1}2^{2n-2} + \sum_{i=0}^{n-2}\sum_{j=0}^{n-2} x_i y_j 2^{i+j} + 2^{n-1}\left(-2^{n-1} + \sum_{j=0}^{n-2} \overline{x_{n-1}y_j}\,2^{j} + 1\right) + 2^{n-1}\left(-2^{n-1} + \sum_{i=0}^{n-2} \overline{y_{n-1}x_i}\,2^{i} + 1\right) \tag{2}$$

Equation 2 shows the Baugh-Wooley algorithm (Tu and Van, 2009). Figure 1 shows the partial-product array for n × n 2's-complement multiplication in the Baugh-Wooley multiplier, where the parameter w means that the n + w most significant columns of the partial-product array are kept for fixed-width multiplication. If w = n, the fixed-width multiplier becomes a full-precision multiplier.
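As a sanity check on Equation 2, the Python sketch below evaluates its four terms directly and compares the result with ordinary signed multiplication. It is a behavioural model only, not the VHDL used in this study, and the function names are hypothetical. The complemented products are modelled as 1 minus the AND of the two bits.

```python
def bits(value, n):
    """Return the n low-order 2's-complement bits of value, least significant first."""
    return [(value >> k) & 1 for k in range(n)]


def baugh_wooley_product(x, y, n):
    """Evaluate the Baugh-Wooley form of X * Y for signed n-bit x and y."""
    xb, yb = bits(x, n), bits(y, n)
    half = 1 << (n - 1)

    # Product of the two sign bits, weighted by 2^(2n-2).
    sign_term = (xb[n - 1] & yb[n - 1]) << (2 * n - 2)

    # Ordinary AND partial products of the magnitude bits.
    core = sum(
        (xb[i] & yb[j]) << (i + j)
        for i in range(n - 1)
        for j in range(n - 1)
    )

    # Complemented (NAND) partial products involving one sign bit each.
    row_x = sum((1 - (xb[n - 1] & yb[j])) << j for j in range(n - 1))
    row_y = sum((1 - (yb[n - 1] & xb[i])) << i for i in range(n - 1))

    return (
        sign_term
        + core
        + half * (-half + row_x + 1)
        + half * (-half + row_y + 1)
    )


if __name__ == "__main__":
    n = 4
    lo, hi = -(1 << (n - 1)), 1 << (n - 1)
    mismatches = [
        (x, y)
        for x in range(lo, hi)
        for y in range(lo, hi)
        if baugh_wooley_product(x, y, n) != x * y
    ]
    print("mismatches:", mismatches)
```

The check is exhaustive over all signed n-bit operand pairs, so an empty mismatch list confirms that the complemented terms and the added constants together reproduce the negative-weight cross products of 2's-complement multiplication.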
Targeted FPGA device: Owing to their parallel nature, high operating frequency and high density, modern FPGAs are an ideal platform for implementing computationally intensive and massively parallel architectures. A brief introduction to the state-of-the-art Spartan-3 FPGA from Xilinx is presented below.

Spartan-3 FPGAs:
The Spartan-3 FPGA belongs to the fifth-generation Xilinx family. The family consists of eight members offering densities ranging from 50,000 to 5 million system gates. The Spartan-3 FPGA consists of five fundamental programmable functional elements: Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Block RAMs, dedicated 18×18 multipliers and Digital Clock Managers (DCMs). The Spartan-3 family includes the Spartan-3L, Spartan-3E, Spartan-3A, Spartan-3A DSP, Spartan-3AN and the extended Spartan-3A FPGAs. The Spartan-3AN is used as the target technology in this study. It combines all the features of the Spartan-3A FPGA family with leading technology in in-system flash memory for configuration and nonvolatile data storage (Xilinx, 2008).

FPGA design and implementation results and discussion:
The basic logic diagrams of the Baugh-Wooley processing elements were designed in VHDL and implemented on a Xilinx Spartan-3AN XC3S700AN FPGA (package: FGG484, speed grade: -5) using the Xilinx ISE 9.2i design tool. Figure 2 shows the basic logic diagrams of the Baugh-Wooley processing elements and Table 1 summarizes the FPGA resource utilization of these elements.
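Figure 2 is not reproduced in this text, so the exact cells are not specified here. As a minimal behavioural sketch, assuming a conventional array-multiplier cell (a partial-product gate feeding a full adder, with an inverted variant for the complemented terms of Equation 2), a processing element can be modelled in Python as follows. The function names and the `invert` flag are hypothetical and do not correspond to entities in the authors' VHDL.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: return (sum, carry_out)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def processing_element(x_bit, y_bit, sum_in, carry_in, invert=False):
    """Assumed array cell: AND (or NAND when invert is set) followed by a full adder."""
    partial_product = x_bit & y_bit
    if invert:
        partial_product ^= 1  # NAND variant for the complemented partial products
    return full_adder(partial_product, sum_in, carry_in)


if __name__ == "__main__":
    # Print the truth table of the non-inverting cell.
    for x_bit in (0, 1):
        for y_bit in (0, 1):
            for sum_in in (0, 1):
                for carry_in in (0, 1):
                    print(x_bit, y_bit, sum_in, carry_in,
                          processing_element(x_bit, y_bit, sum_in, carry_in))
```

Under this assumption the cell has four one-bit inputs and two one-bit outputs; whether that matches the cells summarized in Table 1 cannot be confirmed from the text alone.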

DISCUSSION
For the individual processing elements implemented on the Spartan-3AN FPGA device, the number of four-input Look-Up Tables (LUTs), the number of occupied slices, the number of bonded IOBs and the total equivalent gate count range from 1 to 2, 1 to 2, 3 to 7 and 6 to 18, respectively. In total, 30 of the 11,776 available four-input LUTs and only 20 of the 5,888 available slices are used.
The design and implementation of the basic building blocks of power-efficient Baugh-Wooley multipliers show reasonable FPGA resource utilization, which indicates efficient use of the FPGA resources. The remaining device resources could be used for the other parts of the multiplier.
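Expressed as fractions of the device, the totals quoted above are small. The short Python calculation below uses only the figures given in this section; the helper name is hypothetical.

```python
def utilization_percent(used, available):
    """Share of a device resource that is in use, in percent."""
    return 100.0 * used / available


# Totals reported in this section for the XC3S700AN.
lut_percent = utilization_percent(30, 11776)
slice_percent = utilization_percent(20, 5888)

print(f"Four-input LUTs: {lut_percent:.2f}%")
print(f"Occupied slices: {slice_percent:.2f}%")
```

Both figures are below 1%, which is consistent with the observation that most of the device remains available for the rest of the multiplier.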

CONCLUSION
In this study we have presented the hardware design and FPGA implementation of the basic logic diagrams of the Baugh-Wooley processing elements using VHDL. The design was implemented on a Xilinx Spartan-3AN XC3S700AN FPGA device using the ISE 9.2i design tool. The objective of the study was to present the design and implementation of the basic building blocks used in power-efficient Baugh-Wooley multipliers. These building blocks show reasonable FPGA resource utilization, which indicates efficient use of the FPGA resources. The remaining device resources could be used for the other parts of the multiplier. Future investigation will present the use of these basic logic processing elements in a fixed-width power-efficient multiplier.