Ultra Low Power MUX Based Compressors for Wallace and Dadda Multipliers in Sub-threshold Regime

Abstract: Modern column compression multipliers offer a highly efficient solution to the binary multiplication problem and are well suited to VLSI implementation. Much of the existing analysis focuses on compressor circuits, particularly those built with Multiplexer (MUX) designs. Conventionally, compressors are decomposed into XOR gates and MUXs. In this study, fully MUX-based compressors using CMOS transmission gate logic are proposed to optimize the overall Power-Delay Product (PDP). The proposed compressors are also used in the design and comparative analysis of 4×4-bit and 8×8-bit Wallace and Dadda multipliers operating in the sub-threshold regime. The multipliers based on the proposed compressor designs have been simulated in 45 nm CMOS technology at supply voltages ranging from 0.3 to 0.5 V. The results show an average 89% improvement in the PDP of the proposed compressor blocks compared with existing published results in the sub-threshold regime. The multipliers designed using the proposed compressor blocks also show improved results.


Introduction
To sustain the rapid growth of high-performance, high-fidelity applications, future system designs will need to incorporate low-power, energy-efficient modules. The design of such modules relies partly on reducing power dissipation in fundamental arithmetic units such as adders and multipliers. This motivates the design of energy-efficient Wallace and Dadda column compression multipliers in the sub-threshold regime, an area in which few published works are available. Wallace and Dadda multipliers consist of three fundamental parts: a partial product reduction module that reduces the partial product matrix to an addition of only two operands, compressors that perform the partial product addition, and a final adder that computes the binary result (Wallace, 1964; Dadda, 1965; Jayaraju et al., 2011). Generally, the partial product reduction part of the multiplier contributes the most to power consumption, delay and layout area. Law et al. (1999) presented a low-power circuit for a 16×16-bit Wallace multiplier in which the 4-2 compressor circuitry uses a non-full-swing pass-transistor carry generator for low-power operation. Karuna and Keshab (2001) explored various low-power higher-order 4-2 and 5-2 compressor units that achieve better delay and power consumption owing to modified XOR and MUX circuits. Chang et al. (2004) presented several designs of 4-2 and 5-2 compressors capable of operating at ultra-low supply voltages, over the range 0.6 to 3.3 V. There, the XOR-XNOR module eliminates the weak logic levels on the internal nodes of the pass transistors with a pair of feedback PMOS-NMOS transistors.
Nirlakalla et al. (2011) used 4-3, 5-3, 6-3 and 7-3 compressors for high-speed multiplication. All of these compressors are designed using only half adders and full adders, which reduces the vertical critical path more rapidly than conventional compressors. The designed compressors reduce the number of steps required in the bit reduction process, which increases the speed of the multipliers. Shen-Fu et al. (1998) designed a new 3-2 counter and 4-2 compressor with Double Pass-transistor Logic (DPL) to reduce the internal node capacitance on the critical path. The circuits are used to construct the partial-product summation tree in a parallel array multiplier, and improve both delay and power performance. Sreehari et al. (2007) compared 3-2, 4-2 and 5-2 compressors with existing architectures in 0.18 µm CMOS technology. The compressors are analyzed using CMOS and CMOS+ implementations of the XOR and MUX blocks, and the architectures perform better over the voltage range 0.9 to 3.3 V. Jorge and Reis (2012) designed energy-efficient 3-2 and 4-2 compressor architectures using two logic styles: traditional CMOS logic for the XOR-XNOR, and a combination of traditional CMOS logic and Transmission Gate (TG) logic for the MUX. Abdoreza et al. (2013) designed a 4-2 compressor by decomposing each XOR gate into three simpler gates among AND/NAND and OR/NOR with the same collective functionality. Their results show the superiority of the compressor design in terms of power, delay and PDP. Furthermore, five 54×54-bit binary multipliers based on this 4-2 compressor are faster, with 7% less delay and 14% less PDP than published results.
Menon and Radhakrishnan (2006) presented two high-speed 5-2 compressor architectures designed with XOR-XNOR circuits, which limit the carry propagation delay to a single compressor stage. The simulation results of these designs show a 25% improvement in speed compared with architectures reported in the literature for supply voltages ranging from 1.5 to 3.3 V. Ohsang et al. (2002) derived a 5-3 compression method from a fast 2-bit adder cell, which has two XOR gate delays on the critical path; a one-stage dynamic CMOS circuit is used for a highly customized design. The Multiply And Accumulate (MAC) unit designed using this 5-3 compressor shows a 14.3% speed improvement in terms of XOR delay.
In this study, energy-efficient MUX-based compressors for the sub-threshold regime have been designed. The proposed compressors are used in the design of 4×4-bit and 8×8-bit Wallace and Dadda multipliers. The multipliers comprise a MUX-based AND gate array for computing the partial products, MUX-based compressors for partial product addition, and a MUX-based Han-Carlson (HC) adder in the final stage of addition. The use of TG logic in the multiplier designs considerably reduces the PDP and the number of transistors. The proposed 2:1 and 4:1 MUX circuits eliminate the voltage degradation on the internal nodes of the TG by adding a buffer at the output node. The simulation results show that the 2:1 and 4:1 MUX-based compressor cells and the multiplier architectures function properly at supply voltages ranging from 0.3 to 0.5 V in 45 nm technology.
The rest of the paper is organized as follows. Section 2 describes in detail the circuit implementation of the compressor cells (the 2:1 MUX, 4:1 MUX, XOR gate and AND gate, and the 2-2, 3-2, 4-2, 5-2, 6-2 and 7-2 compressors), with a focus on their energy efficiency. Section 3 gives the simulation results of all the basic modules of the compressors. Section 4 presents the implementations of the 4×4-bit and 8×8-bit Wallace and Dadda multipliers using the proposed compressor cells. Section 5 describes the simulation methodology and the overall experimental results of the multipliers. All the proposed compressor cells and multipliers are characterized in terms of power, delay and Power-Delay Product (PDP). Finally, Section 6 presents a summary of the paper and the concluding remarks.

Basic Modules of Compressor and their Circuit Implementations
Conventionally, compressors are implemented as serially connected full adders and MUXs. At the gate level, high-input compressors are decomposed into XOR gates, and the carry generators are normally implemented with MUXs. Different designs can therefore be classified by their critical path delay, expressed as a number of primitive gates. Several XOR and MUX designs using different logic styles have been presented by Sreehari et al. (2007) and Zimmermann and Fichtner (1997). All the basic modules of the compressors have been implemented with TG logic in 45 nm technology for sub-threshold operation.

MUX Vs. XOR
Multiplexers, which select one of n input sources and route it to a single output line, are used as one method of reducing the number of integrated circuit packages required by a particular digital circuit design, which in turn reduces the cost of the system. The TG-based implementations of the 2:1 MUX, 4:1 MUX and XOR gate are shown in Fig. 1-3. The channel length of all transistors is fixed at 50 nm.
The modified TG-based 2:1 and 4:1 multiplexer modules eliminate the voltage degradation on the internal nodes of the TG by adding a buffer at the output node. The designed circuit quickly isolates multiple signals with a minimal investment in board area and with negligible degradation in the characteristics of those critical signals. This design provides true bidirectional connectivity without degradation of the input signal. The output buffer is formed by two cascaded inverters, with the first inverter half the size of the output inverter in order to cut down the power dissipation.

Compressor
Compressors are bit-compressing cells whose principal application is in multi-operand addition and multiplication hardware. The performance of the compressors therefore determines the efficiency of multiplication-intensive computations. A 4-2 compressor cell can be implemented in many different logic structures. In general, however, it comprises three main modules: the first generates the XOR/XNOR function, the second generates the sum, and the last produces the carry output. Figure 4 shows the conventional and proposed architectures of a sample 4-2 compressor.
The conventional and proposed 4-2 compressors are shown in Fig. 4 and 5, respectively, where '4' is the number of input bits. The four inputs (X0, X1, X2 and X3) and the output SUM have the same weight, as shown in Fig. 4a. The output Carry is weighted one binary bit order higher, and the 4-2 compressor receives an input Cin1 from the preceding module of one binary bit order lower in significance. It produces an output Cout1 to the next compressor module of higher significance, as shown in Fig. 4b. At the gate level, high-input compressors are decomposed into XOR gates, and the carry generators are normally implemented using multiplexers, as shown in Fig. 4c.
In Fig. 5a, the proposed 4-2 compressor has four inputs (X0, X1, X2 and X3) and three outputs (SUM, Carry0 and Carry1). In contrast to the conventional design, the proposed 4-2 compressor is composed of one 3-2 compressor and two 2-2 compressors, as shown in Fig. 5b. The proposed 4-2 compressor receives no Cin input from the adjacent compressor. The 3-2 and 2-2 compressors are decomposed into multiplexers only, which in turn are implemented using the TG family, as shown in Fig. 5c.
The conventional and proposed 4-2 compressors obey the fundamental relations given in Equations 1 and 2, respectively.
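Equations 1 and 2 are not reproduced in this text. As a hedged reconstruction from the weights described above, they would plausibly take the following form. The weight of 4 on Carry1 in the proposed design is an assumption, inferred from the note that the highest carry bit is the most significant output; the symbol names follow the inputs and outputs named for Fig. 4 and 5.

```latex
\begin{align}
X_0 + X_1 + X_2 + X_3 + C_{in1} &= \mathrm{SUM} + 2\,(\mathrm{Carry} + C_{out1}) \tag{1} \\
X_0 + X_1 + X_2 + X_3 &= \mathrm{SUM} + 2\,\mathrm{Carry0} + 4\,\mathrm{Carry1} \tag{2}
\end{align}
```

In both cases the left-hand side counts the number of input bits that are one, and the right-hand side encodes that count in the weighted outputs.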

Designs of Lower and Higher Level Compressors
In this study, the compressors are divided into two groups: the lower-level compressors, comprising the 2-2 and 3-2 compressors, and the higher-level compressors, comprising the 4-2, 5-2, 6-2 and 7-2 compressors. Higher-level compressors can be derived from single-bit adder circuits. They have four, five, six or seven inputs and three outputs, and are built from the lower-level compressors. The input combinations, the corresponding decimal counts and the functionality of all the proposed compressors are shown in Table 1.

Lower Level 2-2 and 3-2 Compressors
The 2-2 and 3-2 compressors are widely used building blocks for high-precision, energy-efficient column compression multipliers. A 3-2 compressor can also be employed as a full adder cell, which takes three inputs X1, X2 and X3 and generates two outputs, Sum 'S' and Carry 'C'. A 2-2 compressor acts as a half adder cell, which takes two inputs X1 and X2 and generates two outputs, Sum 'S' and Carry 'C'. The proposed energy-efficient compressors have been implemented using TG-based 2:1 and 4:1 MUXs in the sub-threshold regime, as shown in Fig. 6. To demonstrate the efficiency of the new designs, we have analyzed the power consumption and other general characteristics of the 2-2 and 3-2 compressor designs against several published low-power compressors. The channel length of all transistors is fixed at 50 nm.
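As a behavioural illustration of how half and full adder functions can be expressed purely with multiplexers, the following minimal Python sketch models a 2-2 compressor with 2:1 MUXs and a 3-2 compressor with 4:1 MUXs. The particular data and select wiring is an assumption chosen for illustration and is not taken from Fig. 6; the function names are hypothetical.

```python
from itertools import product

def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0

def mux4(s1, s0, d0, d1, d2, d3):
    """4:1 multiplexer with select bits (s1, s0); s1 is the high bit."""
    return (d0, d1, d2, d3)[2 * s1 + s0]

def inv(x):
    """Inverter for a single bit."""
    return 1 - x

def compressor_2_2(x1, x2):
    """Half adder from 2:1 MUXs (assumed wiring). Returns (carry, sum)."""
    s = mux2(x1, x2, inv(x2))  # XOR: pass x2 or its complement
    c = mux2(x1, 0, x2)        # AND: pass 0 or x2
    return c, s

def compressor_3_2(x1, x2, x3):
    """Full adder from 4:1 MUXs (assumed wiring). Returns (carry, sum)."""
    s = mux4(x1, x2, x3, inv(x3), inv(x3), x3)  # three-input XOR
    c = mux4(x1, x2, 0, x3, x3, 1)              # majority function
    return c, s

if __name__ == "__main__":
    # Exhaustive check: the outputs must encode the number of ones.
    for bits in product((0, 1), repeat=2):
        c, s = compressor_2_2(*bits)
        assert 2 * c + s == sum(bits)
    for bits in product((0, 1), repeat=3):
        c, s = compressor_3_2(*bits)
        assert 2 * c + s == sum(bits)
    print("2-2 and 3-2 MUX models verified")
```

The model captures logic function only; it says nothing about the power, delay or voltage-level behaviour of the TG circuits.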
The proposed compressors operate on sub-threshold conduction currents to perform circuit operations and give an overall PDP improvement as compared to traditional compressors.

Higher Level 4-2, 5-2, 6-2 and 7-2 Compressors
The proposed compressors follow the standard hierarchical design approach, in which the higher-level compressors are built from lower-level compressors. In the proposed higher-level compressors, carry propagation remains within the block, which simplifies the design. The internal output carries (Cout1, Cout2 and Cout3) from one internal block act as carry inputs to another block, and the compressor finally generates one SUM and two carry (Carry1, Carry2) outputs, as shown in Fig. 7.
[Table 1: Input combinations and corresponding outputs of the proposed compressors. Recoverable entries: when any three inputs are one, the 4-2, 5-2, 6-2 and 7-2 compressors each give (0,1,1); when all the inputs are one, the 2-2, 3-2, 4-2, 5-2, 6-2 and 7-2 compressors give (1,0), (1,1), (1,0,0), (1,0,1), (1,1,0) and (1,1,1), respectively. Note: C, C1 and C2 are the carry bits and S is the sum bit of the compressors; C2 is the most significant bit and S is the least significant bit.]
In Fig. 7, the primary inputs are X1, X2, X3, ..., X7 and the primary outputs are Sum 'S', Carry1 'C1' and Carry2 'C2'. These carry bits propagate to the next level of compressors as input bits. The compressors have been designed in such a way that they do not require a carry input from any adjacent compressor module.
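The hierarchical construction can be illustrated with a short behavioural sketch of the proposed 4-2 compressor, composed of one 3-2 compressor and two 2-2 compressors as described for Fig. 5b. The exact wiring between the blocks is an assumption (the figures are not reproduced here), and the function names are hypothetical; outputs are returned as (C2, C1, S) with C2 the most significant bit.

```python
from itertools import product

def compressor_2_2(x1, x2):
    """Half adder behavioural model. Returns (carry, sum)."""
    return x1 & x2, x1 ^ x2

def compressor_3_2(x1, x2, x3):
    """Full adder behavioural model. Returns (carry, sum)."""
    total = x1 + x2 + x3
    return total >> 1, total & 1

def compressor_4_2(x0, x1, x2, x3):
    """4-2 compressor without carry input, built from one 3-2 and two 2-2
    blocks (assumed wiring). Returns (c2, c1, s)."""
    c_a, s_a = compressor_3_2(x0, x1, x2)  # compress the first three inputs
    c_b, s = compressor_2_2(s_a, x3)       # fold in the fourth input
    c2, c1 = compressor_2_2(c_a, c_b)      # combine the two weight-2 carries
    return c2, c1, s

if __name__ == "__main__":
    # The outputs must encode the number of ones among the inputs.
    for bits in product((0, 1), repeat=4):
        c2, c1, s = compressor_4_2(*bits)
        assert 4 * c2 + 2 * c1 + s == sum(bits)
    print("4-2 compressor model verified")
```

Because all carries are resolved inside the block, no carry input from a neighbouring compressor is needed, which matches the property stated above.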

Simulation Results for Basic Modules of Compressors
All the basic modules and compressors of the referenced architectures, as cited in Table 3, were designed in the Cadence Virtuoso EDA tool using 45 nm technology libraries at typical (TT) conditions. All modules are simulated at a 0.4 V supply voltage to obtain their results for sub-threshold operation. Table 2 shows the results of the referenced architectures in terms of power, delay and PDP. Tables 3 to 5 give the measured power, delay and PDP of the proposed basic modules for supply voltages varying from 0.3 to 0.5 V in steps of 0.5 V for sub-threshold operation. These results show that the proposed modules function properly at supply voltages as low as 0.3 V.
The overall PDP results of the proposed compressor cells given in Table 6 are better than the results of the referenced architectures at a 0.4 V supply voltage given in Table 3. Bar charts of the results of the proposed modules are shown in Fig. 8.

Dadda and Wallace Multipliers
Two of the best-known column compression multipliers were presented by Wallace and Dadda. The two architectures are similar, differing in the procedure for reducing the partial products and in the size of the final adder. In Wallace's scheme, the partial products are reduced as soon as possible, whereas Dadda's method performs the minimum reduction necessary at each level. The final adder in the Wallace multiplier is also slightly smaller than that in the Dadda multiplier. The basic standard cells are the same in both the 4×4-bit and 8×8-bit column compression Wallace and Dadda multipliers.
The block diagram of the n × n-bit column compression multipliers (Wallace and Dadda) using compressors is shown in Fig. 9.
These multipliers are composed of three modules:
• A partial product generate module
• Lower-level compressors (2-2 and 3-2) and higher-level compressors (4-2, 5-2, 6-2 and 7-2) to reduce the partial product matrix to an addition of only two operands
• An HC adder for the final computation of the binary result

Partial Product Generate (PPG)
Conventionally, in Wallace and Dadda multipliers the partial products are re-arranged in a reverse pyramid style. The PPG module produces the partial products on which column compression is performed in both the Wallace and Dadda multipliers.
The proposed PPG module consists of an array of MUX-based AND gates, where each AND gate is implemented using a 2:1 MUX, as shown in Fig. 10.
The performance metrics considered for the proposed PPG modules are power, delay and PDP. To observe the effect of supply voltage on these metrics, the proposed circuits are simulated at supply voltages ranging from 0.3 to 0.5 V, as shown in Table 6.
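The partial product generation described above can be sketched behaviourally as follows. Each AND gate is modelled as a 2:1 MUX that passes either 0 or the second operand bit; which operand bit drives the select line is an assumption, as Fig. 10 is not reproduced here, and the function names are hypothetical. Partial products are grouped by column weight, giving the dot matrix that the compressors later reduce.

```python
def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0

def and_mux(a, b):
    """AND gate realised with a 2:1 MUX (assumed wiring): a selects 0 or b."""
    return mux2(a, 0, b)

def partial_product_columns(a, b, n):
    """Return 2n columns; column w holds the partial product bits of weight 2**w."""
    columns = [[] for _ in range(2 * n)]
    for i in range(n):
        a_i = (a >> i) & 1
        for j in range(n):
            b_j = (b >> j) & 1
            columns[i + j].append(and_mux(a_i, b_j))
    return columns

def column_value(columns):
    """Weighted sum of all bits in the column matrix."""
    return sum(sum(col) << weight for weight, col in enumerate(columns))

if __name__ == "__main__":
    cols = partial_product_columns(0b1111, 0b1000, 4)
    print([len(c) for c in cols])  # column heights of the 4x4 dot matrix
    print(column_value(cols))      # equals 15 * 8
```

For a 4×4-bit multiplier the column heights are 1, 2, 3, 4, 3, 2, 1 and 0, which is the starting point for both reduction schemes.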

Column Compression Technique for Dadda Multiplier
The arrangement of the partial products and the reduction stages for an 8×8-bit Dadda multiplier is shown in Fig. 11. The dots represent the partial products. The partial product matrix is reduced to a height of two using the column compression procedure developed by Dadda. The iterative procedure is as follows:
• Start from the minimum column height h_1 = 2 and calculate the remaining column heights using h_(j+1) = floor(1.5 × h_j) for increasing values of j. Continue until the largest j is reached for which h_j is still less than the maximum column height of the multiplier being designed. This gives h_1 = 2, h_2 = 3, h_3 = 4, h_4 = 6, h_5 = 9 and so on. For example, in the first stage of the 8×8-bit Dadda multiplication shown in Fig. 11a, the maximum column height is 8, so h_j is 6 and the column heights are reduced to a maximum of 6. Similarly, in the second stage, shown in Fig. 11b, the maximum column height is 6 and h_j is 4, so the column heights are reduced to a maximum of 4
• All the columns with heights greater than h_j are reduced to a height of h_j using compressors of different sizes. If the column height has to be reduced by one, use a 2-2 compressor; if it has to be reduced by two, use a 3-2 compressor. A 4-2 compressor is used if the height has to be reduced by 3, a 5-2 compressor if it has to be reduced by 4, and so on. Continue this step until the column height is reduced to h_j
• The iterations continue until two elements remain in each queue. Once this state has been reached, the reduction phase is complete and the result can be fed to the final adder
• The first elements of all the queues form the first input to the adder and the second elements form the second input to the adder
An energy-efficient HC adder is used for the final summation in the 4×4-bit and 8×8-bit Wallace and Dadda multipliers. The column compression scheme and final computation using the HC adder for the 4×4-bit Dadda multiplier are shown in Fig. 12.
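The height sequence used in the Dadda procedure above can be computed with a few lines of Python. This is a minimal sketch with hypothetical function names; it covers only the choice of target heights per stage, not the placement of the compressors.

```python
def dadda_heights(max_height):
    """Return h_1 = 2, h_(j+1) = floor(1.5 * h_j), keeping terms below max_height."""
    heights = [2]
    while (heights[-1] * 3) // 2 < max_height:
        heights.append((heights[-1] * 3) // 2)
    return heights

def dadda_stage_targets(max_height):
    """Target maximum column height for each reduction stage, in order."""
    return [h for h in reversed(dadda_heights(max_height)) if h < max_height]

if __name__ == "__main__":
    print(dadda_stage_targets(8))  # 8x8-bit multiplier
    print(dadda_stage_targets(4))  # 4x4-bit multiplier
```

For the 8×8-bit case the stage targets are 6, 4, 3 and 2, which agrees with the first two stages described for Fig. 11a and 11b.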

Column Compression Technique for Wallace Multiplier
The arrangement of the partial products and the reduction stages for an 8×8-bit Wallace multiplier is shown in Fig. 13. The dots represent the partial products.
The iterative procedure for reducing the column compression matrix to a height of 2 using higher-level compressors is described below:
• Find the maximum height of the columns in the dot matrix array. If it is greater than 2, reduce the height by following the recursive procedure described below
• Check the height of each column. If it is 1, no reduction is done. If it is 2, use a 2-2 compressor. Use a 3-2, 4-2, 5-2 or 6-2 compressor if the height of the column is 3, 4, 5 or 6, respectively; otherwise use a 7-2 compressor and check the height of the column again. Continue the reduction until the height of the column becomes ≤1
• Repeat the above step for all other columns and, at the end, en-queue the 'sum' strings of all the counters into the same queues. The single carry of the 2-2 and 3-2 compressors is en-queued into the next queue. For the 4-2, 5-2, 6-2 and 7-2 compressors, the carry Carry1 is en-queued into the next queue and the carry Carry2 is en-queued into the queue following it
• Again find the maximum height of the columns and continue the reduction using the above recursive procedure until the maximum height reaches 2
• Stop the reduction when the height of the matrix becomes two, after which it can be fed to the final adder. Once this state has been reached, the reduction phase is complete
• Once the height of the matrix is reduced to two, an adder is used to generate the final product
The column compression scheme and final computation using the HC adder for the 4×4-bit Wallace multiplier are shown in Fig. 14.
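The bookkeeping of column heights in the Wallace procedure above can be sketched as follows. This is an illustrative model only, with hypothetical function names: it tracks how many bits each column holds, not their values, and it assumes that a column taller than 7 is handled by applying a 7-2 compressor and then re-checking the column, with the new sum bit counted in the remaining height.

```python
def wallace_stage(heights):
    """Apply one reduction stage to a list of column heights (LSB column first)."""
    new = [0] * (len(heights) + 2)
    for col, h in enumerate(heights):
        while h >= 2:
            k = min(h, 7)          # size of the k-2 compressor used
            h = h - k + 1          # k bits are replaced by one sum bit
            new[col + 1] += 1      # single carry, or Carry1
            if k >= 4:
                new[col + 2] += 1  # Carry2 of the 4-2 to 7-2 compressors
        new[col] += h              # the sum bit, or an untouched single bit
    while new and new[-1] == 0:    # drop empty high-order columns
        new.pop()
    return new

def wallace_reduce(heights, max_stages=32):
    """Repeat stages until every column has height <= 2; return each stage."""
    stages = []
    while max(heights) > 2 and len(stages) < max_stages:
        heights = wallace_stage(heights)
        stages.append(heights)
    return stages

if __name__ == "__main__":
    for stage in wallace_reduce([1, 2, 3, 4, 3, 2, 1]):  # 4x4-bit dot matrix
        print(stage)
```

Under these assumptions the 4×4-bit dot matrix reaches a maximum height of two after two stages; the stage count of the actual design in Fig. 14 may differ.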

Simulation Results for Dadda and Wallace Multipliers
A parametric analysis varying the PMOS transistor width was carried out to observe the power consumption. The power consumption is lowest when both the NMOS and PMOS transistors are of minimum size. Conceptually, for the same current to flow in the PMOS and NMOS transistors, their W/L ratios should be in the inverse ratio of the hole-to-electron mobility ratio. For a symmetrical design, the W/L ratios of the PMOS and NMOS transistors are therefore taken in the ratio 2. The complete ASIC implementation of the proposed 4×4-bit and 8×8-bit Wallace and Dadda designs is also carried out using the Cadence design flow. The proposed design has been developed in Verilog HDL and synthesized in Encounter RTL Compiler using typical libraries of 45 nm technology at the nominal supply voltage (0.7 V) with a semi-custom design technique. A test bench is created for simulation and logic verification with the NCSIM simulator. Cadence SoC Encounter is used for Placement & Routing (P&R). Parasitic extraction is performed using the Encounter native RC extraction tool. The extracted parasitic RC (SPEF format) is back-annotated to the Common Timing Engine in the Encounter platform for static timing analysis. The pre-layout and post-layout ASIC implementation results obtained with the semi-custom design technique are shown in Table 7. Table 8 shows the results of the full-custom designs of the 4×4-bit and 8×8-bit Wallace and Dadda multipliers in the sub-threshold regime. The performance parameters are power, delay and PDP. To observe the effect of varying the supply voltage on these parameters, the circuits are simulated at voltages ranging from 0.3 to 0.5 V in steps of 0.5 V. The circuits are also verified to be functional at the slow-slow and fast-fast corners.
The delay of the designed multiplier circuit is proportional to the logarithm of the number of bits in the multiplier and to the delay of its building blocks. To measure the critical path delay and to verify the functionality of the multipliers, n test patterns have been applied. The critical path delay has been found for the input combinations {a 3-0 = 1111 and b 3-0 = 1000} for the 4×4-bit Wallace and Dadda multipliers and {a 7-0 = 11111111 and b 7-0 = 00001000} for the 8×8-bit Wallace and Dadda multipliers. The 4×4-bit and 8×8-bit Wallace and Dadda multipliers of the referenced architectures, as cited in Table 9, are designed in 45 nm CMOS technology at a 0.4 V supply voltage to obtain their results for sub-threshold operation. Table 9 compares the simulation results of the proposed design with those of the referenced architectures, which use conventional compressor blocks to implement the multiplier architectures in the sub-threshold regime.
The multipliers designed using the proposed compressors show an overall reduction in PDP compared with conventional multiplier architectures and give their best results at a 0.4 V power supply. In addition, the PDP improvement for the Dadda multiplier is observed to be better than that for the Wallace multiplier in the sub-threshold regime. The PDP of both multipliers is shown graphically in Fig. 15.

Conclusion
Compressors are the basic building blocks of column compression multipliers and hold the key to minimizing the power consumption of the overall circuit. Selecting appropriate compressor cells can therefore significantly improve the overall multiplication computation. The use of compressors in the multipliers reduces the overall PDP because fewer stages are required. The main focus of this paper was to optimize the overall PDP of multiplexer-based compressors using the TG logic family in the sub-threshold regime. The proposed library of power-efficient compressors has been used in the design of low-power 4×4-bit and 8×8-bit Wallace and Dadda multipliers in 45 nm technology at supply voltages ranging from 0.3 to 0.5 V. The results show, on average, 89% and 96.8% improvement in the PDP of the proposed compressor blocks and multipliers, respectively, at a supply voltage of 0.4 V when compared with the referenced designs. Future work includes verifying the results for multipliers with larger operand sizes.