Low Power Modulo $2^n+1$ Adder Based on Carry Save Diminished-One Number System

Abstract: Modulo $2^n+1$ adders are used in many applications, including RNS implementations. This paper presents a new number system, called Carry Save Diminished-one, for modulo $2^n+1$ addition, together with a novel addition algorithm for its operands. We also present a novel architecture for designing modulo $2^n+1$ adders, based on parallel-prefix carry computation units. CMOS implementations show that the resulting adders improve on previously reported solutions in terms of implementation area and delay.


INTRODUCTION
The Residue Number System (RNS) is a non-weighted number system [1] that maps large numbers to smaller residues, without any need for carry propagation between them [2]. Arithmetic operations such as addition, subtraction and multiplication can be performed on the residue digits concurrently and independently. Thus, using residue arithmetic would, in principle, increase the speed of computations [3,4,5].
RNS has shown high efficiency in special-purpose applications such as digital filters [6,7,8,9], image processing [10], RSA cryptography [11], and other applications in which only additions, subtractions and multiplications are used and the dynamic range of the numbers is fixed.
Special moduli sets have been used extensively to reduce the hardware complexity of converters and arithmetic operations [12,13]. Among them, the triple moduli set $\{2^n-1, 2^n, 2^n+1\}$ has some benefits [14]. Because of the operand lengths of these moduli, the operation delay is determined by the modulo $2^n+1$ channel. Therefore, the design of efficient modulo $2^n+1$ adders is critical [15]. Modulo $2^n+1$ operations are used in many applications, such as DSP algorithms [16], the Fermat Number Transform for eliminating round-off errors in convolution computations [17,18,19], cryptography [20] and pseudorandom number generation [21]. Modulo $2^n+1$ adders are also used as the last-stage adder of modulo $2^n+1$ multipliers.
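The channel-wise arithmetic described above can be illustrated with a minimal Python sketch for the triple moduli set $\{2^n-1, 2^n, 2^n+1\}$. The function names are hypothetical and chosen only for illustration, and the reverse conversion uses the textbook Chinese Remainder Theorem rather than any converter from the cited works.

```python
from math import prod

def moduli_set(n):
    """Return the triple moduli set (2^n - 1, 2^n, 2^n + 1)."""
    return (2**n - 1, 2**n, 2**n + 1)

def to_rns(x, moduli):
    """Forward conversion: one residue per channel."""
    return tuple(x % m for m in moduli)

def rns_add(xs, ys, moduli):
    """Each channel adds independently; no carry crosses channels."""
    return tuple((x + y) % m for x, y, m in zip(xs, ys, moduli))

def rns_mul(xs, ys, moduli):
    """Each channel multiplies independently."""
    return tuple((x * y) % m for x, y, m in zip(xs, ys, moduli))

def from_rns(residues, moduli):
    """Reverse conversion by the Chinese Remainder Theorem.

    Assumes pairwise coprime moduli, which holds for this moduli set.
    """
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return total % M

moduli = moduli_set(4)                     # (15, 16, 17), dynamic range 4080
x, y = to_rns(123, moduli), to_rns(29, moduli)
print(from_rns(rns_add(x, y, moduli), moduli))
print(from_rns(rns_mul(x, y, moduli), moduli))
```

The results are only meaningful while the true sum or product stays below the dynamic range, which is the product of the three moduli.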
In the last few years, several algorithms and architectures have been proposed for designing modulo $2^n+1$ adders. These algorithms are based on two number systems:
• To overcome the problem of (n+1)-bit wide circuits for the modulo $2^n+1$ channel, the diminished-one number system [17] has been proposed. Efficient adders for this system have been reported in [14, 22-25]. However, these adders need special treatment for zero operands. To address this problem, a new number representation called "Carry Save Diminished-one" (CSD-1) is proposed in this paper. With this system, addition with a zero operand does not need special treatment, which reduces the adder chip area.
However, the corresponding structure [15] uses a 3-operand adder, which is eliminated in our method. In this paper, we derive a new methodology for modulo $2^n+1$ adders that leads to a parallel-prefix adder architecture. Using an implementation in a CMOS technology, we show that the proposed parallel-prefix design methodology uses considerably less chip area than that reported in [23] (diminished-one number system), and less chip area and propagation delay than the approach reported in [15] (normal number system).
Therefore, the reduction modulo $2^n+1$ is computed by subtracting the high n-bit word from the low n-bit word and then conditionally adding $2^n+1$ if the subtraction yields a negative result.
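The subtract-then-correct reduction just described can be written as a short Python sketch. It assumes the input is a 2n-bit value (for example a product of two n-bit numbers); the function name is hypothetical. The method works because $2^n \equiv -1 \pmod{2^n+1}$, so a value $H \cdot 2^n + L$ is congruent to $L - H$.

```python
def reduce_mod_2n_plus_1(x, n):
    """Reduce a 2n-bit value x modulo 2^n + 1.

    Illustrative helper: the low n-bit word minus the high n-bit word,
    followed by a conditional addition of 2^n + 1.
    """
    if not 0 <= x < (1 << (2 * n)):
        raise ValueError("x must fit in 2n bits")
    mask = (1 << n) - 1
    low = x & mask          # low n-bit word
    high = x >> n           # high n-bit word
    diff = low - high
    if diff < 0:            # negative result: add the modulus once
        diff += (1 << n) + 1
    return diff

print(reduce_mod_2n_plus_1(200, 4) == 200 % 17)
```

A single conditional addition is enough because $L - H$ always lies between $-(2^n-1)$ and $2^n-1$.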

Diminished-One Number System:
In the diminished-one number system, the number A is represented by $A' = A - 1$ and the value zero is treated separately, i.e., it requires an additional zero indication bit. In this system, ordinary addition can be implemented by an end-around-carry parallel-prefix adder with $c_{in} = \overline{c_{out}}$ [17,25].
Algorithm 1 (modulo $2^n+1$ addition in the diminished-one number system): A number in diminished-one is represented by n+1 bits, in which the (n+1)th bit is used to indicate '0'. In [17], the modulo $2^n+1$ addition algorithm has been presented for zero and non-zero operands:
1) If the most significant bit (msb) of one addend is '1', inhibit the addition; the other addend is the sum (Fig. 1).
2) If the msb of both addends is '0', ignore the msb, add the n lsb's, complement the carry and add it to the n lsb's of the sum.
The modulo $2^n+1$ adder in Fig. 1 can be designed in different ways. To increase the speed of modulo addition, the delay of the carry computation should be minimized. In many papers, parallel-prefix adders have been proposed for this purpose. In the prefix technique, n inputs $x_{n-1}, x_{n-2}, \ldots, x_0$ and an arbitrary associative operator $\circ$ are used to compute n outputs, where each output $y_i$ depends on all inputs $x_j$ of the same or lower order ($j \le i$). In binary addition, carry propagation is a prefix problem. Prefix structures can be represented by a directed acyclic graph. The $\circ$ operator on a pair of $(g_i, p_i)$ terms is usually represented by a node, and a carry computation unit is represented as a tree-structured interconnection of such nodes. Several tree structures have been proposed in [29,30].
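Algorithm 1 can be checked with a small behavioural Python sketch. The tuple layout `(zero_flag, value_minus_one)` and the function names are assumptions made for illustration. The two steps quoted above do not say how a zero result is signalled; the sketch assumes it is detected as an overflow of the final increment, which is a modelling choice rather than a detail taken from [17].

```python
def to_dim1(a):
    """Encode A in [0, 2^n] as (zero_flag, A - 1); zero uses the flag."""
    return (1, 0) if a == 0 else (0, a - 1)

def from_dim1(x):
    zero_flag, v = x
    return 0 if zero_flag else v + 1

def dim1_add(x, y, n):
    """Diminished-one modulo 2^n + 1 addition (behavioural model)."""
    zx, ax = x
    zy, ay = y
    # Step 1: a zero addend inhibits the addition.
    if zx:
        return y
    if zy:
        return x
    # Step 2: add the n lsb's, complement the carry, add it back.
    mask = (1 << n) - 1
    t = ax + ay
    c_out = t >> n
    r = (t & mask) + (1 - c_out)
    if r >> n:              # assumption: overflow here marks a zero sum
        return (1, 0)
    return (0, r)

n = 4
s = dim1_add(to_dim1(16), to_dim1(16), n)
print(from_dim1(s) == (16 + 16) % 17)
```

The separate handling of the zero flag in step 1 is the overhead that the CSD-1 representation is intended to remove.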

CSD-1 Number System:
In the proposed method, we try to improve the performance of modulo $2^n+1$ arithmetic units by using a carry save coding. Table 1 shows the new representation of numbers. As shown in Table 1, this representation is composed of n positions (digits), with two bits in the first position and one bit in each of the other $n-1$ positions. A number X is represented as:
$X = \sum_{i=0}^{n-1} x_i 2^i + x'_0$
If $X \neq 0$, then $x'_0 = 1$ and the bits $x_{n-1} \ldots x_0$ hold $X - 1$, as in the diminished-one system. Also, there are two bits in the first position; therefore we have a carry save representation. So we call this system "Carry Save Diminished-one" (CSD-1).
The difference between the CSD-1 and diminished-one representations is that in CSD-1 the value of the represented number is exactly equal to its real value, whereas in diminished-one each number X is represented by $X' = X - 1$. As shown later, CSD-1 has an advantage over diminished-one: it leads to a single circuit for zero and non-zero operands. Therefore, the first step of the diminished-one addition (Algorithm 1) no longer exists with CSD-1. Another benefit is that CSD-1 is extendable to any other modulus, whereas diminished-one is only defined for modulo $2^n+1$.
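A minimal Python sketch of the encoding may help fix ideas. It assumes one reading of the representation: the value is the sum of the n weighted bits and the extra first-position bit $x'_0$, with $x'_0 = 1$ for every non-zero value. The helper names and the `(bits, x0_prime)` tuple layout are hypothetical.

```python
def csd1_encode(x, n):
    """Encode X in [0, 2^n] as (bits, x0_prime).

    Assumed reading: x0_prime = 1 for any non-zero X, and the n-bit
    field then holds X - 1; zero is encoded with all bits cleared.
    """
    if not 0 <= x <= (1 << n):
        raise ValueError("X must lie in [0, 2^n]")
    if x == 0:
        return (0, 0)
    return (x - 1, 1)

def csd1_value(bits, x0_prime):
    """The represented value is the plain sum of all weighted bits."""
    return bits + x0_prime

n = 4
print(csd1_encode(0, n))       # zero needs no separate flag bit
print(csd1_encode(1 << n, n))  # the maximum value sets every bit
```

Under this reading the n weighted bits alone carry the diminished-one code of a non-zero value, while the full bit set carries the real value.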
Step 1. The first step is based on the following theorem:
Theorem 1: Let A and B be two CSD-1 numbers in the range [0, $2^n+1$]. Then,
The maximal value of (3) is $2^n$. In CSD-1, this value can be represented by (n+1) bits in n positions. In other words, the output carry resulting from $|A+B-1|_{2^n} + 1$ is 0. Thus, the term (3) is transformed into:
The second case of equation (2) leads to the following inequalities (4):
Equation (4) outlines the impact of the output carry of (A+B-1). In the CSD-1 number system, this carry is produced when the sum is larger than $2^n$. The carry generation indicates that the sum is equal to or greater than the modulus. Let $C_{out}$ be the output carry of (A+B-1). The carry of (A+B-1) is generated when:
Thus, if the sum of the two numbers is greater than the modulus, the output carry of (A+B-1) is '1' and the sum is correct according to Theorem 1: there is no need to increment the result. The output carry is zero in the following cases:
In condition (*), since A+B is less than the modulus, the output carry of (A+B-1) is '0'. According to equation (4), the sum should be incremented in the second stage. Therefore, from (4), (5)
But in condition (**) of equation (6), when $A + B = 2^n + 1$, equation (7) leads to S = 1, which is not correct. To correct this case, we introduce step 2 of Algorithm 2, which will be presented later.
In our method, (A+B-1) is computed without any extra hardware, simply by ignoring $a'_0$ in the above sum. As mentioned earlier, if $A \neq 0$ then $a'_0 = 1$; thus (A+B-1) is obtained by eliminating $a'_0$. If $A = 0$, then removing $a'_0$ yields A+B. In this case, we always have $c_{out} = 0$ and the sum would be incremented according to equation (7), although the increment must not be applied if the result is to be correct. The first step of Algorithm 2 shows that a two-stage combinational circuit (adder and incrementer) is required for modulo addition.
The maximal value of M is $2^{n+1}$, which can be represented by n+2 bits or n+1 digits in CSD-1 (for this maximal value, all bits are '1'). In the second stage, the n least significant posibits of M are incremented according to (8).
Step 2. According to Theorem 2 and equation (8), if the msb of M is $m_n = 0$ ($C_{out} = 0$), then M should be incremented in the second stage. Thus the final output is $2^n+1$. In CSD-1, each number is in the range [0, $2^n$] and can be represented by n digits. Therefore the output carry can be ignored and the output sum is "0…01", which can be corrected by inverting $s'_0$.
In the second step of Algorithm 2, we introduce two methods to detect a zero output and to correct it.
a) The correct output zero occurs when the two inputs are complementary, i.e., their sum is equal to the modulus $2^n+1$. One method of recognizing complementary numbers is the logical AND of the outputs of $a_i$ XOR $b_i$ (for any i except i = 0). A similar method has been mentioned in [23].
b) Another method is based on the following theorem.
The output carry of the incrementer is '1' when the sum is equal to or greater than $2^n+1$. That is:
Equations (10) and (11) are simultaneously satisfied when $A + B = 2^n + 1$, which shows that A and B are complementary.
Method (a) has been used in [23]. However, method (b) for zero detection and correction consumes less area than method (a). Therefore, we implemented method (b). As described earlier and according to Example 1, $s'_0$ can be set to '0' when a zero result is detected.
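The two-stage behaviour described for Algorithm 2 can be modelled at the value level with the Python sketch below. It is not the gate-level circuit: the function name is hypothetical, operands are plain integers in [0, $2^n$], and the suppression of the increment for A = 0 is modelled as a simple early return because the text does not say how the hardware achieves it.

```python
def csd1_mod_add(a, b, n):
    """Value-level model of two-stage modulo 2^n + 1 addition in CSD-1."""
    cap = 1 << n                 # largest value held by n CSD-1 digits
    modulus = cap + 1
    if not (0 <= a <= cap and 0 <= b <= cap):
        raise ValueError("operands must lie in [0, 2^n]")
    if a == 0:
        # a'_0 is already 0, so nothing is dropped and the sum is B;
        # the increment must not be applied (modelled as a return).
        return b
    # Stage 1: dropping a'_0 = 1 yields A + B - 1.
    t = a + b - 1
    if t > cap:                  # output carry = 1: result already correct
        return t - cap           # discard the carry of weight 2^n
    # Stage 2: no output carry, so increment.
    s = t + 1
    if s == modulus:             # complementary inputs: correct to zero
        return 0
    return s

n = 4
print(all(csd1_mod_add(a, b, n) == (a + b) % 17
          for a in range(17) for b in range(17)))
```

The single `s == modulus` branch stands in for the zero detection and correction of step 2; either detection method (a) or (b) would drive the same correction.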

THE PROPOSED CSD-1 PARALLEL-PREFIX ADDER (CSD-PP)
One way of implementing the CSD-PP adder is based on the adder architecture of Fig. 2. However, instead of having a dedicated single stage for re-entering the carry, [23] has proposed performing carry recirculation at each existing prefix level, so that there is no need for the extra carry increment stage. As a result, a dedicated CSD-PP adder architecture is derived with one less prefix level than those derived from the architecture of Fig. 2. In the CSD-1 system, this approach requires several modifications, which are introduced by the following three theorems.
group a, a-1, a-2, ..., b-1
Theorem 5 will derive expressions leading to faster circuits.
In several cases, equations (12) require more than $\log_2 n$ prefix levels for their implementation. These equations can be transformed into equivalent ones that can be implemented within $\log_2 n$ prefix levels. The required transformation uses Theorem 2 of [23], as well as Theorem 6, which is introduced below. Theorem 2 of [23] says that,
The above formula is true when gpp ⋅=.
The following theorem is also required to derive the term that has the form ( ) ( )
Proof: First, we prove the following expression:
Using this formula, we get:
The carry equations resulting from Theorem 2 of [23] and Theorem 6 can be implemented by a prefix structure that has $\log_2 n$ levels. As mentioned earlier, we use the modifications introduced by Theorems 4 to 6. Our proposed adder is similar to the modulo adder architecture of [23], but the first cells of its preprocessing and postprocessing stages are designed differently.
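As background for the $\log_2 n$-level prefix structure mentioned above, the following Python sketch computes the carries of an ordinary binary adder with a Kogge-Stone network. It is a generic illustration only: it contains neither the carry recirculation of [23] nor the modified CSD-1 cells, and the function name is hypothetical.

```python
def kogge_stone_add(a, b, n, c_in=0):
    """Add two n-bit integers using a Kogge-Stone prefix carry network.

    Returns (sum_bits, carry_out, prefix_levels).
    """
    # Preprocessing: bit-level generate and propagate signals.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    G, P = g[:], p[:]
    levels = 0
    d = 1
    while d < n:
        nG, nP = G[:], P[:]
        for i in range(d, n):
            # Prefix operator: (G, P) o (G', P') = (G | P & G', P & P')
            nG[i] = G[i] | (P[i] & G[i - d])
            nP[i] = P[i] & P[i - d]
        G, P = nG, nP
        d *= 2
        levels += 1
    # carries[i] is the carry into bit i; carries[n] is the carry out.
    carries = [c_in] + [G[i] | (P[i] & c_in) for i in range(n)]
    # Postprocessing: sum bit = propagate XOR incoming carry.
    s = 0
    for i in range(n):
        s |= (p[i] ^ carries[i]) << i
    return s, carries[n], levels

print(kogge_stone_add(200, 100, 8))
```

For n = 8 the loop runs three times, matching $\log_2 8 = 3$ prefix levels.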
In the CSD-1 number system, if This is a special property of CSD-1. Using this property to simplify truth tables of these two cells leads to the following equations:

RESULTS AND COMPARISONS
In this section, we compare the proposed CSD-PP adder to those proposed in [15] and [23]. As previously mentioned, the architecture proposed in [23] outperforms those presented in [24] and [25], and the architecture proposed in [15] outperforms those presented in [26,27,28] in terms of implementation area and execution delay. Thus, the architecture of [23] is the best reported diminished-one architecture, and the architecture of [15] is the best reported architecture using the normal binary representation. All architectures were described in HSPICE and mapped to a 0.18 µm implementation technology (Vdd = 1.8 V). We use VLSI implementations and a simple model to compare the proposed adder architecture to those proposed in [15] and [23]. We use the notation PPREF for the diminished-one modulo $2^8+1$ adder proposed in [23] and TPP for the normal binary one in [15]. The CSD-PP implementation of the modulo $2^n+1$ adder is given in Fig. 2.
Analytical Comparisons and Results: First, we use the analytical model used in [15] and [23], known as the "unit-gate model". This model assumes that each gate, except the exclusive-OR gate, counts as one elementary gate for both area and delay. An exclusive-OR gate counts as two elementary gates for both area and delay. According to this model, the latencies of the modulo $2^n$ and modulo $2^n-1$ adders are equal to $2\log_2 n + 3$. The PPREF modulo adder has an execution latency of $2\log_2 n + 3$.
However, according to Fig. 1, the overall delay of PPREF is the modulo adder latency plus the multiplexer delay. The multiplexer is a 2-level circuit in the unit-gate model, so the overall delay is $2\log_2 n + 5$. The TPP adder has a latency equal to $2\log_2 n + 6$ and the proposed CSD-PP adder has a latency equal to $2\log_2 n + 4$. The CSD-PP architecture is therefore faster than PPREF and TPP, and offers the fastest designs reported in the open literature. The CSD-PP adder also has the same number of prefix levels as the PPREF adder, without requiring any of the circuits for treating zero operands shown in Fig. 1, which reduces both the execution time and the implementation area. Therefore, the proposed CSD-PP adders are more efficient than the fastest modulo $2^n+1$ adders that handle operands in the diminished-one representation. The normal binary system can be easily converted to the normal binary RNS. The representation of odd numbers in CSD-PP adders is the same as in TPP adders.
According to the unit-gate model, the hardware overheads of the fastest reported modulo $2^n$ and modulo $2^n-1$ adders are equal to $1.5n\log_2 n + 5n$ and $3n\log_2 n + 5n$, respectively. The PPREF modulo adder has an area of $4.5n\log_2 n + 0.5n + 6$. However, according to Fig. 1, the final area of PPREF includes the modulo adder area and the area of the circuit for the treatment of zero operands. The zero operand circuit area is $2n + 5$. Thus, the final area is $4.5n\log_2 n + 2.5n + 11$. The area of the TPP adder is equal to $4.5n\log_2 n + 3.5n + 13$ and the area of the proposed CSD-PP adder is equal to $4.5n\log_2 n + 0.5n + 15$.
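To make the analytical comparison concrete, the short Python sketch below evaluates the unit-gate delay and area expressions quoted in this section for a few operand widths. The function name and dictionary layout are illustrative only, and the numbers are model estimates in elementary-gate units, not measured results.

```python
from math import log2

def unit_gate_estimates(n):
    """Evaluate the unit-gate delay and area expressions for width n."""
    lg = log2(n)
    return {
        # PPREF includes the multiplexer delay and zero-operand circuit.
        "PPREF":  {"delay": 2 * lg + 5, "area": 4.5 * n * lg + 2.5 * n + 11},
        "TPP":    {"delay": 2 * lg + 6, "area": 4.5 * n * lg + 3.5 * n + 13},
        "CSD-PP": {"delay": 2 * lg + 4, "area": 4.5 * n * lg + 0.5 * n + 15},
    }

for n in (4, 8, 16, 32):
    est = unit_gate_estimates(n)
    row = ", ".join(f"{k}: d={v['delay']:.0f} a={v['area']:.0f}"
                    for k, v in est.items())
    print(f"n={n:2d}  {row}")
```

In these expressions the delay gap between the designs stays constant, while the area gap between CSD-PP and the other two grows linearly with n.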

Real Comparisons and Results:
To evaluate the speed, area and power consumption of each architecture, every adder was implemented in CMOS technology. The results obtained are listed in Table 2. As can be seen, the proposed architecture leads to faster implementations than those of [15] and [23]. This is because the architecture of [15] incurs the delay of one CSA unit and the PPREF design of [23] uses multiplexers to treat zero operands. The proposed architecture, on the other hand, relies on a 2-operand addition (unlike TPP, which adds the two inputs and $2^n-1$) and, thanks to the CSD-1 number system, uses a single circuit for zero and non-zero operands. Finally, we study the power consumption of the compared architectures. The simulation results are shown in Table 2. The proposed CSD-PP adder has the lowest power consumption of all. It reduces power consumption by more than 23% and 26% relative to TPP and PPREF, respectively.

CONCLUSIONS
In this paper, a new number system has been presented. This paper also presents a new architecture for modulo $2^n+1$ adders that uses parallel-prefix carry computation units based on this number system. The proposed architecture has better performance than conventional modulo $2^n+1$ adders. The main points of the paper are summarized below:
1. The special treatment required for zero operands in the diminished-one number system has been removed.
2. The proposed architecture removes the 3-operand adder required by the fastest modulo $2^n+1$ adders that use the normal binary system.
3. The proposed architecture leads to the fastest reported modulo $2^n+1$ adders, with execution latencies close to those of the fastest modulo $2^n$ and modulo $2^n-1$ adders, which means that the proposed architecture is suitable for RNS applications.