Efficient Reversible Montgomery Multiplier and Its Application to Hardware Cryptography

Problem Statement: The Arithmetic Logic Unit (ALU) of a crypto-processor or microchip leaks information through its power consumption. Although cryptographic protocols are secured against mathematical attacks, attackers can break the encryption by measuring the energy consumption. Approach: To thwart such attacks, this study proposed the use of reversible logic for designing the ALU of a crypto-processor. Ideally, reversible circuits do not dissipate any energy. If reversible circuits are used, the attacker would not be able to analyze the power consumption. In order to design the reversible ALU of a crypto-processor, a reversible Carry Save Adder (CSA) using Modified TSG (MTSG) gates and an architecture for the Montgomery multiplier were proposed. For the reversible implementation of the Montgomery multiplier, efficient reversible multiplexers and sequential circuits such as reversible registers and shift registers were presented. Results: This study showed that the modified designs perform better than the existing ones in terms of number of gates, number of garbage outputs and quantum cost. Lower bounds of the proposed designs were established by providing relevant theorems and lemmas. Conclusion: Reversible circuits are suitable for application in the field of hardware cryptography.


INTRODUCTION
Power analysis is a physical attack on cryptosystems such as smart cards, tamper-proof "black boxes" and microchips. It exploits the fact that the power dissipation of an electronic circuit depends on the actions performed in it. Kocher et al. [1] describe Simple Power Analysis (SPA) and Differential Power Analysis (DPA) attacks, which use the power-dissipation characteristics as a source of side-channel information. Using DPA, an attacker can extract information on secret keys by statistically analyzing power consumption measurements from multiple cryptographic operations performed by a crypto-processor.
DPA is more difficult to prevent, since even small biases in the power consumption can lead to exploitable weaknesses. In this study, the authors propose the use of reversible logic to protect crypto-systems from power analysis attacks. According to Landauer [2,3], in logic computation every bit of information loss generates $kT\ln 2$ joules of heat energy, where $k$ is Boltzmann's constant ($1.38\times10^{-23}$ J/K) and $T$ is the absolute temperature of the environment. At room temperature the dissipated heat is around $2.9\times10^{-21}$ J. Energy loss at the Landauer limit is also important because the heat generated by information loss is likely to become noticeable in the future. Reversible circuits are fundamentally different from traditional irreversible ones. In reversible logic, no information is lost; that is, a circuit that does not lose information is reversible. Bennett [4] showed that zero energy dissipation would be possible if the network consists of reversible gates only. Thus the proposed reversible hardware is intended to prevent power analysis attacks, since ideally no energy is dissipated by reversible circuits.
Modular multiplication is the most common operation in cryptosystems such as RSA, Elliptic Curve Cryptography (ECC), Digital Signature Algorithm (DSA) and Diffie-Hellman key exchange. It is also critical to the computing efficiency of the cryptosystem, which involves modular multiplications with large integers to enhance its security [7]. The most popular method for fast implementation of modular multiplication is Montgomery's algorithm [8]. To avoid long carry propagation during the addition stages of the computation, several techniques such as systolic arrays and Carry Save Adder (CSA) architectures are found in the literature [6,7]. This study focuses on the reversible CSA architecture implementation of the Montgomery multiplier.
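As a quick numerical check of the Landauer figure quoted in this introduction, the bound $kT\ln 2$ can be evaluated directly. The sketch below is illustrative only; the temperature of 300 K is an assumption, since the text says only "room temperature", and the function name is hypothetical.

```python
import math

# Boltzmann's constant as quoted in the text (J/K)
K_BOLTZMANN = 1.38e-23
# Assumed room temperature in kelvin (not specified in the text)
T_ROOM = 300.0

def landauer_limit(temperature_k: float) -> float:
    """Minimum heat in joules dissipated per bit of information lost: k*T*ln(2)."""
    return K_BOLTZMANN * temperature_k * math.log(2)

energy = landauer_limit(T_ROOM)
print(f"{energy:.2e} J per lost bit")
```

Under this assumption the result is a little under $2.9\times10^{-21}$ J, in line with the value stated above.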

MATERIALS AND METHODS
Reversible gate: Reversible Gates are circuits in which the number of outputs is equal to the number of inputs and there is a one to one correspondence between the vector of inputs and outputs [9] .
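Let the input vector be $I_v$ and the output vector be $O_v$, defined as $I_v = (I_i, I_{i+1}, I_{i+2}, \dots, I_{k-1}, I_k)$ and $O_v = (O_i, O_{i+1}, O_{i+2}, \dots, O_{k-1}, O_k)$, so that each input vector corresponds to exactly one output vector and vice versa.
Garbage output: An unwanted or unused output of a reversible gate (or circuit) is known as a garbage output.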
The Feynman gate (FG) [19] performs the Exclusive-OR of two inputs. In doing so it also produces one extra output, which is the garbage output marked with * in Fig. 1.
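The one-to-one correspondence that defines a reversible gate can be checked mechanically from a truth table. The following minimal sketch uses the standard Feynman mapping (A, B) -> (A, A XOR B); the helper names `is_reversible` and `irreversible_xor` are hypothetical and introduced only for illustration.

```python
from itertools import product

def feynman(a: int, b: int) -> tuple:
    """Feynman (controlled-NOT) gate: (A, B) -> (A, A XOR B)."""
    return a, a ^ b

def irreversible_xor(a: int, b: int) -> tuple:
    """Ordinary XOR: two inputs but one output, so information is lost."""
    return (a ^ b,)

def is_reversible(gate, n_inputs: int) -> bool:
    """A gate is reversible if distinct input vectors give distinct output vectors."""
    outputs = {gate(*bits) for bits in product((0, 1), repeat=n_inputs)}
    return len(outputs) == 2 ** n_inputs

print(is_reversible(feynman, 2))           # the Feynman gate is a bijection
print(is_reversible(irreversible_xor, 2))  # plain XOR is not
```

When only the XOR result is needed, the first Feynman output (a copy of A) is the garbage output.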
Some major reversible gates required for this study are the Fredkin gate (FRG) [10], Peres gate [11], TSG gate [12], modified TSG (MTSG) gate [13] and HNFG gate [20], which are shown in Fig. 2-6, respectively.
Quantum cost: The cost of every reversible gate can be expressed as a quantum cost, and hence reversible circuits can be compared in terms of quantum cost. The quantum cost of every 2×2 gate is the same and equals unity [5,15]. According to [5], a 1×1 gate costs nothing, and every quantum gate can be realized from 1×1 and 2×2 gates, so its cost is calculated as the total number of 2×2 gates used.
Reversible logic in hardware cryptography: The main source of power consumption in hardware cryptography is the ALU of a crypto-processor. It consists of CSA, multipliers, registers, shift registers, accumulators and multiplexers. Therefore, the ALU of a crypto-processor can be designed using reversible logic so it will not dissipate any heat. Each component of the crypto-processor is described here.
Proposed reversible four-to-two CSA: The TSG gate is widely used to construct full adder circuits, but it is complex and its quantum cost is high (13). The MTSG gate is better suited to realizing a full adder because its quantum cost is only 6. The realization of a full adder using MTSG is shown in Fig. 8, and Fig. 9 shows the proposed four-to-two reversible CSA built from MTSG gates. Table 1 shows that the proposed reversible CSA has a lower quantum cost than the existing one found in the literature [16].
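The idea of a four-to-two CSA built from two layers of MTSG full adders can be sketched in software. The MTSG mapping used below, (A, B, C, D) -> (A, A⊕B, A⊕B⊕C, (A⊕B)C⊕AB⊕D), is an assumption taken from the commonly cited definition and is not spelled out in this text; the function names and the word-level wiring are likewise illustrative and do not reproduce Fig. 8 or Fig. 9.

```python
def mtsg(a, b, c, d):
    """Assumed MTSG mapping: (A, B, C, D) -> (A, A^B, A^B^C, (A^B)C ^ AB ^ D)."""
    return a, a ^ b, a ^ b ^ c, ((a ^ b) & c) ^ (a & b) ^ d

def full_adder(a, b, cin):
    """One MTSG gate with D = 0: the third output is the sum, the fourth the carry."""
    _, _, s, cout = mtsg(a, b, cin, 0)  # the first two outputs are garbage
    return s, cout

def csa_3_to_2(x, y, z, n):
    """One layer of n full adders: x + y + z == s + carry, with no carry propagation."""
    s = carry = 0
    for i in range(n):
        si, ci = full_adder((x >> i) & 1, (y >> i) & 1, (z >> i) & 1)
        s |= si << i
        carry |= ci << (i + 1)  # a carry out of bit i has weight 2^(i+1)
    return s, carry

def csa_4_to_2(w, x, y, z, n):
    """Two cascaded layers (two MTSG gates per bit): w + x + y + z == s + carry."""
    s1, c1 = csa_3_to_2(w, x, y, n)
    return csa_3_to_2(s1, c1, z, n + 1)  # one extra position for the shifted carry

s, c = csa_4_to_2(5, 7, 3, 6, 3)
print(s + c)  # equals 5 + 7 + 3 + 6
```

With two gates per bit position, the costs of 6 and 13 quoted above give 6×2 = 12 for MTSG against 13×2 = 26 for TSG.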
Proposed reversible register: Figure 10 shows the implementation of reversible clocked D flip-flop [16,17] . The reversible D flip-flops can be used to implement a reversible register. The proposed n-bit reversible register is shown in Fig. 11.
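The behavior of a clocked D flip-flop built from one Fredkin gate and one Feynman gate can be sketched as follows. This is one plausible wiring consistent with the gate counts given in the text, not a reproduction of Fig. 10; the function names and the choice of a constant-0 input for the copy are assumptions.

```python
def fredkin(c, a, b):
    """Fredkin (controlled-swap) gate: swaps a and b when c = 1."""
    return (c, b, a) if c else (c, a, b)

def feynman(a, b):
    """Feynman gate: (A, B) -> (A, A XOR B); with B = 0 it copies A."""
    return a, a ^ b

def d_flip_flop(clk, d, q):
    """One clock evaluation of the flip-flop.

    Returns (fed-back state, output copy, garbage outputs).
    """
    clk_out, q_next, unused = fredkin(clk, q, d)  # q_next = d if clk else q
    q_feedback, q_out = feynman(q_next, 0)        # constant 0 input gives a copy
    return q_feedback, q_out, (clk_out, unused)

state, out, garbage = d_flip_flop(clk=1, d=1, q=0)
print(state, out, len(garbage))
```

In this wiring the two garbage outputs are the CLK pass-through and the unselected Fredkin output, so chaining CLK into the next flip-flop removes one of them.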
Theorem 1: An n-bit reversible register can be realized by at least 2n gates and n+1 garbage outputs.
Proof: An n-bit reversible register is designed using n reversible clocked D flip-flops. From [16,17], each reversible D flip-flop contains one Fredkin gate and one Feynman gate, a total of two gates, and produces two garbage outputs. In the reversible register, the CLK output of a Fredkin gate is connected to the CLK input of the Fredkin gate of the next D flip-flop. Thus, the reversible register removes one garbage output from each D flip-flop except the last one, leaving (n-1)+2 = n+1 garbage outputs. Therefore, an n-bit reversible register can be realized by at least 2n gates and n+1 garbage outputs.

Table 1: Quantum cost comparison of reversible CSAs
Circuit                        Quantum cost
Existing circuit [16]          13×2 = 26
Proposed circuit using MTSG    6×2 = 12

Note: the design in [16] contains multiple fan-outs, which are forbidden in the strict reversible sense.

Lemma 2: The quantum cost of an n-bit reversible register is at least 6n.
Proof: From [5,15] , quantum costs of Feynman gate and Fredkin gate are one and five respectively. Since from [16,17] , each reversible D flip-flop contains one Fredkin gate and one Feynman gate, the quantum cost of D flip-flop is 1+5 = 6. Since there are n D flip-flops, the quantum cost of an n-bit reversible register is 6n.
Comparative results of different reversible n-bit registers are shown in Table 2.
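The counts in Theorem 1 and Lemma 2 can be reproduced by a simple tally. The sketch below only restates the bookkeeping of the proofs; the function name and dictionary layout are hypothetical.

```python
# Quantum costs of the two gates, as stated in the text [5,15]
QUANTUM_COST = {"fredkin": 5, "feynman": 1}

def register_costs(n: int) -> dict:
    """Tally gates, garbage outputs and quantum cost of an n-bit reversible register."""
    gates = ["fredkin", "feynman"] * n  # one of each per D flip-flop
    # Two garbage outputs per flip-flop, minus one for every chained CLK line
    garbage = 2 * n - (n - 1)
    return {
        "gates": len(gates),
        "garbage": garbage,
        "quantum_cost": sum(QUANTUM_COST[g] for g in gates),
    }

print(register_costs(8))
```

For n = 8 this gives 16 gates, 9 garbage outputs and a quantum cost of 48, matching 2n, n+1 and 6n.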
Proposed reversible shift register: The design of the reversible master-slave D flip-flop [17] and its block diagram are shown in Fig. 12 and 13, respectively.
Theorem 3: An n-bit reversible Serial-In, Serial-Out (SISO) shift register can be realized by at least 6n-1 gates and 2n+1 garbage outputs.
Proof: An n-bit reversible SISO shift register can be designed using n reversible master-slave D flip-flops. From [17], each reversible master-slave D flip-flop contains two Fredkin gates, two Feynman gates and one reversible NOT gate, a total of five gates, and produces three garbage outputs. In the reversible shift register, the CLK output of a D flip-flop (its 2nd Fredkin gate) is connected to a reversible NOT gate, and the inverted output is connected to the CLK input of the next D flip-flop (its 1st Fredkin gate). Therefore, the reversible shift register removes one garbage output from each D flip-flop except the last one; that is, it produces 2(n-1) garbage outputs for the first n-1 D flip-flops and 3 garbage outputs for the last D flip-flop. So, the total number of garbage outputs is 2(n-1) + 3 = 2n+1.
The n-bit reversible shift register requires 5n gates for the n master-slave D flip-flops and n-1 reversible NOT gates between them, a total of 6n-1 gates. Therefore, an n-bit reversible shift register can be realized by at least 6n-1 gates and 2n+1 garbage outputs.
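The gate and garbage counts of the shift register follow the same kind of bookkeeping. A minimal sketch, with a hypothetical function name, that restates the counting argument above:

```python
def siso_shift_register_counts(n: int) -> tuple:
    """Return (number of gates, number of garbage outputs) for n master-slave flip-flops."""
    per_flip_flop = ["fredkin", "fredkin", "feynman", "feynman", "not"]
    # n flip-flops plus one inverter on each CLK link between neighbours
    gates = per_flip_flop * n + ["not"] * (n - 1)
    # Two garbage outputs for each of the first n-1 flip-flops, three for the last
    garbage = 2 * (n - 1) + 3
    return len(gates), garbage

print(siso_shift_register_counts(8))
```

For n = 8 the tally is 47 gates and 17 garbage outputs, that is, 6n-1 and 2n+1.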

Lemma 4:
The quantum cost of an n-bit reversible SISO shift register using master-slave D flip-flops is at least 12n.
Proof: From [5,15], the quantum costs of the reversible NOT gate, Feynman gate and Fredkin gate are zero, one and five, respectively. From [17], each reversible master-slave D flip-flop contains two Fredkin gates, two Feynman gates and one reversible NOT gate. So, the quantum cost of a master-slave D flip-flop is 5×2+1×2+0×1 = 12. Since an n-bit reversible SISO shift register contains n master-slave D flip-flops, the quantum cost is 12n.

Table 3: Comparison of n-bit reversible shift registers
Circuit                  Number of gates   Garbage outputs   Quantum cost
Existing circuit [16]    6n                4n+1              14n
Proposed circuit         6n-1              2n+1              12n

The optimality of the proposed design can be seen from Table 3, which compares it with the existing reversible shift register. The proposed design is less costly than the existing one in terms of number of gates, garbage outputs and quantum cost.

Proposed reversible Parallel-In, Parallel-Out (PIPO) shift register using clocked D flip-flops: In a PIPO shift register, all data bits are loaded into the register at once with the next clock pulse. After the shift operation, all data bits appear on the parallel outputs immediately. The control inputs (HOLD, E) select the operation of the register according to the function entries in Table 4.
From Table 4, when both HOLD and E are low, the shift register performs the shift-right operation. When HOLD is low and E is high, the inputs I 1 , I 2 , …, I n are loaded in parallel into the register coincident with the next clock pulse. The outputs O 1 , O 2 , O 3 , …, O n are available in parallel from the Q output of the flip-flops. When HOLD is high, present value of flip-flop is applied to the D input of that flip-flop. In other words, the register is inactive when HOLD is high and the contents are stored indefinitely.
The characteristic function of $Q_i^+$ can be obtained from Table 4 as $Q_i^+ = HOLD \cdot Q_i + \overline{HOLD} \cdot (E \cdot I_i + \overline{E} \cdot Q_{i-1})$. For the first stage, $Q_{i-1}$ is the serial input (SI), and for the last stage, $Q_i$ is the serial output (SO). Thus, this function can be implemented by only two Fredkin gates, as shown in Fig. 15. From Fig. 15, it is clear that it generates 4 garbage outputs. As the quantum cost [15] of a Fredkin gate is 5, the implementation of the characteristic function of $Q_i^+$ has a quantum cost of 10. Figure 16 shows the basic cell of the proposed PIPO shift register using a clocked D flip-flop and its block diagram. By cascading n basic cells, the PIPO shift register can be implemented as shown in Fig. 17.
Thus, each basic cell requires 2+3 = 5 gates. Each of the first n-1 basic cells produces 2+1 = 3 garbage outputs and the last cell produces 4+2 = 6 garbage outputs. As an n-bit PIPO shift register has n basic cells, it requires 5n gates and produces (n-1)×3+6 = 3n+3 garbage outputs.
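The next-state behavior described for HOLD and E (hold when HOLD is high, parallel load when HOLD is low and E is high, shift right when both are low) can be realized with two cascaded controlled swaps. The sketch below is an assumed wiring with hypothetical function names, not a reproduction of Fig. 15.

```python
def fredkin(c, a, b):
    """Fredkin (controlled-swap) gate: swaps a and b when c = 1."""
    return (c, b, a) if c else (c, a, b)

def q_next(hold, e, i_in, q, q_prev):
    """Next state of one PIPO cell built from two Fredkin gates."""
    # First gate: choose the parallel input (E = 1) or the shifted bit (E = 0)
    _, load_or_shift, _ = fredkin(e, q_prev, i_in)
    # Second gate: keep the present state (HOLD = 1) or accept the new bit (HOLD = 0)
    _, nxt, _ = fredkin(hold, load_or_shift, q)
    return nxt

def pipo_step(hold, e, serial_in, inputs, state):
    """Apply one clock pulse to all cells; cell 0 takes the serial input."""
    prev = [serial_in] + state[:-1]
    return [q_next(hold, e, i, q, p) for i, q, p in zip(inputs, state, prev)]

state = [1, 0, 1, 1]
print(pipo_step(0, 0, 0, [0, 1, 0, 0], state))  # shift right
print(pipo_step(0, 1, 0, [0, 1, 0, 0], state))  # parallel load
print(pipo_step(1, 0, 0, [0, 1, 0, 0], state))  # hold
```

Each Fredkin gate passes its control line through and leaves one unselected output, which accounts for the four garbage outputs mentioned above when the cell is considered in isolation.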

Lemma 7:
The quantum cost of an n-bit reversible PIPO shift register using D flip-flops is at least 18n.
Proof: From Theorem 5, generating $Q_i^+$ requires a quantum cost of 10. From [5,15], the quantum costs of the Feynman gate and Fredkin gate are one and five, respectively. From Fig. 6, the quantum cost of the HNFG gate is two. The D flip-flop of each basic cell requires one Fredkin gate, one Feynman gate and one HNFG gate, a total quantum cost of 5+1+2 = 8. Thus, an n-bit reversible PIPO shift register requires a quantum cost of 10×n+8×n = 18n.
Proposed 2-input reversible multiplexer: As the Fredkin gate is a controlled swap gate, a multiplexer can be implemented using this gate. Figure 18 shows the proposed 2-input n-bit reversible multiplexer, where S is the select input and $A_1 A_2 A_3 \dots A_n$ and $B_1 B_2 B_3 \dots B_n$ are the two inputs. If S = 0, then $(Z_1 Z_2 Z_3 \dots Z_n) = (A_1 A_2 A_3 \dots A_n)$; if S = 1, then $(Z_1 Z_2 Z_3 \dots Z_n) = (B_1 B_2 B_3 \dots B_n)$.
Therefore, a 2-input n-bit reversible multiplexer requires n Fredkin gates and produces n garbage outputs.
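A 2-input n-bit multiplexer built from a chain of Fredkin gates can be sketched as below. The wiring, in which S is passed from one gate to the next, is an assumption consistent with the description and not a reproduction of Fig. 18; the function names are hypothetical.

```python
def fredkin(c, a, b):
    """Fredkin (controlled-swap) gate: swaps a and b when c = 1."""
    return (c, b, a) if c else (c, a, b)

def mux2(s, a_bits, b_bits):
    """Select a_bits when s = 0 and b_bits when s = 1, one Fredkin gate per bit."""
    z, garbage = [], []
    for a, b in zip(a_bits, b_bits):
        s, z_i, unused = fredkin(s, a, b)  # S is passed on to the next gate
        z.append(z_i)
        garbage.append(unused)             # the unselected bit is garbage
    return z, garbage

z, garbage = mux2(1, [1, 0, 1], [0, 0, 1])
print(z, len(garbage))
```

The sketch counts only the n unselected outputs as garbage, following the count of n stated in the text; the select line leaving the last gate is not counted here.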
Proposed architecture of reversible Montgomery modular multiplier: In hardware cryptosystems, the Montgomery multiplication algorithm [7,8] is used for modular multiplication. The modified algorithm presented in [7] is very efficient. The authors make this algorithm compatible with reversible circuits and improve the efficiency of the proposed architecture. Figure 19 shows the reversible implementation of the modular multiplier using the reversible components presented in this study. In Fig. 19, darker lines represent multiple bits whereas lighter lines represent a single bit. The AND-1 block performs the AND operation on $X_i$ and Y, which can be implemented using Peres gates. The AND-2 block performs the AND operation on $SUM_0$ and M. To get another copy of $SUM_0$, one Feynman gate can be used, but for simplicity this is not shown in Fig. 19. The COPY blocks are used to copy the signals to avoid fan-out problems. These can be implemented using Feynman gates.
The working principle of this proposed reversible architecture is described in Algorithm 1. After the execution, the SUM register stores the result $X \times Y \times 2^{-(n+2)} \bmod M$.

Algorithm 1 (Montgomery modular multiplication):
Inputs: X, Y, M with $0 \le X, Y < 2M$ and $2^{n-1} < M < 2^n$
Output: $X \times Y \times 2^{-(n+2)} \bmod M$
$X_i$: i-th bit of X
$SUM_0$: LSB of SUM
MMM (X, Y, M)
Step 1: Initialize SUM and CARRY to 0
Step 2: Set i := 0
Step 3: Set LC := 0
Step 4: Repeat Steps 5 to 9 while $i \le n+1$
Step 5: Perform carry save addition on SUM, CARRY and $X_i \times Y$ using reversible CSA-1
Step 6: The outputs (SUM, CARRY) from reversible CSA-1 are fed into reversible CSA-2. Add these outputs with $SUM_0 \times M$ using reversible CSA-2
Step 7: Store the results of the above step into the reversible registers (SUM and CARRY)
Step 8: Perform a right shift operation on SUM and CARRY by 1 bit using the reversible shift registers. The results (SUM and CARRY) of these operations are fed back into the inputs of CSA-1
Step 9: Set i := i + 1
Step 10: Set LC := 1
Step 11: Repeat Steps 12 and 13 while CARRY $\ne$ 0
Step 12: Perform carry save addition on SUM and CARRY using reversible CSA-1 and CSA-2
Step 13: Store the results of the above step into the reversible registers (SUM and CARRY)
Step 14: Return SUM = $X \times Y \times 2^{-(n+2)} \bmod M$
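The data flow of Algorithm 1 can be sketched at the word level in software. This is a behavioral sketch under stated assumptions, not the reversible circuit: Python integers stand in for the SUM and CARRY registers, M is assumed odd, and the function names `csa` and `mmm` are hypothetical. The value returned is congruent to $X \times Y \times 2^{-(n+2)}$ modulo M and lies below 2M; no final subtraction is performed.

```python
def csa(a, b, c):
    """Carry save addition: a + b + c == s + carry, with no carry propagation."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def mmm(x, y, m, n):
    """Montgomery modular multiplication following Steps 1 to 14 of Algorithm 1.

    Assumes odd m with 2**(n-1) < m < 2**n and 0 <= x, y < 2*m.
    """
    total, carry = 0, 0                           # Step 1
    for i in range(n + 2):                        # Steps 2, 4, 9: i = 0 .. n+1
        x_i = (x >> i) & 1
        total, carry = csa(total, carry, y if x_i else 0)   # Step 5 (CSA-1)
        sum_0 = total & 1                         # carry is even here, so this is the true LSB
        total, carry = csa(total, carry, m if sum_0 else 0)  # Step 6 (CSA-2)
        total >>= 1                               # Step 8: both words are even,
        carry >>= 1                               # so the shift is exact
    while carry != 0:                             # Steps 11 to 13: resolve the carries
        total, carry = csa(total, carry, 0)
    return total                                  # Step 14

n, m = 8, 239
r = mmm(200, 450, m, n)
print((r * 2 ** (n + 2)) % m == (200 * 450) % m)
```

The check at the end multiplies the result back by $2^{n+2}$ and compares it with $X \times Y$ modulo M, which is the defining property of the output.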

RESULTS AND DISCUSSION
Evaluation of the proposed architecture of reversible Montgomery multiplier: The advantages of the proposed reversible architecture over the existing reversible one [16] are as follows:
• The proposed architecture does not require any carry propagation logic (reversible ripple carry adder or carry look-ahead adder). Thus, fast summation is possible.
• As it reuses the reversible CSA architecture to perform the full addition, it reduces the area required for the circuit.
• The architecture is simple and efficient, as it requires only two reversible CSAs.
• Unlike [16], this multiplier does not need to perform any subtraction operation.
Theorem 8: The proposed n-bit reversible Montgomery multiplier can be realized by at least 22n+25 gates and 16n+28 garbage outputs with quantum cost of 80n+91.
Proof: Since $2^{n-1} < M < 2^n$, M is n bits long, and since $0 \le X, Y < 2M$, we assume that X and Y are n+1 bits long.
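To give a sense of scale, the bounds stated in Theorem 8 can be evaluated for a concrete operand size. The choice of n = 1024 below is an assumption made only for illustration, and the function name is hypothetical.

```python
def montgomery_multiplier_bounds(n: int) -> dict:
    """Evaluate the Theorem 8 bounds for an n-bit reversible Montgomery multiplier."""
    return {
        "gates": 22 * n + 25,
        "garbage": 16 * n + 28,
        "quantum_cost": 80 * n + 91,
    }

print(montgomery_multiplier_bounds(1024))
```

For n = 1024 this gives 22553 gates, 16412 garbage outputs and a quantum cost of 82011, so all three measures grow linearly with the operand width.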

Applications of reversible Montgomery multiplier:
Montgomery multiplication is used in modular arithmetic as an efficient way of performing modular exponentiation, that is, computing $A^B \bmod N$ for a large modulus N. Algorithm 2 [18] demonstrates the computation of $A^B \bmod N$ used in RSA encryption and decryption functions.
Step 7: Set i := i - 1
Step 8: Set R := MMM(R, 1, N)
Step 9: Return R
Algorithm 2 demonstrates the use of the reversible architecture of the Montgomery multiplier. Therefore, this reversible architecture can be used in RSA, DSA, Diffie-Hellman key exchange and ECC crypto-processors to thwart DPA attacks.
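How a Montgomery multiplier is used for modular exponentiation can be sketched with a standard left-to-right square-and-multiply loop. This is not claimed to be Algorithm 2 of [18]; it is a generic sketch that shares its final conversion step R := MMM(R, 1, N). The names `csa`, `mmm` and `mod_exp` are hypothetical, N is assumed odd and greater than 1, and the constant $2^{2(n+2)} \bmod N$ is assumed to be precomputed by ordinary means.

```python
def csa(a, b, c):
    """Carry save addition: a + b + c == s + carry."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def mmm(x, y, m, n):
    """Word-level Montgomery product: congruent to x*y*2^-(n+2) mod m, below 2*m."""
    total, carry = 0, 0
    for i in range(n + 2):
        total, carry = csa(total, carry, y if (x >> i) & 1 else 0)
        total, carry = csa(total, carry, m if total & 1 else 0)
        total, carry = total >> 1, carry >> 1
    while carry != 0:
        total, carry = csa(total, carry, 0)
    return total

def mod_exp(a, b, n_mod):
    """Compute a**b mod n_mod using only Montgomery products in the main loop."""
    n = n_mod.bit_length()             # 2^(n-1) < N < 2^n for odd N > 1
    r2 = pow(2, 2 * (n + 2), n_mod)    # assumed precomputed: (2^(n+2))^2 mod N
    a_m = mmm(a % n_mod, r2, n_mod, n)  # A in Montgomery form
    r = mmm(1, r2, n_mod, n)            # 1 in Montgomery form
    for i in reversed(range(b.bit_length())):
        r = mmm(r, r, n_mod, n)         # square
        if (b >> i) & 1:
            r = mmm(r, a_m, n_mod, n)   # multiply
    r = mmm(r, 1, n_mod, n)             # leave Montgomery form, as in Step 8
    return r % n_mod                    # final reduction done in software here

print(mod_exp(7, 13, 239) == pow(7, 13, 239))
```

The final `% n_mod` is a software convenience of this sketch; the intermediate values stay below 2N throughout, which is why no subtraction is needed inside the loop.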

CONCLUSION
In this study, reversible logic syntheses were carried out for the primitive components of the ALU of a reversible crypto-processor. The proposed reversible design of the Montgomery multiplier requires less hardware and less area, and it is faster and more cost-effective than the existing one. The proposed reversible sequential designs were found to perform better than the existing ones in terms of number of gates needed, number of garbage outputs produced and quantum cost required. Protecting a cryptosystem against power analysis attacks such as DPA is thus an application of reversible logic to hardware cryptography.