A Bit-Serial Multiplier Architecture for Finite Fields Over Galois Fields



INTRODUCTION
Public-key cryptography and symmetric-key cryptography are the two main categories of cryptography. Well-known public-key algorithms are RSA (Rivest et al., 1978), ElGamal and Elliptic Curve Cryptography (ECC). At present, only three mathematical problems underlying public-key cryptosystems are considered both secure and efficient. Table 1 shows these problems and the cryptosystems that rely on them. Table 2 shows the computational complexity of each of these problems, where 'n' is the length of the keys used (Sandoval, 2008; Kumar, 2008).
Compared with RSA, ECC provides an equivalent level of security with a smaller key size. ECC can be implemented very efficiently, with lower power consumption and faster computation. It also needs less memory and bandwidth because of its shorter keys (De Dormale et al., 2004; Li et al., 2008). Such attributes are particularly attractive in security applications in which computing power and integrated circuit space are limited. Modular arithmetic plays a central role in public-key cryptographic (PKC) systems (De Dormale et al., 2004). Examples of such systems are the Diffie-Hellman key exchange algorithm (Kaabneh and Al-Bdour, 2005), the decipherment operation in the RSA algorithm (Quisquater and Couvreur, 1982), the US Government Digital Signature Standard (Kammer and Daley, 2000) and elliptic curve cryptography (Koblitz, 1987).

Table 1: Mathematical problems and the cryptosystems that rely on them
Integer Factorization Problem (IFP): Given an integer 'n', find its prime factorization. Cryptosystem: RSA
Discrete Logarithm Problem (DLP): Given integers 'g' and 'h', find 'x' such that $h = g^x \bmod n$. Cryptosystems: ElGamal, DSA, Diffie-Hellman (DH)
Elliptic Curve Discrete Logarithm Problem (ECDLP): Given points 'P' and 'Q' on the curve, find 'x' such that $Q = xP$. Cryptosystems: ECDSA, EC-Diffie-Hellman (DH)
Arithmetic on elliptic curves requires a number of modules to calculate ECC operations (modular multiplication, modular division and modular addition/subtraction) (Al-Somani et al., 2006). Modular division is one of the most critical operations, as it is expensive and computationally intensive. Many implementations use projective coordinates to represent the points on the curve, which reduces the number of inversions/divisions to one. However, a final division is still needed to convert the projective coordinates into affine coordinates. In some other cases, modular division can be replaced by modular inversion followed by modular multiplication.
In this field, modular multiplication has received much attention and numerous algorithms have been published. Modular inversion can be performed using Fermat's little theorem, the well-known extended Euclidean algorithm or the Montgomery inverse (Kaliski, 1995; Savas and Koc, 2000).
In this research, the double-and-add method is used for scalar multiplication, as our algorithms rely on it. A Galois field is a field that consists of a finite number of elements. It supports three operations: modular addition, modular multiplication and modular division. Modular division in the Galois field is replaced by modular inversion followed by modular multiplication. The Montgomery modular inversion method is chosen as the inversion algorithm and the right-to-left shift method as the multiplication algorithm.
Most hardware implementations of ECC are bit-parallel, but in this study a bit-serial architecture is used. Bit-serial operators are noticeably smaller than their bit-parallel counterparts, and they do not depend on word width. A multiplier is said to be bit-serial if it produces only one bit of the product at each clock cycle. Moreover, bit-serial architectures require only a small number of input and output pins. A bit-serial multiplier has to be fitted directly to the data width. As a result, the area complexity is reduced to $O(n)$, compared with $O(n^2)$ for a parallel multiplier.
Bit-serial design makes it compulsory to operate with particular registers. These registers are able to store one bit for the period of a clock cycle. After this cycle the information is passed on to the output and the next information can enter the register. These registers offer the core functionality of shifting numbers.
Mathematics of elliptic curve cryptography: There are many ways to calculate the points on an elliptic curve over a prime field. A direct method is to apply the equation $y^2 = x^3 + ax + b$, where $4a^3 + 27b^2 \neq 0$. Different elliptic curves are produced by changing the values of 'a' and 'b'. In elliptic curve cryptography, the public key is calculated by multiplying the private key by the generator point 'G', which is a point on the curve. The private key is a random number in the interval [1, n-1], where 'n' is the curve's order (Anoop, 2007).
The strength of ECC security comes from the difficulty of the Elliptic Curve Discrete Logarithm Problem: if 'P' and 'Q' are points on the curve with kP = Q, where 'k' is a scalar, it is hard to recover 'k' from 'P' and 'Q'. Thus, point multiplication, i.e. the multiplication of a scalar 'k' with a point 'P' on the curve to obtain another point 'Q' on the curve, is the basic operation in ECC.
Point multiplication: Scalar point multiplication is the building block of all elliptic curve cryptosystems. It is an operation of the form kP, where 'P' is a point on the elliptic curve and 'k' is a positive integer. Computing kP means adding the point 'P' to itself k-1 times, which results in another point 'Q' on the elliptic curve. Point multiplication uses two basic elliptic curve operations:
• Point addition (adding two points to find another point)
• Point doubling (adding a point to itself to find another point)
For example, to calculate kP = Q with k = 23, kP = 23P = 2(2(2(2P) + P) + P) + P, so point addition and point doubling are used repeatedly to obtain the result (Anoop, 2007).
The line through two points 'J' and 'K' intersects the curve at a third point -L. The result of adding 'J' to 'K' is the point 'L', which is the reflection of -L with respect to the x-axis (Anoop, 2007).
If K = -J, the line through 'J' and 'K' intersects the elliptic curve at the point at infinity 'O', because J + (-J) = O, as shown in Fig. 1b.
To analyze point addition, let $J = (x_j, y_j)$, $K = (x_k, y_k)$ and $L = J + K$, where $L = (x_l, y_l)$ and 's' is the slope of the line through 'J' and 'K'. Then:
$s = (y_j - y_k)/(x_j - x_k)$
$x_l = s^2 - x_j - x_k$
$y_l = -y_j + s(x_j - x_l)$
According to Fig. 2a, 'J' is a point on the EC and, to get 'L', which is equal to 2J, the tangent line at 'J' intersects the EC at exactly one further point -L, provided that the y-coordinate of 'J' is not equal to zero. The result of doubling is the point 'L', the reflection of the point -L with respect to the x-axis (Anoop, 2007). To analyze point doubling, let $J = (x_j, y_j)$ with $y_j \neq 0$ and $L = 2J$, where $L = (x_l, y_l)$ and 's' is the slope of the tangent at point 'J'. Then:
$s = (3x_j^2 + a)/(2y_j)$
$x_l = s^2 - 2x_j$
$y_l = -y_j + s(x_j - x_l)$
If $y_j = 0$, then 2J = O, where 'O' is the point at infinity.
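The chord and tangent rules for point addition and point doubling over a prime field can be sketched in a few lines of Python. The toy curve $y^2 = x^3 + 2x + 2$ over $F_{17}$, the point G = (5, 1) and all function names are illustrative assumptions and do not come from this study; the sketch only shows how the slope 's' drives both operations.

```python
# Toy parameters (assumed for illustration): y^2 = x^3 + 2x + 2 over F_17.
P_MOD, A, B = 17, 2, 2
O = None  # the point at infinity


def on_curve(pt):
    """Check that pt satisfies the curve equation modulo P_MOD."""
    if pt is O:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def point_double(J):
    """Return L = 2J using the tangent at J."""
    if J is O:
        return O
    xj, yj = J
    if yj == 0:
        return O  # vertical tangent: 2J = O
    s = (3 * xj * xj + A) * pow((2 * yj) % P_MOD, -1, P_MOD) % P_MOD
    xl = (s * s - 2 * xj) % P_MOD
    yl = (-yj + s * (xj - xl)) % P_MOD
    return (xl, yl)


def point_add(J, K):
    """Return L = J + K using the line through J and K."""
    if J is O:
        return K
    if K is O:
        return J
    (xj, yj), (xk, yk) = J, K
    if xj == xk and (yj + yk) % P_MOD == 0:
        return O  # K = -J, so J + K = O
    if J == K:
        return point_double(J)
    s = (yj - yk) * pow((xj - xk) % P_MOD, -1, P_MOD) % P_MOD
    xl = (s * s - xj - xk) % P_MOD
    yl = (-yj + s * (xj - xl)) % P_MOD
    return (xl, yl)


G = (5, 1)
print(point_double(G))              # (6, 3)
print(point_add(G, (6, 3)))         # (10, 6)
```

The modular inverse is taken with Python's three-argument `pow` (Python 3.8 or later). A hardware design would replace it with a dedicated inversion unit.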
Elliptic curve domain parameters:
Domain parameters for EC over field $F_p$: An elliptic curve over $F_p$ has a list of domain parameters that includes 'p', 'a', 'b', 'G', 'n' and 'h':
• 'a' and 'b': Define the curve $y^2 \bmod p = x^3 + ax + b \bmod p$
• 'p': The prime number that defines the finite field $F_p$
• 'G': The generator point $(x_G, y_G)$ on the EC that is selected for cryptographic operations
• 'n': The order of the elliptic curve
• 'h': The cofactor $h = \#E(F_p)/n$, where $\#E(F_p)$ is the number of points on the elliptic curve
Domain parameters for EC over binary fields: An elliptic curve over $F_{2^m}$ has a list of domain parameters that includes 'm', f(x), 'a', 'b', 'G', 'n' and 'h':
• 'm': The integer that defines the finite field $F_{2^m}$
• f(x): The irreducible polynomial of degree m that is used for elliptic curve operations
• 'a' and 'b': Define the curve $y^2 + xy = x^3 + ax^2 + b$
• 'G': The generator point $(x_G, y_G)$ on the EC that is selected for cryptographic operations
• 'n': The order of the elliptic curve
• 'h': The cofactor $h = \#E(F_{2^m})/n$ (Anoop, 2007)
This research relies on polynomial arithmetic. Therefore, this part gives an overview of polynomial arithmetic.
EC over the field $F_{2^m}$ involves arithmetic on integers of length m bits. A binary string can be expressed as a polynomial:
Binary string: $(a_{m-1} \dots a_1 a_0)$
Polynomial: $a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \dots + a_2x^2 + a_1x + a_0$, where $a_i \in \{0, 1\}$.
For example, $x^3 + x^2 + 1$ is the polynomial for the four-bit number $1101_2$.
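The correspondence between bit strings and binary polynomials can be checked with a short Python sketch. The helper name `poly_str` is hypothetical; an integer is used as the bit string, with bit i holding the coefficient $a_i$. Because the coefficients are taken modulo 2, adding two such polynomials is a bitwise XOR of their bit strings.

```python
def poly_str(bits: int) -> str:
    """Render the bit string `bits` as a polynomial in x (bit i is a_i)."""
    terms = []
    for i in range(bits.bit_length() - 1, -1, -1):
        if (bits >> i) & 1:
            if i == 0:
                terms.append("1")
            elif i == 1:
                terms.append("x")
            else:
                terms.append(f"x^{i}")
    return " + ".join(terms) if terms else "0"


print(poly_str(0b1101))            # x^3 + x^2 + 1
print(poly_str(0b1101 ^ 0b0111))   # sum of two polynomials: x^3 + x
```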

MATERIALS AND METHODS
Elliptic curve algorithms: According to the hierarchy of elliptic curve operations, ECC needs three field operations: addition, multiplication and inversion. Addition in a binary field is an XOR operation. This part describes the basic modular multiplication and inversion that are used in elliptic curves over Galois fields.

Right-to-left shift-and-add field multiplication in $F_{2^m}$:
The shift-and-add method for field multiplication is based on the observation that:
$x(z) \cdot y(z) = x_{m-1}z^{m-1}y(z) + \dots + x_2z^2y(z) + x_1zy(z) + x_0y(z)$
Iteration 'i' of Algorithm 1 computes $z^i y(z) \bmod f(z)$ and, if $x_i = 1$, adds the result to the accumulator 'c'. If $y(z) = y_{m-1}z^{m-1} + \dots + y_2z^2 + y_1z + y_0$, then:
$y(z) \cdot z = y_{m-1}z^m + y_{m-2}z^{m-1} + \dots + y_2z^3 + y_1z^2 + y_0z$
$y(z) \cdot z \equiv y_{m-1}r(z) + (y_{m-2}z^{m-1} + \dots + y_2z^3 + y_1z^2 + y_0z) \pmod{f(z)}$
where $r(z) = f(z) - z^m$, so that $z^m \equiv r(z) \pmod{f(z)}$. Thus $y(z) \cdot z \bmod f(z)$ can be calculated by a left shift of the vector representation of y(z), followed by the addition of r(z) to y(z) if the high-order bit $y_{m-1}$ is 1.
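The shift-and-reduce step and the accumulation rule described above can be sketched in Python as follows. Polynomials are stored as integers (bit i is the coefficient of $z^i$), and the function name and the argument layout, in which `f` includes the $z^m$ term, are assumptions made for this sketch rather than part of the hardware design.

```python
def gf2m_mul_rtl(x: int, y: int, f: int, m: int) -> int:
    """Right-to-left shift-and-add: return x(z) * y(z) mod f(z) in F_{2^m}.

    f holds the full irreducible polynomial, e.g. 0b10011 for z^4 + z + 1.
    """
    r = f ^ (1 << m)            # r(z) = f(z) - z^m
    mask = (1 << m) - 1         # keeps the m low-order coefficients
    c = y if x & 1 else 0       # bit x_0 decides the initial accumulator
    for i in range(1, m):
        high = (y >> (m - 1)) & 1   # y_{m-1} before the shift
        y = (y << 1) & mask         # left shift of the vector representation
        if high:
            y ^= r                  # add r(z) when y_{m-1} was 1
        if (x >> i) & 1:
            c ^= y                  # accumulate z^i * y(z) mod f(z)
    return c


# z * z^3 = z^4 = z + 1 modulo z^4 + z + 1
print(bin(gf2m_mul_rtl(0b0010, 0b1000, 0b10011, 4)))   # 0b11
```

Each loop iteration needs only one shift and at most two XORs, which is why the method maps naturally onto shift registers.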
Algorithm 1: Right-to-left shift-and-add field multiplication in $F_{2^m}$:
Input: Binary polynomials x(z) and y(z) of degree at most m-1
Output: $c(z) = x(z) \cdot y(z) \bmod f(z)$
1. If $x_0 = 1$ then c ← y else c ← 0
2. For i from 1 to m-1 do
2.1 y ← y·z mod f(z)
2.2 If $x_i = 1$ then c ← c + y
3. Return (c)
A vector shift can be performed in hardware in one clock cycle, which makes the right-to-left shift-and-add field multiplication algorithm suitable for hardware.
Kaliski (1995) introduced the Montgomery modular inverse. It is defined as the Montgomery representation of the modular inverse, $A^{-1}2^m \pmod P$, in which 'm' is the bit length of 'P'. The Montgomery modular inverse algorithm is based on the extended binary GCD algorithm. This algorithm contains two phases as shown in algorithm 4.8. The result of the second phase can be obtained either by iterative halving modulo 'P' or by multiplication modulo 'P'. At the end of the loop, the values $g_1 = 1$ and $g_2 = 0$ make it possible to check that $G = -A^{-1}2^i \pmod P$, which is then brought back into the range [1, P-1]. The following algorithm rewrites the Kaliski Montgomery inverse algorithm by combining the two phases in one algorithm.

Algorithm 2: Montgomery modular inverse algorithm:
Input: A and P, where A ∈ GF($2^m$) and P is the modulus
Output: G, where $G \equiv A^{-1} \pmod P$
1. Set U = P, V = A, G = 0 and K = 1
2. Set i = 0, where i is an integer with m ≤ i < 2m
3. While V > 0 do
5. If G ≥ P, then set G = 2P - G, else set G = P - G
6. Output G
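For orientation, the following Python sketch shows the two-phase Montgomery inverse in the form commonly attributed to Kaliski (1995), for an odd prime integer modulus. It is not the combined single-loop variant of Algorithm 2, whose loop body is not reproduced above. The names `r` and `s` play roughly the roles of 'G' and 'K', which is an assumption of this sketch. Phase I uses only parity tests, halvings and doublings, and Phase II removes the accumulated factor $2^k$ by repeated halving modulo p.

```python
def montgomery_inverse(a: int, p: int) -> int:
    """Return a^-1 mod p for an odd prime p and 1 <= a < p (two-phase sketch)."""
    # Phase I: compute r = a^-1 * 2^k (mod p), the "almost Montgomery inverse".
    u, v, r, s, k = p, a, 0, 1, 0
    while v > 0:
        if u % 2 == 0:            # LSB of u is 0: halve u, double s
            u //= 2
            s *= 2
        elif v % 2 == 0:          # LSB of v is 0: halve v, double r
            v //= 2
            r *= 2
        elif u > v:
            u = (u - v) // 2
            r += s
            s *= 2
        else:
            v = (v - u) // 2
            s += r
            r *= 2
        k += 1
    if r >= p:
        r -= p
    r = p - r                     # correction step: r is now in [1, p-1]
    # Phase II: divide by 2 modulo p, k times, to strip the factor 2^k.
    for _ in range(k):
        r = r // 2 if r % 2 == 0 else (r + p) // 2
    return r


print(montgomery_inverse(3, 7))   # 5, since 3 * 5 = 15 = 1 (mod 7)
```

Stopping Phase II after k - m halvings instead of k would leave the result in Montgomery form, $a^{-1}2^m \bmod p$.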

Modular multiplication:
A finite field multiplier over $F_{2^m}$ plays a major role in determining the performance of hardware accelerators for cryptographic applications, so the multipliers must be designed for high efficiency. Bit-parallel and bit-serial are the two options for designing a modular multiplier. A bit-parallel multiplier achieves a high operating speed by completing one multiplication in one clock cycle. On the other hand, it has the highest circuit complexity, which makes it unsuitable for large operand sizes. Bit-serial operators are noticeably smaller than their bit-parallel counterparts, and they are independent of word width.

Fig. 3: Flowchart for right-to-left algorithm
A multiplier is said to be bit-serial if it produces only one bit of the product at each clock cycle. An advantage of bit-serial designs is that they require only a small number of input and output pins. A bit-serial multiplier has to be fitted directly to the data width. As a result, the area complexity is reduced to $O(n)$, compared with $O(n^2)$ for a parallel multiplier. Because of these advantages, a bit-serial architecture is chosen for our design. This research continues with efficient designs for the polynomial multiplier operation. The right-to-left algorithm is introduced based on the bit-serial architecture (Fig. 3).

RESULTS AND DISCUSSION
Bit-serial multiplier structure: The following flowchart shows the bit-serial multiplier steps. As the multiplier bits are shifted, the partial result is stored in 'R'. When R(n) is 1, the current partial result has overflowed the n-bit register, and one copy of the irreducible polynomial is subtracted from it. This reduction is an XOR operation, which completes the overall "modulo an irreducible polynomial" correction (Modares, 2009). Figure 4 depicts a multiplier 'X' and a multiplicand 'Y', where X, Y ∈ $F_{2^m}$. It processes the bits of 'X' from left to right, so it is called a Most Significant Bit (MSB) multiplier. The MSB multiplier performs a multiplication in $F_{2^m}$ in 'm' clock cycles. Table 3 summarizes its behavior for $f_4(x) = x^4 + x + 1$ and two different inputs 'x' and 'y', where ∧ symbolizes an AND gate and ⊕ symbolizes an XOR gate.
For 79 bits, our bit-serial multiplier needs 79 ANDs, 79 XORs and fewer than 400 FFs (flip-flops). A 79-bit multiplication is computed within 79 clock cycles, not including data input and output. In this implementation, control and memory access overheads bring the total execution time to less than 280 clock cycles. Execution starts with the processor sending the memory addresses of 'X', 'Y' and 'R' and ends when the final result is stored in 'R'. The multiplier is also used for squaring by loading X = Y.
The first partial result is 1101. It is shifted left (zero fill) and then XORed with 0 ∧ (1101), which returns the value 11010. This value exceeds the 4-bit register limit and needs reduction by an XOR with the irreducible polynomial $f_4(x) = x^4 + x + 1$, which is represented by 10011 in binary. At this stage the most significant bit is zeroed and 1001 is the adjusted result. It is then shifted left, which returns 10010 as the second-line result, and this is XORed with 0 ∧ (1101) in the third line. As shown in Table 3, the most significant bit is one, so the partial result is again 5 bits long and another reduction (XOR with 10011) is needed to restore the bit alignment. The result of shifting in the third step is 0010, which is used in the last step. As shown in the last step, there is no overflow and the 4-bit value is the final result.
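The walkthrough above can be replayed with a small Python model of an MSB-first bit-serial multiplier. The operand X = 1000 is an assumption chosen because it reproduces the intermediate values quoted in the text (1101, 1001 and 0010); the actual operands of Table 3 are not reproduced here, and the function name is hypothetical.

```python
def gf2m_mul_msb_steps(x: int, y: int, f: int, m: int) -> list:
    """MSB-first bit-serial multiply in F_{2^m}.

    Returns the reduced partial result after each of the m clock cycles;
    the last entry is x(z) * y(z) mod f(z).
    """
    r = 0
    steps = []
    for i in range(m - 1, -1, -1):
        r <<= 1                    # shift the partial result left (zero fill)
        if (x >> i) & 1:
            r ^= y                 # add x_i AND Y
        if (r >> m) & 1:
            r ^= f                 # overflow into bit m: XOR with f
        steps.append(r)
    return steps


steps = gf2m_mul_msb_steps(0b1000, 0b1101, 0b10011, 4)
print([format(s, "04b") for s in steps])   # ['1101', '1001', '0001', '0010']
```

Only one AND, one XOR for accumulation and one conditional XOR for reduction act on each register bit per cycle, which matches the gate counts that scale linearly with the operand length.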
The following block diagram shows a basic multiplier structure that receives the multiplier data serially. 'X' is the multiplier and resides in an n-bit shift register, and 'Y' is the multiplicand and resides in an n-bit register. 'R' holds the result and 'P' is the irreducible polynomial, which is used whenever an overflow happens. External data input connections and clocks are not shown in Fig. 5. Note that the register for 'P' does not need to store the leftmost bit of the polynomial.

Modular inversion:
Inversion is the most difficult finite field operation to implement in hardware. The division x/y in GF($2^m$) is implemented as two sequential operations: the inversion $y^{-1}$ followed by the multiplication $xy^{-1}$. Based on Algorithm 2, Fig. 6 summarizes the Montgomery inversion algorithm that has been chosen to compute inversion in polynomial representation. With this structure, the result never reaches a length of n+1 bits. The correction subtraction is handled as follows: when the 'n'-bit result has its leftmost bit equal to 1, the multiplication mode is set, and the current result register is reduced by the irreducible polynomial.

Bit-serial inversion structure:
The original Montgomery modular inverse has already been discussed. Because of the advantages given above, the bit-parallel design is modified into a bit-serial structure. The modified structure, which is based on a bit-serial architecture, is presented for the most computationally intensive operation over Galois fields, the inversion algorithm.
The previous block diagram shows the inversion architecture, which loads the inputs serially. All subsequent steps follow Algorithm 2, using the variables U, V, K and G to obtain the inversion result. If 'U' is even, the new values of 'U' and 'K' are straightforward to obtain. In a bit-serial design, evenness is determined from the Least Significant Bit (LSB): if the LSB of 'U' is '0', 'U' is even, so 'U' is shifted right once to obtain U = U/2 and 'K' is shifted left once to obtain K = 2K. The same procedure is used for 'V' and 'G'.
In the architecture block diagram, a multiplexer and a de-multiplexer, both controlled by a selector, pass the correct result to the output. In the de-multiplexer, if 'C0' and 'C3' are equal to '0', the output is placed in 'U'. Otherwise the value is placed in 'V'.
If U < V, then (U+V)/2 is the new value of 'V'. Otherwise it is the new value of 'U'. The remaining steps are omitted from the diagram to keep it simple.
Because the architecture is bit-serial, it has low power consumption, a regular structure, a low cost in terms of area occupied and a reduced number of pins. Thus, it is appropriate for embedded applications.
There are two phases depicted in Fig. 7. Both phases require shift-left and shift-right registers. A shift to the right divides the number by two, and a shift to the left multiplies it by two.
In this research, this aim is achieved by using shift-left and shift-right registers that work in serial-in parallel-out mode. The two following algorithms are used. The most important arithmetic operation in elliptic curve applications is the scalar multiplication that computes 'dP', where 'd' is an arbitrary integer and 'P' is a point on the elliptic curve. The scalar multiplication 'dP' can be defined simply as adding (d-1) copies of 'P' to 'P'.
If the bits of the scalar are checked from left to right, the method is called Most Significant Bit (MSB) first, or double-and-add. If they are checked from right to left, the method uses the Least Significant Bit (LSB) first to obtain the scalar multiplication. There has been much advanced research on modular arithmetic operations over finite fields, for example right-to-left shift field multiplication in $F_{2^m}$ and the Montgomery inversion method. These algorithms are used in our bit-serial hardware architecture to calculate scalar multiplication.

MSB (double-and-add):
The double-and-add algorithm is chosen because it is specified by ANSI (Greenlee, 1999) for scalar multiplication (Vijayalakshmi and Palanivelu, 2007). It is a fundamental technique for calculating scalar multiplication, and it works by repeating the point addition and point doubling operations discussed earlier. In this explanation, all equations for point addition and point doubling in GF(p) and GF($2^m$) are summarized.
The number of ones in the binary representation of 'k' is expected to be m/2, where 'm' is the bit length of the integer 'k'. The number of ones in 'k' determines how many times point addition is performed.
The number of point doubling operations is approximately equal to 'm'. Therefore, the double-and-add algorithm takes on average 'm' point doublings and m/2 point additions to perform one m-bit elliptic curve scalar multiplication.
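The operation counts discussed above can be illustrated with a short Python sketch of MSB-first double-and-add that also counts doublings and additions. To stay self-contained it uses ordinary integers as a stand-in for curve points (the "point" P is 1, addition is +, doubling is multiplication by 2); the function name and this stand-in group are assumptions for illustration only.

```python
def double_and_add(k: int, P, add, double):
    """MSB-first double-and-add.

    Returns (kP, number_of_doublings, number_of_additions).
    `add` and `double` are the group operations supplied by the caller.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]              # binary digits of k, most significant first
    Q = P                          # the leading 1 bit initializes Q = P
    doublings = additions = 0
    for b in bits[1:]:
        Q = double(Q)              # one doubling per remaining bit
        doublings += 1
        if b == "1":
            Q = add(Q, P)          # one addition per remaining 1 bit
            additions += 1
    return Q, doublings, additions


# Integers stand in for points: 23P = 2(2(2(2P) + P) + P) + P
print(double_and_add(23, 1, lambda a, b: a + b, lambda a: 2 * a))   # (23, 4, 3)
```

For k = 23 = 10111 in binary, the leading bit sets Q = P and the four remaining bits cost four doublings and three additions, in line with the estimate of about 'm' doublings and m/2 additions for an m-bit scalar.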
The data flow for ECC point doubling and point addition in GF($2^m$), which is based on Table 4, is presented in Fig. 8 and 9.
A bit-serial Montgomery modular inversion algorithm and a bit-serial right-to-left shift multiplication are developed in this research to reduce the number of input/output pins for scalar multiplication on elliptic curves using the double-and-add algorithm.
Therefore, in this research the scalar multiplication on elliptic curves in GF($2^m$) can deal with various binary polynomials in GF($2^m$). The arithmetic operations introduced for GF($2^m$) fields are suitable for hardware implementation, since they are binary arithmetic.

CONCLUSION
Nowadays, RSA is generally used as the public-key cryptosystem in most applications that use PKC. Recently, however, ECC has been gaining ground as a convenient cryptographic system. ECC is also becoming a substitute for RSA in efficiency-critical applications because of its efficiency in both software and hardware realizations. ECC provides an equivalent level of security with shorter key lengths than RSA. Shorter key lengths save bandwidth and power and enhance performance. ECC is also of interest to experts because it can be used to build a number of cryptographic schemes that cannot be constructed in any other way. The research started with a survey of cryptography, elliptic curve arithmetic and the hierarchy of elliptic curve operations and algorithms. Our approach began with an efficient design for finite field arithmetic, focusing mostly on inversion and multiplication. The design of efficient arithmetic algorithms in a bit-serial structure for right-to-left shift multiplication and Montgomery inversion was shown. Montgomery inversion plays a significant role in elliptic curve scalar multiplication. A bit-serial approach minimizes the number of inputs/outputs, which has a direct effect on power consumption. Three macrocells per bit are used for the multiplicand, the multiplier and the product. Area is also saved because the reduction polynomial does not need to be stored in a register. The proposed bit-serial architecture for multiplication and inversion in finite field arithmetic appears to offer an important reduction in area consumption in comparison with other designs.