Highly Efficient Elliptic Curve Crypto-Processor with Parallel GF(2^m) Field Multipliers

This study presents a high-performance GF(2^m) elliptic curve crypto-processor architecture. The proposed architecture exploits parallelism at the projective coordinate level to perform parallel field multiplications, and uses a normal basis representation. Comparisons between the Projective, Jacobian and Mixed coordinate systems using sequential and parallel designs are presented. The results show that parallel designs using a normal basis give a better area-time complexity (AT^2) than sequential designs by 33-252%, which opens a wide range of design tradeoffs. The results also show that the Mixed coordinate system is the best in both sequential and parallel designs: it gives the fewest multiplication levels when using 3 multipliers and the best AT^2 when using only 2 multipliers.


INTRODUCTION
Recently, Elliptic Curve Cryptosystems (ECC) [1,2] have attracted many researchers and have been included in many standards [3-8]. ECC is evolving as an attractive alternative to other public-key schemes such as RSA by offering the smallest key size and the highest strength per bit. Extensive research has been done on the underlying mathematics, security strength and efficient implementations. Among the different fields that can underlie elliptic curves, prime fields GF(p) and binary polynomial fields GF(2^m) have been shown to be best suited for cryptographic applications. In particular, binary fields allow for fast computation in software as well as in hardware. Small key sizes and computational efficiency make ECC applicable not only to hosts processing security protocols over wired networks, but also to small wireless devices such as cell phones, PDAs and smart cards.
Inversion, which is needed in point addition over elliptic curves, is the most expensive operation over finite fields [9-12]. The approach adopted in the literature is to represent elliptic curve points in projective coordinates in order to replace the inversion operations with repeated multiplications [9-15]. Recently, several ECC processors based on projective coordinate representation have been proposed in the literature [10-12,14,15]. There are many projective coordinate systems to choose from. In existing architectures, the selection of a projective coordinate system is based on the number of arithmetic operations, mainly multiplications. This is to be expected given the sequential nature of these architectures, where a single multiplier is used.
For high-performance servers, such sequential architectures are too slow to meet the demands of an increasing number of users, and high-speed crypto-processors are becoming crucial. One solution is to exploit the inherent parallelism within elliptic curve point operations in projective coordinates. Recently, ECC processor architectures have been proposed in which the choice of the projective coordinate system also depends on its inherent parallelism [11,12]. Since multiplication is the dominant and most time-consuming operation when computing point operations in projective coordinates, the architectures in [11,12] use three multipliers that work in parallel. These architectures give a better area-time complexity (AT^2) than architectures based on a single multiplier. In this study we propose an alternative parallel design using a normal basis representation, which is more suitable for hardware implementations. In addition, the complexity and parallelism of several homogeneous and heterogeneous projective coordinate systems are given.

GF(2^m) Arithmetic background:
The finite field GF(2^m) is of particular importance in cryptography since it leads to particularly efficient hardware implementations. Elements of the field are represented in terms of a basis. Most implementations use either a polynomial basis or a normal basis [16]. For the proposed crypto-processor described in this study, a normal basis is chosen because its operations consist mainly of rotation, shifting and exclusive-OR operations, which can be implemented efficiently in hardware. A normal basis of GF(2^m) is a basis of the form (β, β^2, β^4, β^8, ..., β^(2^(m-1))), where β ∈ GF(2^m). In a normal basis, an element A ∈ GF(2^m) can be uniquely represented in the form

$$A = \sum_{i=0}^{m-1} a_i \beta^{2^i}, \qquad a_i \in \{0, 1\}. \qquad (1)$$

An optimal normal basis (ONB) [17] is one with the minimum number of terms in (2.1), or equivalently, the minimum possible number of nonzero coefficients λ_ij in the normal basis multiplication table. This value is 2m-1, and since it allows multiplication with minimum complexity, such a basis normally leads to more efficient hardware implementations.
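As a minimal sketch of why normal basis arithmetic is cheap, the snippet below models an element as the list of its coefficients a_0, ..., a_(m-1) with respect to β, β^2, ..., β^(2^(m-1)). The function names `nb_add` and `nb_square` are illustrative and do not come from the original design. Multiplication is left out because it depends on the particular basis chosen.

```python
from typing import List


def nb_add(a: List[int], b: List[int]) -> List[int]:
    """Field addition in GF(2^m): coefficient-wise XOR."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length m")
    return [x ^ y for x, y in zip(a, b)]


def nb_square(a: List[int]) -> List[int]:
    """Squaring in a normal basis: a cyclic shift of the coefficients.

    The coefficient of beta^(2^i) moves to beta^(2^(i+1)), and the last
    coefficient wraps around to position 0 because beta^(2^m) = beta.
    """
    return a[-1:] + a[:-1]


if __name__ == "__main__":
    a = [1, 0, 1, 1, 0]          # hypothetical element of GF(2^5)
    b = [0, 1, 1, 0, 0]
    print(nb_add(a, b))          # XOR of the two coefficient vectors
    print(nb_square(a))          # rotated coefficient vector
```

Applying `nb_square` m times returns the original vector, which reflects the fact that A^(2^m) = A for every element of GF(2^m).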
Inversion: The inverse of a nonzero element a ∈ GF(2^m), denoted a^(-1), is defined by $a \cdot a^{-1} = 1$ in GF(2^m). Most inversion algorithms in use are derived from Fermat's Little Theorem: $a^{-1} = a^{2^m - 2}$ for all a ≠ 0 in GF(2^m). The Itoh and Tsujii inversion algorithm [18] is one of the most efficient inversion algorithms proposed thus far.
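The following sketch illustrates Fermat-based inversion and the Itoh-Tsujii addition chain. For brevity it works in a polynomial basis over the small field GF(2^8), which is an assumption made only for this example; the design described here uses a normal basis, where each squaring is a cyclic shift. The constants `M` and `POLY` and all function names are hypothetical.

```python
M = 8            # toy field degree, chosen only for illustration
POLY = 0x11B     # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^M), polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:       # reduce as soon as degree reaches M
            a ^= POLY
    return r


def gf_sqr(a: int) -> int:
    return gf_mul(a, a)


def gf_inv_fermat(a: int) -> int:
    """a^(2^M - 2) as the product of a^2, a^4, ..., a^(2^(M-1))."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    result, sq = 1, a
    for _ in range(M - 1):
        sq = gf_sqr(sq)
        result = gf_mul(result, sq)
    return result


def gf_inv_itoh_tsujii(a: int) -> int:
    """Itoh-Tsujii chain: build beta_k = a^(2^k - 1) up to k = M - 1."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    beta, k = a, 1                       # beta_1 = a
    for bit in bin(M - 1)[3:]:           # bits of M-1 after the leading 1
        t = beta
        for _ in range(k):               # t = beta_k^(2^k)
            t = gf_sqr(t)
        beta, k = gf_mul(t, beta), 2 * k # beta_2k = beta_k^(2^k) * beta_k
        if bit == "1":
            beta, k = gf_mul(gf_sqr(beta), a), k + 1  # beta_(k+1)
    return gf_sqr(beta)                  # a^(2^M - 2)


if __name__ == "__main__":
    x = 0x53
    print(gf_mul(x, gf_inv_itoh_tsujii(x)))  # product of x and its inverse
```

The Fermat loop uses M-2 nontrivial multiplications, whereas the Itoh-Tsujii chain needs only floor(log2(M-1)) + HW(M-1) - 1 multiplications (HW being the Hamming weight), which matters most when squarings are nearly free, as in a normal basis.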
Elliptic curves: Here we present a brief introduction to elliptic curves. Let GF(2^m) be a finite field of characteristic two. A non-supersingular elliptic curve E over GF(2^m) is defined to be the set of solutions (x, y) ∈ GF(2^m) × GF(2^m) to the equation y^2 + xy = x^3 + ax^2 + b, where a, b ∈ GF(2^m) and b ≠ 0, together with the point at infinity denoted by O. It is well known that E forms a commutative finite group, with O as the group identity, under the addition operation known as the tangent and chord method. Explicit rational formulas for the addition rule involve several arithmetic operations (addition, squaring, multiplication and inversion) in the underlying finite field. In affine coordinates, the elliptic group operation is given by the following.
Let P = (x_1, y_1) ∈ E; then -P = (x_1, x_1 + y_1). For all P ∈ E, O + P = P + O = P. If Q = (x_2, y_2) ∈ E and Q ≠ -P, then P + Q = (x_3, y_3), where

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad y_3 = \lambda (x_1 + x_3) + x_3 + y_1,$$

with $\lambda = (y_1 + y_2)/(x_1 + x_2)$ if P ≠ Q and $\lambda = x_1 + y_1/x_1$ if P = Q.

Computing P + Q is called elliptic curve point addition if P ≠ Q and elliptic curve point doubling if P = Q. Scalar multiplication is the basic operation for ECC. Scalar multiplication in the group of points of an elliptic curve is the analogue of exponentiation in the multiplicative group of integers modulo a fixed integer. Computing dP can be done with the straightforward double-and-add approach based on the binary expansion d = (d_(l-1), ..., d_0), where d_(l-1) is the most significant bit of d. Several other scalar multiplication methods have been proposed in the literature; a good survey is presented by Gordon in [19].
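The affine group law and the double-and-add method can be sketched as follows. The example uses the standard affine formulas for the curve y^2 + xy = x^3 + ax^2 + b on a toy curve over GF(2^4) in a polynomial basis; the field, the curve parameters `A` and `B`, and all function names are assumptions chosen for illustration, and the field is far too small for any real use.

```python
M = 4             # toy field GF(2^4)
POLY = 0b10011    # x^4 + x + 1, irreducible over GF(2)
A, B = 1, 1       # hypothetical curve y^2 + xy = x^3 + A*x^2 + B, B != 0
O = None          # point at infinity


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M), polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def gf_inv(a):
    """Fermat inversion: a^(2^M - 2)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    result, sq = 1, a
    for _ in range(M - 1):
        sq = gf_mul(sq, sq)
        result = gf_mul(result, sq)
    return result


def on_curve(P):
    if P is O:
        return True
    x, y = P
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A, x2) ^ B


def ec_add(P, Q):
    """Affine point addition and doubling (one field inversion each)."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and y2 == x1 ^ y1:          # Q = -P
        return O
    if P == Q:                               # doubling
        lam = x1 ^ gf_mul(y1, gf_inv(x1))
    else:                                    # addition
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


def scalar_mult(d, P):
    """Left-to-right double-and-add over the bits of d."""
    R = O
    for bit in bin(d)[2:]:
        R = ec_add(R, R)                     # doubling for every bit
        if bit == "1":
            R = ec_add(R, P)                 # addition for the 1 bits only
    return R
```

Each call to `ec_add` costs one field inversion, which is exactly the cost that projective coordinates are meant to remove. The loop in `scalar_mult` also shows why, for a random scalar, a doubling is performed for every bit while an addition is performed for about half of the bits.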

Projective coordinates in GF(2^m):
Projective coordinates are used to eliminate the need for performing inversion. For elliptic curves defined over GF(2^m), many different forms of formulas for point addition and doubling are found [9,20,22]. The projective coordinate system (Pr), also called the homogeneous coordinate system, has the form (x, y) = (X/Z, Y/Z) [20], while the Jacobian coordinate system has the form (x, y) = (X/Z^2, Y/Z^3) [9]. From the Jacobian coordinate system, two other coordinate systems were proposed: the Chudnovsky Jacobian coordinate system (J_c), which represents a point with the quintuple (X, Y, Z, Z^2, Z^3), and the Modified Jacobian coordinate system (J_m), which represents a point with the quadruple (X, Y, Z, aZ^4). Mixed coordinates were proposed in [22], leading to better performance.
Only the number of multiplications is considered for the Projective and Jacobian coordinate systems, since the other field arithmetic operations require negligible time compared to multiplication. This is because of the nature of the normal basis over GF(2^m), in which addition and subtraction are performed simply by an XOR operation and squaring is performed by a single rotation, as pointed out earlier.
ECC Crypto-processor architecture: This section describes the basic idea and the proposed generic architecture of the ECC crypto-processor. The methodology used to find the number of multipliers in each parallel design is also discussed.

Generic ECC Crypto-processor architecture with multi-multipliers:
The basic idea builds on the parallelism of projective coordinate multiplications proposed in [11,12], where three multipliers were employed in parallel to obtain a better AT^2.
The work reported in [11,12] uses a polynomial basis and counts squaring as a multiplication, whereas the cost of squaring is negligible in a normal basis or when using an irreducible trinomial [21]. This makes a big difference in the number of multiplication cycles, as discussed in the next section. The proposed generic crypto-processor architecture is based on a normal basis and uses 2-4 multipliers, a cyclic shift register to perform squaring, an XOR unit for field addition and a register file. Only one cyclic shift register and one XOR unit are used, since both squaring and field addition require only one clock cycle, so these units can be reused several times while a single multiplication is being computed. Each of these arithmetic units can get operands from the register file and store the result in the register file. The controller generates control signals for all the arithmetic units and the register file (Fig. 1).

Methodology used to find the number of multipliers:
Since multiplication is the dominant operation in elliptic curve point operations in projective coordinates, and since the computation time of a multiplication is much higher than that of field squaring and addition, the emphasis in this study is on speeding up point operations in projective coordinates by performing more than one multiplication at a time. The approach adopted in this study is:
1. Analyze the dataflow of point operations for each projective coordinate system in the following manner:
   - Find the critical path, which has the lowest number of multiplication operations.
   - Find the maximum number of multipliers needed to meet this critical path.
2. Vary the number of multipliers from one to the number of multipliers specified by the critical path to find the following:
   - The best schedule of each dataflow using the specified number of multipliers.
   - The AT^2.
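The two steps above can be sketched with a small dataflow graph of multiplications. The graph `TOY_DEPS` below is a made-up example, not one of the point addition or doubling dataflows analyzed in this study, and the greedy list scheduler is a stand-in technique: it is a common heuristic and is not guaranteed to find the best schedule in general.

```python
from typing import Dict, List

# Hypothetical dataflow: each multiplication maps to the multiplications
# whose results it needs. Squarings and additions are ignored here.
TOY_DEPS: Dict[str, List[str]] = {
    "m1": [], "m2": [], "m3": [],
    "m4": ["m1", "m2"],
    "m5": ["m3"],
    "m6": ["m4", "m5"],
    "m7": ["m4"],
}


def critical_path(deps: Dict[str, List[str]]) -> int:
    """Length, in multiplications, of the longest dependency chain."""
    memo: Dict[str, int] = {}

    def depth(n: str) -> int:
        if n not in memo:
            memo[n] = 1 + max((depth(p) for p in deps[n]), default=0)
        return memo[n]

    return max(depth(n) for n in deps)


def heights(deps: Dict[str, List[str]]) -> Dict[str, int]:
    """Longest chain from each node down to a final result."""
    succs: Dict[str, List[str]] = {n: [] for n in deps}
    for n, preds in deps.items():
        for p in preds:
            succs[p].append(n)
    memo: Dict[str, int] = {}

    def h(n: str) -> int:
        if n not in memo:
            memo[n] = 1 + max((h(s) for s in succs[n]), default=0)
        return memo[n]

    return {n: h(n) for n in deps}


def schedule(deps: Dict[str, List[str]], k: int) -> List[List[str]]:
    """Greedy list scheduling with k multipliers, longest chain first."""
    h = heights(deps)
    done, levels = set(), []
    while len(done) < len(deps):
        ready = [n for n in deps
                 if n not in done and all(p in done for p in deps[n])]
        ready.sort(key=lambda n: (-h[n], n))   # ties broken by name
        level = ready[:k]
        levels.append(level)
        done.update(level)
    return levels


if __name__ == "__main__":
    print("critical path:", critical_path(TOY_DEPS))
    for k in range(1, 5):
        print(k, "multipliers:", len(schedule(TOY_DEPS, k)), "levels")
```

On this toy graph the number of multiplication levels stops improving once it reaches the critical path length, so adding multipliers beyond that point only adds area.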
The critical paths of the Projective and Jacobian coordinate systems are listed in Table 2 for both point addition and doubling. The critical path of the Mixed coordinate system is chosen as the best critical path that can be reached among all mixed coordinate systems. The critical path of the Projective coordinate system is 4 and 3 multiplications for point addition and doubling, respectively. From Table 1, we can see that the total number of multiplications needed with the Projective coordinate system is 16 and 7 for point addition and doubling, respectively. This means that using one multiplier gives an average of (16/2) + 7 = 15 multiplication cycles, since, on average, we perform doubling for all the bits in the key and point addition for only half of the key bits. Table 2 summarizes the average number of multiplication cycles required for point operations using 1, 2, 3 and 4 multipliers, and Table 3 shows clearly the advantage of parallel designs in reducing the average number of multiplication cycles when using the Mixed coordinate system. It is worth noting that the work reported in [11,12] uses a polynomial basis and counts squaring as a multiplication, whereas the cost of squaring is negligible when using a normal basis or an irreducible trinomial [21]. This makes a big difference in the number of multiplication cycles, as can be seen from Tables 2 and 3, and also has a significant impact on the utilization of the multipliers.
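The averaging rule used above, and the AT^2 figure of merit, can be written down as a short sketch. The function names are illustrative. The way `enhancement_percent` turns two AT^2 values into a percentage is an assumption about how such an enhancement could be computed, not a definition taken from the tables, and the units of area and time are left abstract.

```python
def average_multiplication_cycles(add_cycles: float, double_cycles: float) -> float:
    """Average multiplication cycles per key bit.

    A doubling is performed for every key bit and an addition for about
    half of the key bits, hence add_cycles / 2 + double_cycles.
    """
    return add_cycles / 2 + double_cycles


def at2(area: float, time: float) -> float:
    """Area-time complexity AT^2 (lower is better)."""
    return area * time ** 2


def enhancement_percent(at2_sequential: float, at2_parallel: float) -> float:
    """Assumed definition: relative AT^2 gain of the parallel design, in percent."""
    return (at2_sequential / at2_parallel - 1) * 100


if __name__ == "__main__":
    # Projective coordinates with one multiplier: 16 and 7 multiplications.
    print(average_multiplication_cycles(16, 7))   # 15.0
```

With more than one multiplier, `add_cycles` and `double_cycles` would be the number of multiplication levels in the schedule rather than the total number of multiplications.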

RESULTS
Table 4 compares the different coordinate systems. Four cases are covered: a single multiplier (sequential), two multipliers (parallel), three multipliers (parallel) as in [11,12], and four multipliers (parallel).
The results in Table 4 show that the parallel designs always give a better AT^2 than the sequential design, by 33-252% (Table 5). This wide range of enhancements provides designers with a large range of trade-offs.
It is clear from Table 4 that, with the Projective coordinate system, the enhancement in AT^2 increases as more multipliers are employed. The maximum number of multipliers that satisfies the critical path was found to be 4. The enhancements using parallel designs with the Projective coordinate system, as shown in Table 5, were found to be 76%, 108% and 252% when using 2, 3 and 4 multipliers, respectively. The Projective coordinate system gives a better AT^2 than the Jacobian coordinate system when employing 4 multipliers, but worse results when using fewer multipliers.
Only the Jacobian projective coordinate system can benefit from using 5 multipliers, in which case it requires an average of 4 multiplication cycles, the same as the Projective coordinate system gives with only 4 multipliers. We can also notice that using 3 multipliers, as in [11,12], gives a better result than using 4 multipliers with the Jacobian coordinate system (Table 4). This shows clearly that adding more multipliers does not necessarily increase performance, as depicted in Fig. 2. However, the best results reported in Table 4 are obtained with the Mixed coordinate system, which always gives the best AT^2 compared to the others. It can also be seen from Table 2 that using 4 multipliers gives the same number of multiplication cycles as using only 3 multipliers. From Tables 4 and 5, we can see that 2 multipliers give the best AT^2 of all implementations, including the use of a single multiplier. A more significant observation from Tables 4 and 5 is that the proposed architecture with the Mixed coordinate system is not only faster in parallel implementations but also leads to a better AT^2 (cost) than the other alternatives.

Table 5: Comparison between the different designs based on Table 4

CONCLUSION
In this study we presented a high-performance GF(2^m) elliptic curve crypto-processor. Parallelism was exploited at the projective coordinate level using 2, 3 and 4 multipliers to perform parallel field multiplications represented in an optimal normal basis. Comparisons between the Projective, Jacobian and Mixed coordinate systems using sequential and parallel designs were also presented. The results show that parallel designs in an optimal normal basis give a better AT^2 than sequential designs by 33-252%, which gives designers a wide range of design tradeoffs. The results also show that Mixed coordinates are the best in both sequential and parallel designs, giving the fewest multiplication cycles with 3 multipliers and the best AT^2 with only 2 multipliers.