Reconfigurable Elliptic Curve Crypto-Hardware Over the Galois Field GF(2^163)

Problem statement: In the last decade, many hardware designs for elliptic curve cryptography have been developed with the aim of accelerating the scalar multiplication process, mainly on Field Programmable Gate Arrays (FPGA). The major issue is embedding this strategic and strong algorithm in very little hardware, that is, finding an optimal solution to a one-to-many problem: portability against power consumption and speed against area, while maintaining security at its highest level. Our strategy is to execute the ECC algorithm in hardware, which rests on performing the scalar multiplication over GF(2^163) in a restricted number of clock cycles. We target the acceleration of the basic field operations, mainly multiplication and inversion, under the constraint of hardware optimization. Approach: The research was based on the efficient Montgomery add and double algorithm, the Karatsuba-Ofman multiplier and the Itoh-Tsujii algorithm for the inverse component. The hardware implementation was built around an optimized Finite State Machine (FSM), with a single-cycle 163-bit multiplier and a script-generated field squarer. The main characteristics of the design are the elimination of the internal component-to-component delays, the minimization of the global clocking resources and a strategic separation of the data path from the control part. Results: The working frequency of our design reached 561 MHz, allowing 161786 scalar multiplications per second and outperforming one of the best state-of-the-art implementations (555 MHz). The other contribution is the acceleration of the field inversion scheme, which runs at a frequency of 777.341 MHz. Conclusion: The results indicate that different optimizations at the hardware level efficiently accelerate the ECC scalar multiplication, and that the choice of the target circuit greatly improves propagation delays and increases frequency.


INTRODUCTION
In the last decade, the hardware implementation of Elliptic Curve Cryptography (ECC) algorithms has been the subject of an intensive race, driven essentially by requirements on security, speed and area. Security deals mainly with the ability to withstand attacks [1], while speed and area form the eternal trade-off: the ability to run intensive cryptographic processes while keeping the hardware used as small as possible. In other words, it is the ability to embed a strategic and strong algorithm in very little hardware, that is, to find an optimal solution to a one-to-many problem: portability against power consumption and speed against area, with security remaining the main issue in cryptography.
Cryptography has become one of the most important fields in our lives, due essentially to two factors: the growing need for secrecy on one side and the growing activity of code breakers and hackers on the other. It is no longer safe to use one's birth date or the name of one's child as a password for banking or even mailing accounts.
Organizations tend to increase their benefits by keeping their information systems as transparent as possible. On the other hand, hackers and code or key breakers are organizing themselves into unofficial groups, which makes it necessary to stay a step ahead of them before the codes are broken.
Scientists tend to complicate the reverse engineering of the encryption system while keeping encryption keys as short as possible. This issue is being tackled by many mathematicians, mainly those working on elliptic curves [2].
The beauty of this new field lies in the simplicity of the operators used in the encryption process, in the fact that keys can be exchanged over non-secure transmission channels and in the added complexity faced by hackers when unwanted information leaks out of the organization.
Why elliptic curve cryptography?: In 1985, Koblitz and Miller introduced the use of elliptic curves in public key cryptography, called Elliptic Curve Cryptography (ECC). The main operation on elliptic curves consists of multiplying a point by a scalar in order to get a second point. The complexity arises from the fact that, given the initial point and the final point, the scalar cannot feasibly be deduced. This leads to a very difficult inversion (cryptanalysis) problem, known as the elliptic curve discrete logarithm problem.
With their small key sizes, ECC algorithms nowadays present the hardest challenge for cryptanalysis compared to RSA or AES; dealing with ECC therefore leads to smaller hardware area, less bandwidth use and more secure transactions, as shown in (Table 1).
The attractiveness of ECC algorithms is that they operate on a Galois Field (GF) by means of two simple operations, field addition and field multiplication, which define the arithmetic of GF(p^m), where p is a prime and m is a positive integer (itself prime for the curves considered here). For hardware implementations a binary field is preferred, and the pair (p, m) defines the set of elliptic curves. In our case, p = 2 and m = 163.
In this research, we present an FPGA hardware implementation of the elliptic curve cryptography scheme, using the Montgomery scalar multiplication based on the "add and double" algorithm, targeting as a primary goal an increase in the speed of the hardware implementation and an optimization in the ensuing inverse component.

MATERIALS AND METHODS
Hardware implementation: The strategy of executing the ECC algorithms in hardware rests on the ability to perform the scalar multiplication in GF(2^m) in very few clock cycles. As m increases, implementations become very time and resource consuming.
Most of the known architectures accelerate the multiplication process: by modifying the elliptic equations through a change of the Z coordinate term [4], by multiplication scalability [5], by using many serial and parallel arithmetic units [6], by using highly parallel Karatsuba multipliers [7], by Massey-Omura multipliers [8], by a hybrid multiplier approach [9], by parallel approaches [10], by a new word-level structure [11], by the systolic architecture of [12], by the halve-and-add method of [13], or by parallelizing both the add and double Montgomery algorithms [14].
The second problem concerns inversion, which has been tackled by [15] on the basis of Fermat's little theorem [16], or by the almost inverse algorithm based on Kaliski's research [17].
In order to concentrate on one of the problems, the ECC equations have been modified [18] so as to postpone inversion to the last stage and deal only with the multiplication process until then.
In the next part we present the mathematical background of ECC, while in the material and methods section we present the proposed FPGA hardware implementation, followed by the simulation results; we complete this study with a discussion and a final conclusion.
Elliptic curve mathematical background: ECC is based on the discrete logarithm problem applied to elliptic curves over a finite field. In particular, for an elliptic curve E, it relies on the fact that it is computationally easy to find:

$$Q = k \cdot P \qquad (1)$$

Where:
P and Q = Points of the elliptic curve E whose coordinates belong to the underlying GF(2^m)
k = A scalar that belongs to the set of numbers {1…#G-1}, #G being the order of the curve E

Nowadays, there is no known algorithm able to compute k given P and Q in sub-exponential time [18].
The equation of a non-supersingular elliptic curve over the underlying field GF(2^m) is presented in Eq. 2. It is formed by choosing the elements "a" and "b" within GF(2^m) with b ≠ 0:

$$y^2 + xy = x^3 + a x^2 + b \qquad (2)$$

In the affine-coordinate representation, a finite point on E(GF(2^m)) is specified by two coordinates x and y, both belonging to GF(2^m) and satisfying Eq. 2. The point at infinity has no affine coordinates.
In most ECC hardware designs, three coordinates are used in order to avoid the repeated division of Eq. 3, which consumes a lot of resources in terms of execution cycles as well as memory and power. A point is converted from a pair of coordinates to a triple of coordinates using one of the transforms of (Table 2).
In our implementation, the Lopez-Dahab mapping is applied, because the set of operations is reduced compared to the other mappings [13] as presented in (Table 3).
Thus a point P(x, y) is mapped into P(X,Y,Z), that is a third projective coordinate is introduced in order to "flatten" the equations and avoid the division.
The startup transformation required for the implementation is simply done by initializing X, Y and Z as in Eq. 4 [20]. Introducing the new tri-coordinates into Eq. 2 yields Eq. 5, on which the VHDL implementation is based.
Mapping projective to affine coordinates: After completion of the successive operations of addition and multiplication, the result is mapped back to two affine coordinates.
In order to make the different computations, the Montgomery point doubling and Montgomery point addition algorithms are used, relying on the ingenious observation of Montgomery that the Y coordinate does not participate in the computations and can be delayed to the final stage [20]. Thus, the computation works with only two projective coordinates.
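As a reference sketch only: in the standard Lopez-Dahab projective system, a point (X, Y, Z) with Z ≠ 0 stands for the affine point (X/Z, Y/Z^2). Assuming the design follows this usual convention (the symbols below are the standard textbook ones, not quoted from Eq. 4 to 6), the initialization, the projective curve equation obtained from the affine form y^2 + xy = x^3 + ax^2 + b, and the mapping back to affine coordinates would read:

```latex
\begin{aligned}
(x, y) &\mapsto (X, Y, Z) = (x, y, 1) \\
Y^{2} + XYZ &= X^{3}Z + aX^{2}Z^{2} + bZ^{4} \\
(X, Y, Z) &\mapsto (x, y) = \left( X/Z,\; Y/Z^{2} \right), \quad Z \neq 0
\end{aligned}
```

Substituting x = X/Z and y = Y/Z^2 into the affine equation and multiplying by Z^4 gives the middle line, which is why no field division is needed until the final conversion.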
Let us consider the points P(X_1, Y_1, Z_1), R(X_2, Y_2, Z_2) and Q(X_3, Y_3, Z_3) belonging to the curve E(GF(2^163)), where R = 2 × P and Q = P + R. Using the Montgomery method, the computations become, respectively:
• The Montgomery point doubling algorithm (MontgDouble, Eq. 7), requiring 4 field squaring operations, 2 field multiplications and 1 simple field addition
• The Montgomery point addition algorithm (MontgAdd, Eq. 8), requiring 1 field squaring operation, 4 field multiplications and 2 simple field additions
For the hardware implementation, k is represented on an m-bit register as:

$$k = \sum_{i=0}^{m-1} k_i 2^i, \quad k_i \in \{0, 1\}$$

Both Eq. 7 and 8 are used to evaluate Eq. 1 through the Montgomery scalar multiplication algorithm shown in Fig. 1.
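The ladder described above can be sketched in software as follows. This is a minimal Python model, not the authors' VHDL: it assumes the standard Lopez-Dahab x-only formulas (whose operation counts match those listed for MontgDouble and MontgAdd), processes the bits of k from the most significant end (the ordering in Fig. 1 may differ), and uses b = 1 purely for illustration. The names `f_mul`, `f_sqr`, `f_inv`, `m_double`, `m_add` and `ladder_x` are hypothetical.

```python
# Sketch of an x-only Montgomery ladder over GF(2^163).
# Field elements are Python ints whose bits are polynomial coefficients.

M = 163
# NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
B = 1  # curve coefficient b: illustrative assumption, not taken from the paper


def f_mul(a, b):
    """Field multiplication: shift-and-XOR with interleaved reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:              # degree reached 163: XOR the modulus away
            a ^= POLY
    return result


def f_sqr(a):
    return f_mul(a, a)


def f_inv(a):
    """Inverse via Fermat: a^(2^163 - 2), plain square-and-multiply."""
    result, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            result = f_mul(result, a)
        a = f_sqr(a)
        e >>= 1
    return result


def m_double(X, Z):
    """(X, Z) -> 2P: 4 squarings, 2 multiplications, 1 addition."""
    x2, z2 = f_sqr(X), f_sqr(Z)
    return f_sqr(x2) ^ f_mul(B, f_sqr(z2)), f_mul(x2, z2)


def m_add(x, X1, Z1, X2, Z2):
    """P1 + P2, given x = x(P2 - P1): 1 squaring, 4 multiplications, 2 additions."""
    t1, t2 = f_mul(X1, Z2), f_mul(X2, Z1)
    Z3 = f_sqr(t1 ^ t2)
    return f_mul(x, Z3) ^ f_mul(t1, t2), Z3


def ladder_x(k, x):
    """Affine x-coordinate of k*P for a point P with x(P) = x != 0, k >= 1."""
    X1, Z1 = x, 1                   # P1 = P
    X2, Z2 = m_double(x, 1)         # P2 = 2P, so P2 - P1 = P throughout
    for bit in bin(k)[3:]:          # bits of k after the leading 1
        if bit == "1":
            X1, Z1 = m_add(x, X1, Z1, X2, Z2)
            X2, Z2 = m_double(X2, Z2)
        else:
            X2, Z2 = m_add(x, X1, Z1, X2, Z2)
            X1, Z1 = m_double(X1, Z1)
    if Z1 == 0:
        return None                 # point at infinity
    return f_mul(X1, f_inv(Z1))     # the single inversion, at the last stage
```

Only the x-coordinate is tracked, which reflects the observation that Y can be left out of the loop; recovering y at the end is omitted from this sketch.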
The inversion in GF(2^163), required at the final stage, can be realized by one of two known methods: the Extended Euclidean algorithm, or Fermat's theorem, which states that for any nonzero element a of GF(2^m):

$$a^{2^m - 1} = 1 \quad\Rightarrow\quad a^{-1} = a^{2^m - 2}$$

Thus, in order to compute the inverse of an element of GF(2^163), one needs to raise this element to the power 2^163 - 2.
Using the Itoh-Tsujii algorithm, based on the add and multiply method, the inverse is realized as presented in (Table 3) [21].
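The Fermat-based inversion can be illustrated with a small software model. The sketch below assumes the textbook Itoh-Tsujii addition chain driven by the binary expansion of m - 1 = 162 and counts individual squarings and multiplications; the hardware groups the squarings into a smaller number of power-squaring steps, which this sketch does not attempt to model. The function names are hypothetical.

```python
# Sketch of Itoh-Tsujii inversion in GF(2^163) with operation counting.

M = 163
# NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def f_mul(a, b):
    """Field multiplication: shift-and-XOR with interleaved reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


def itoh_tsujii_inverse(a):
    """Return (a^-1, squarings, multiplications) for a nonzero element a."""
    counts = {"sqr": 0, "mul": 0}

    def sqr_n(v, n):
        # n successive squarings, i.e. v^(2^n)
        for _ in range(n):
            v = f_mul(v, v)
        counts["sqr"] += n
        return v

    def mul(u, v):
        counts["mul"] += 1
        return f_mul(u, v)

    beta, k = a, 1                          # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:              # bits of 162 after the leading 1
        beta = mul(sqr_n(beta, k), beta)    # k -> 2k
        k *= 2
        if bit == "1":
            beta = mul(sqr_n(beta, 1), a)   # k -> k + 1
            k += 1
    # k is now m - 1, and a^-1 = (a^(2^(m-1) - 1))^2 = a^(2^m - 2)
    return sqr_n(beta, 1), counts["sqr"], counts["mul"]
```

For m = 163 this chain uses 9 field multiplications, the same multiplication count quoted for the inverter, together with 162 single squarings.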

FPGA implementation:
The 163-bit ECC component has been developed in VHDL. The components forming the design are as follows:
• A 163-bit adder, which is simply an array of 163 two-input XOR gates
• A 163-bit modulo component, which is a XOR array evaluated through a Matlab script as an input-output matrix, performing polynomial reduction with the polynomial proposed by the National Institute of Standards and Technology (NIST), P(x) = x^163 + x^7 + x^6 + x^3 + 1 [22]
• A 163-bit squarer that has also been generated from a Matlab script
• A 163-bit modified version of the Karatsuba-Ofman multiplier circuit, based on splitting each operand into 3 equal parts [High (H), Middle (M) and Low (L) bits]; the 'L-M-H' multiplier starts from a basic 7-bit multiplier, leading to the following tree: 7→19→57→163-bit multipliers
• A Galois inverter circuit requiring 21 power squarings and 9 field multiplications within only 32 cycles
The ECC block diagram implementation is shown in Fig. 2.
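The behaviour of the arithmetic components listed above can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions, not the generated VHDL: the three-way split uses the common six-product Karatsuba identity, which may differ from the authors' modified version, and all function names are hypothetical. Only the part widths 163, 57, 19 and 7 and the polynomial P(x) come from the text.

```python
# Behavioural sketch of the GF(2^163) adder, reducer, squarer and multiplier.
# Field elements are Python ints whose bits are polynomial coefficients.

M = 163
# NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
WIDTHS = [163, 57, 19, 7]       # multiplier tree from the text: 7 -> 19 -> 57 -> 163


def f_add(a, b):
    """Field addition: bitwise XOR of the 163 coefficient pairs."""
    return a ^ b


def clmul(a, b):
    """Schoolbook carry-less (polynomial) multiplication, no reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba3(a, b, level=0):
    """Unreduced product using a 3-way (L, M, H) split at each level."""
    if level == len(WIDTHS) - 1:
        return clmul(a, b)                  # 7-bit base multiplier
    n = WIDTHS[level + 1]                   # width of each of the three parts
    mask = (1 << n) - 1
    a0, a1, a2 = a & mask, (a >> n) & mask, a >> (2 * n)
    b0, b1, b2 = b & mask, (b >> n) & mask, b >> (2 * n)
    nxt = level + 1
    d0 = karatsuba3(a0, b0, nxt)
    d1 = karatsuba3(a1, b1, nxt)
    d2 = karatsuba3(a2, b2, nxt)
    d01 = karatsuba3(a0 ^ a1, b0 ^ b1, nxt)
    d02 = karatsuba3(a0 ^ a2, b0 ^ b2, nxt)
    d12 = karatsuba3(a1 ^ a2, b1 ^ b2, nxt)
    # Six sub-products instead of nine; subtraction is XOR in characteristic 2
    return (d0
            ^ ((d01 ^ d0 ^ d1) << n)
            ^ ((d02 ^ d0 ^ d1 ^ d2) << (2 * n))
            ^ ((d12 ^ d1 ^ d2) << (3 * n))
            ^ (d2 << (4 * n)))


def reduce_mod(c):
    """Reduce a polynomial (degree up to 324) modulo P(x)."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= POLY << (i - M)            # cancels bit i, folds it downward
    return c


def f_square(a):
    """Squaring: insert a zero between consecutive coefficients, then reduce."""
    spread = 0
    for i in range(M):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    return reduce_mod(spread)


def f_mul(a, b):
    """Field multiplication: Karatsuba product followed by reduction."""
    return reduce_mod(karatsuba3(a, b))
```

Because squaring over GF(2) only spreads bits and reduction is a fixed XOR network, both map naturally to purely combinational logic, which is consistent with generating them from a script.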

RESULTS
In (Table 4), we present the respective estimated number of cycles, required for each part of the algorithm of Fig. 1, at each stage of the FSM controller.
The occurrences of the different basic operations required in all the FSM stages are listed in (Table 5).
Our main contribution concerns the execution of any basic field operation in just one cycle, taking into account that lost cycles may occur when the output of a component is fed back into the same component in the next iteration, and the use of non-clocked components, which reduces the overall amount of clock driving and registered inputs/outputs. The total number of cycles is equal to:
The speed of the implementation depends on the target device family, mainly on devices having enough slices and Input/Output pads. Our design is implemented on the xc5vsx95t-3f1136, with the following parameters:
• Goals of optimization set to speed
• Optimization effort set to high
• Global Optimization Goal set to AllClockNets
The partition design summary is presented in (Table 6), while (Table 7) shows the inverse circuit design summary; the inverse circuit shares the field multiplier, the field squarer and the field adder with the full design.
Our architecture has outperformed one of the best architectures [6], as shown in (Table 8), where the performance (column 2) represents the time required for one complete scalar multiplication (over all the 163 bits of the scalar k), as per Eq. 1.

Table 8:
Reference                  Performance (µs)   Frequency (MHz)
[23]                       19.5500            153.900
Smyth et al. [24]          3720.0000          166.000
Sozzani et al. [25]        30.0000            416.700
Satoh and Takano [26]      190.0000           510.200
Sakiyama et al. [6]        12.0000            555.600
Present study (Fastest)    6.1799             561.136

Benchmark tests: Working with 163-bit operands and numbers of order 2^163 or more is not straightforward, and even checking the results is very cumbersome. For this reason, different Matlab scripts with input/output behavior similar to the VHDL code have been written in order to compare the execution steps as well as the final results; timing is not taken into consideration at this stage (emulation-style process). The benchmark tests have been done with the inputs of (Table 9) (in hexadecimal format) [22]. Fig. 3 and 4 show the intermediate results obtained from the hardware simulator Modelsim at different steps of the scalar multiplication, as indicated by the "k_counter" value (5th line of Fig. 3 and 4). Table 10 shows the output results of the ECC scalar multiplication for an arbitrary 163-bit value of k.
The implementation was intensively tested for different inputs of k and the obtained results were compared each time with the outputs of the different scripts written within Matlab. Both, hardware implementation and software emulation generated the same results over the 163 bits.

DISCUSSION
The main contribution of the present research concerns three major points:
• An optimal Finite State Machine (FSM) controlling all the components and minimizing empty cycles
• Optimization of the hardware inversion process, by reducing the number of different squarings from 162 to 21, leading to an inversion in just 32 cycles
• Separation of the data path routing from the control part, so that only the multiplier, the squarer, the adder and the modulo component need to be modified for the different curves of (Table 1)
The introduction of additional multipliers and squarers can speed up the design at the expense of hardware spreading inside the FPGA. In embedded processes, this "speed/space" choice is crucial and depends mainly on the type of application, the targeted space, the possible addition of extra future functions and finally the cost allocated to the project.
The results we obtained are very encouraging and will impact our decision on embedding larger encryption schemes, mainly the extension to the NIST proposed curves (193, 233, 283, 409 and 571) in a single FPGA, taking into account the use of two or more multipliers (tuned parallel design), the use of internal memories such as Block RAMs (optimized timing of memory accesses), the speed-up of the FSM and the use of different ECC hardware algorithms. These optimization schemes are constrained to minimize the parallel inputs of the design and to reduce routing circuitry, which dramatically decreases efficiency, lowers speed and increases power consumption.

CONCLUSION
We have presented a fast version of an ECC crypto-hardware based on a finite state machine, implemented on a XILINX FPGA xc5vsx95t-1136 device. We attained a frequency of 561.136 MHz, which allows the execution of 161786 scalar multiplications per second, compared to the remarkable research of [6], whose 555.6 MHz allows 80000 scalar multiplications per second.
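The relation between the reported clock frequency and throughput can be checked with simple arithmetic. The short sketch below derives the time per scalar multiplication and the clock cycles this implies; the cycle count is inferred here from the two reported figures and is not a value quoted from the design, and the variable names are illustrative only.

```python
# Back-of-the-envelope check linking frequency, throughput and cycle count.

freq_hz = 561.136e6         # reported clock frequency
rate_per_s = 161786         # reported scalar multiplications per second

time_us = 1e6 / rate_per_s          # time for one scalar multiplication, in microseconds
cycles = freq_hz / rate_per_s       # implied clock cycles per scalar multiplication (inferred)

print(f"{time_us:.3f} us per scalar multiplication, about {cycles:.0f} cycles")
```

Under these figures one scalar multiplication takes a little under 6.2 µs, which agrees with the performance column of Table 8 to within rounding.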
Our implementation can still become more competitive by introducing more optimization at the level of the multiplier and the squaring components.
The second main optimization in the present research concerns the modular inverse circuit, which attained a frequency of 777.341 MHz; this is 13 times faster than the implementation of [27], at the cost of a 1:2 increase in the number of slices on our side for 163-bit operands.