HARDWARE REALIZATION OF HIGH SPEED ELLIPTIC CURVE POINT MULTIPLICATION USING PRECOMPUTATION OVER GF(p)

Two new theoretical approaches for the hardware realization of high speed elliptic curve point multiplication over a prime field (GF(p)) are presented. These hardware implementations use multiple units of elliptic curve point doublers, point adders and multiplexers. The modular hardware approach used here provides high speed and scalability.

Our objective is to compute the scalar product kP, where P is a point on an elliptic curve over a prime field $F_p$ and k is an integer belonging to $Z_p$. We propose a fast hardware solution to Point Multiplication (PM) which makes use of hardware Point Doubler (PD) and Point Adder (PA) modules. We describe two different schemes for fast multiplication. In the first method we realize the design for a 't'-bit k and then extend the design to binary multiples of 't'. In the second method, multi-scalar multiplication is used and the desired result is selected using appropriate multiplexers.

BASIC SYMBOLS AND NOTATIONS
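Let the given scalar multiplier k be represented in binary as Equation (1):
$$k = (k_{t-1}\, k_{t-2} \ldots k_2\, k_1\, k_0)_2 \tag{1}$$
Here, t is the number of bits of k, that is, the size of k when stored in binary is t bits. In terms of these bits, k is given by:
$$k = 2^{t-1}k_{t-1} + 2^{t-2}k_{t-2} + \ldots + 2^{2}k_2 + 2k_1 + k_0 \tag{2}$$
In the light of Equation (2), the product kP can be expressed as Equation (3):
$$kP = \left(2^{t-1}k_{t-1} + 2^{t-2}k_{t-2} + \ldots + 2^{2}k_2 + 2k_1 + k_0\right)P = 2^{t-1}Pk_{t-1} + 2^{t-2}Pk_{t-2} + \ldots + 2^{2}Pk_2 + 2Pk_1 + k_0P \tag{3}$$
That is, Equations (4) and (5):
$$kP = B_{t-1}k_{t-1} + B_{t-2}k_{t-2} + \ldots + B_2k_2 + B_1k_1 + B_0k_0 \tag{4}$$
Where:
$$B_i = 2^{i}P \quad \text{for } i = 0, 1, 2, \ldots, t-1 \tag{5}$$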

Realization of $B_i k_i$ (First Method)
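Consider the term $B_i k_i$. The bit $k_i$ can be either 0 or 1. Therefore, multiplication by $k_i$ can be represented as Equation (6):
$$B_i k_i = \begin{cases} 0 & \text{if } k_i = 0 \\ B_i & \text{if } k_i = 1 \end{cases} \tag{6}$$
From Equation (6), $B_i k_i$ is equivalent to the logical AND operation, as in Equation (7):
$$B_i k_i = B_i \ \text{AND}\ k_i \tag{7}$$
The elliptic curve point $B_i$ over the prime field $F_p$ has two components:
$$B_i = (x_i, y_i)$$
where the size of each component is m bits. m is given by Equation (8):
$$m = \lceil \log_2 p \rceil \tag{8}$$
Thus the size of $B_i$ is 2m bits.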
In the hardware realization, ($B_i$ AND $k_i$) can be realized using an array of 2m AND gates. We can also make use of a 2m-input Controlled Buffer (CB) with an enable control input EB, as shown in Fig. 1. When EB = 0, the output is zero (2m bits) and when EB = 1, the output is $B_i = 2^{i}P$. Therefore, the Controlled Buffer (CB) shown in Fig. 1 realizes Equation (6).
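The gating behaviour of the Controlled Buffer can be sketched in software. The following is a minimal Python model, assuming that the point is packed into a single 2m-bit bus word with x in the upper m bits and y in the lower m bits; this packing and the names `pack_point` and `controlled_buffer` are illustrative assumptions, not part of the original design.

```python
def pack_point(x: int, y: int, m: int) -> int:
    """Pack the two m-bit coordinates of a point into one 2m-bit bus word.

    The packing order (x high, y low) is an assumption made for this sketch.
    """
    assert 0 <= x < (1 << m) and 0 <= y < (1 << m)
    return (x << m) | y


def controlled_buffer(bus: int, eb: int, m: int) -> int:
    """Model of a 2m-bit controlled buffer: AND every bus line with EB."""
    assert eb in (0, 1)
    width = 2 * m
    enable_mask = ((1 << width) - 1) * eb  # EB replicated on all 2m lines
    return bus & enable_mask


if __name__ == "__main__":
    b = pack_point(0x3A, 0x5C, m=8)
    print(hex(controlled_buffer(b, 1, m=8)))  # EB = 1: bus passes through
    print(hex(controlled_buffer(b, 0, m=8)))  # EB = 0: all lines forced to 0
```

In this sketch the all-zero output simply stands for "no contribution"; how the following Point Adder interprets that word is left to the adder design.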

Realization of kP
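From Equation (3), we can see that kP is obtained as a series of Point Additions. Our aim is to get kP using several 2-input Point Adders. To realize this, Equation (3) for kP is written as:
$$kP = Pk_0 + 2Pk_1 + 4Pk_2 + \ldots + 2^{t-1}Pk_{t-1} \tag{9}$$
The RHS of Equation (9) is grouped as follows:
$$kP = \left(\ldots\left(\left(Pk_0 + 2Pk_1\right) + 4Pk_2\right) + \ldots\right) + 2^{t-1}Pk_{t-1}$$
Then kP can be obtained as the cumulative sum of 2-input Point Adders. To get that, let us introduce the symbols $Q_1, Q_2, \ldots, Q_{t-1}$ as in Equations (10) to (13):
$$Q_1 = Pk_0 + 2Pk_1 \tag{10}$$
$$Q_2 = Q_1 + 4Pk_2 \tag{11}$$
$$Q_3 = Q_2 + 8Pk_3 \tag{12}$$
$$Q_{t-1} = Q_{t-2} + 2^{t-1}Pk_{t-1} \tag{13}$$
That is, Equation (14):
$$Q_i = Q_{i-1} + 2^{i}Pk_i \tag{14}$$
for i = 1, 2, …, t-1, with $Q_0 = Pk_0$.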
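From Equations (10) and (11), substituting for $Q_1$:
$$Q_2 = Pk_0 + 2Pk_1 + 4Pk_2 \tag{15}$$
Similarly, from Equations (12) and (15):
$$Q_3 = Pk_0 + 2Pk_1 + 4Pk_2 + 8Pk_3 \tag{16}$$
In this way, we can see that:
$$Q_{t-1} = Pk_0 + 2Pk_1 + 4Pk_2 + \ldots + 2^{t-1}Pk_{t-1} \tag{17}$$
The RHS of Equation (17) is the same as kP as given by Equation (3). Thus kP is realized as the Point Sum of $Q_{t-2}$ and $2^{t-1}Pk_{t-1}$. That is, Equation (18):
$$kP = Q_{t-1} = Q_{t-2} + 2^{t-1}Pk_{t-1} \tag{18}$$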
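As a small worked instance of the cumulative sums, take the illustrative values t = 4 and k = 11; these numbers are chosen here only for illustration and do not appear in the original text.

```latex
\begin{align*}
k   &= 11 = (1011)_2, \qquad k_3 = 1,\; k_2 = 0,\; k_1 = 1,\; k_0 = 1 \\
Q_1 &= Pk_0 + 2Pk_1 = P + 2P = 3P \\
Q_2 &= Q_1 + 4Pk_2 = 3P + 0 = 3P \\
Q_3 &= Q_2 + 8Pk_3 = 3P + 8P = 11P = kP
\end{align*}
```

Three Point Adders (t-1 = 3) are traversed even though $k_2 = 0$; the second adder simply receives a zero operand.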

HARDWARE REALIZATION FOR A 't'-BIT 'k'
The elliptic curve Point Multiplier is realized as shown in Fig. 2. The output kP is obtained as the Point Sum from the last Point Adder in a chain of (t-1) Point Adders. In Fig. 2, t cascaded Point Doublers (PD's) are used to generate $2P, 4P, \ldots, 2^{t-1}P, 2^{t}P$. The Controlled Buffers are denoted by CB in Fig. 2. They generate $2^{i}Pk_i$ for i = 0 to (t-1). The bit $k_i$ of k and $2^{i}P$ are the inputs to the corresponding CB. The output of each CB is one of the inputs to the corresponding Point Adder (PA). Equation (10) is realized by Point Adder PA$_1$. Similarly, PA$_2$ realizes $Q_2$ as in Equation (11). The last Point Adder PA$_{t-1}$ realizes Equation (13) to give $Q_{t-1}$, which is the desired output kP itself. The Point Multiplication Module (PMM) provides an additional output $2^{t}P$ from the last PD block. This output $2^{t}P$ is used for cascading, which will be described later.
The Point Doubling and Addition can be accomplished internally in either affine or projective coordinates. In the PMM described in Fig. 2, even if a bit $k_i$ = 0, we cannot remove Point Adder PA$_i$ because $k_i$ may not be zero the next time. The number of Point Adders is fixed at (t-1) to take care of all possible values of k. Therefore, the use of the Non-Adjacent Form (NAF) representation of k has no benefit in this scheme.
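The structure of the PMM (cascaded doublers, controlled buffers and a chain of adders) can be mirrored in a short software model. The sketch below is only a behavioural illustration: the toy curve $y^2 = x^3 + 2x + 3$ over GF(97), the base point (3, 6), the use of affine coordinates, and the representation of a disabled CB output as the point at infinity are all assumptions made here. It needs Python 3.8 or later for `pow(x, -1, p)`.

```python
# Toy curve y^2 = x^3 + 2x + 3 over GF(97); illustrative parameters only.
P_MOD, A_COEF = 97, 2
INF = None  # point at infinity, used here for the output of a disabled CB


def point_add(p1, p2):
    """Affine point addition (PA); also covers doubling and infinity."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if p1 == p2:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def point_double(p):
    """Point doubler (PD)."""
    return point_add(p, p)


def pmm(p, k, t):
    """Behavioural model of PMM_t: returns (kP, 2^t P) for a t-bit k."""
    assert 0 <= k < (1 << t)
    # t cascaded PDs produce 2P, 4P, ..., 2^t P.
    multiples = [p]
    for _ in range(t):
        multiples.append(point_double(multiples[-1]))
    # CB_i passes 2^i P when k_i = 1 and blocks it otherwise.
    gated = [multiples[i] if (k >> i) & 1 else INF for i in range(t)]
    # Chain of (t-1) PAs: Q_i = Q_{i-1} + 2^i P k_i, with Q_0 = P k_0.
    q = gated[0]
    for i in range(1, t):
        q = point_add(q, gated[i])
    return q, multiples[t]


def naive_multiply(p, k):
    """Reference result: add P to itself k times."""
    acc = INF
    for _ in range(k):
        acc = point_add(acc, p)
    return acc


if __name__ == "__main__":
    G = (3, 6)
    print(pmm(G, 11, 8)[0] == naive_multiply(G, 11))
```

The loop over the adders always runs t-1 times regardless of the bits of k, which mirrors the fixed adder count discussed above.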

Timing Analysis of the Proposed PMM
The running time of the PMM shown in Fig. 2 is determined in terms of the running times of the Point Doublers and Point Adders. All PD's are identical in structure and operation, and so are all PA's. Let D be the time (in an appropriate unit) required by a PD to complete the doubling action and let A be the time required by a PA for addition. D and A depend on the internal design of the PD's and PA's respectively (Hankerson et al., 2004; Ding et al., 2013).

Precomputation
Here, $P, 2P, \ldots, 2^{t}P$ are precomputed and readily available at the corresponding locations, so we need not consider the time taken by the PD's. Consider the time required to get the output $Q_1$ from PA$_1$ after applying the input k. The inputs $k_0, k_1, k_2, \ldots, k_{t-1}$ are applied simultaneously from a single register holding k. The time taken for signals to pass through the CB's is neglected compared to the time needed at the PA's. Initially, P, 2P, $k_0$ and $k_1$ are available at, say, $T_0$. Neglecting the time taken by the CB's, $Pk_0$ and $2Pk_1$ are available at the input of PA$_1$ at $T_0$ itself. Therefore, the output $Q_1$ will be ready at $T_0 + A$, where A is the time required by PA$_1$ to generate its output. Thus the transition delay at PA$_1$ is A units of time. After $Q_1$ is ready, the time required by PA$_2$ to process $Q_1$ and $4Pk_2$ to get $Q_2$ is again A. Therefore the total time from $T_0$ up to the time of getting $Q_2$ is A + A = 2A. Thus each PA along the chain adds a delay of A. Since there are (t-1) Point Adders in the additive chain, the total running time $T_1$ to get kP is given by Equation (19):
$$T_1 = (t-1)A \tag{19}$$
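The delay argument above can be checked with a few lines of Python. This is a minimal sketch under the same assumptions as the text (precomputed multiples, zero CB delay, $T_0 = 0$); the function name `pmm_ready_times` is hypothetical.

```python
def pmm_ready_times(t, a):
    """Ready time of each Q_i (i = 1..t-1), taking T0 = 0 and CB delay = 0."""
    ready = []
    q_prev_ready = 0  # P*k_0 is available at T0
    for _ in range(1, t):
        operand_ready = 0  # 2^i P k_i is precomputed, hence ready at T0
        q_ready = max(q_prev_ready, operand_ready) + a
        ready.append(q_ready)
        q_prev_ready = q_ready
    return ready


if __name__ == "__main__":
    times = pmm_ready_times(t=8, a=1)
    print(times)      # ready times of Q_1 .. Q_7 in units of A
    print(times[-1])  # total running time, (t-1)A
```

Each adder must wait for the previous partial sum, so the chain is strictly sequential even though all operands $2^{i}Pk_i$ are available from the start.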

Point Multiplication Module
The Point Multiplication hardware using multiple PD's and PA's can be represented by a modular block as shown in Fig. 3. The module is called $\mathrm{PMM}_t$, which stands for Point Multiplier Module; it gives kP, where 't' is the size of 'k' in bits. Thus $\mathrm{PMM}_8$ means that the size of 'k' is 8 bits. P is the given elliptic curve point of total size 2m bits.

POINT MULTIPLIER MODULES IN CASCADE
When 't' is large, the number of PD's and PA's in PMM t would also be large. The design and construction of such a large sized Point Multiplier Module becomes practically difficult and can be cumbersome. Therefore, when 't' is large, several smaller sized Point Multiplier Modules are cascaded to realize kP as follows.
Let the smaller size chosen be w bits. The binary representation of 'k' is partitioned into 'd' words of size 'w' bits each. The value of 'd' is given by Equation (20):
$$d = \lceil t / w \rceil \tag{20}$$
If 't' is not perfectly divisible by 'w', the binary representation of 'k' is padded with (dw - t) zeros on the left-hand side (De Dormole and Quisquater, 2007). The partition of 'k' into 'd' words is shown in Fig. 4. Let $K_0, K_1, \ldots, K_{d-1}$ be the decomposed binary words of 'k'. Now 'k' can be expressed in terms of $K_{d-1}, \ldots, K_1, K_0$ in base $2^{w}$ as Equation (21) (Shivkumar and Umamaheswari, 2014):
$$k = (K_{d-1} \ldots K_1\, K_0)_{2^{w}} \tag{21}$$
The numerical value of 'k' in terms of $K_{d-1}, \ldots, K_1, K_0$ can be expressed as:
$$k = 2^{(d-1)w}K_{d-1} + \ldots + 2^{2w}K_2 + 2^{w}K_1 + K_0 \tag{22}$$
Now, in the light of Equation (22), the product kP can be written as Equation (23):
$$kP = 2^{(d-1)w}PK_{d-1} + \ldots + 2^{2w}PK_2 + 2^{w}PK_1 + PK_0 \tag{23}$$
That is, Equations (24) and (25):
$$kP = P_{d-1}K_{d-1} + \ldots + P_2K_2 + P_1K_1 + P_0K_0 \tag{24}$$
Where:
$$P_i = 2^{iw}P \quad \text{for } i = 0, 1, \ldots, d-1 \tag{25}$$
$K_0, K_1, \ldots, K_{d-1}$ are of size w bits each and the RHS of Equation (23) has d terms. Therefore, d cascaded $\mathrm{PMM}_w$'s can realize Equation (23) to get kP, as shown in Fig. 5. Equation (23) can be expressed in terms of the partial sums $S_1, S_2, \ldots, S_{d-1}$ as in Equations (26) to (28):
$$S_1 = P_0K_0 + P_1K_1 \tag{26}$$
$$S_2 = S_1 + P_2K_2 \tag{27}$$
$$S_{d-1} = S_{d-2} + P_{d-1}K_{d-1} \tag{28}$$
From Equations (28) and (23) we see that, as in Equation (29):
$$kP = S_{d-1} \tag{29}$$
Realization of these partial sums is shown in Fig. 5.
The $P_i$'s are realized as $P_i = 2^{w}P_{i-1}$ for 1 ≤ i ≤ (d-1), with $P_0 = P$. The output of Point Adder PA$_1$ is $S_1$. The inputs to get $S_1$ are $P_1K_1$ and $P_0K_0$. The inputs to get $S_2$ are $P_2K_2$ and $S_1$, and so on. The output of the last Point Adder gives $S_{d-1}$, which is the same as kP. The additional output $2^{dw}P$ can be used for further cascading.
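The word decomposition and the partial sums can be sketched in Python as follows. To keep the example short, the elliptic curve group is replaced by a stand-in cyclic group: integers modulo a hypothetical prime order N, where the integer x stands for the point xG. The names `split_words`, `cascaded_pmm`, `make_toy_pmm` and `toy_add` are illustrative and not taken from the original.

```python
N = 1009  # order of the stand-in cyclic group (hypothetical value)


def toy_add(a, b):
    """Stand-in point adder: x*G + y*G is represented by (x + y) mod N."""
    return (a + b) % N


def make_toy_pmm(w):
    """Stand-in PMM_w: maps (point, word) to (word*point, 2^w * point)."""
    def pmm_w(x, word):
        return (word * x) % N, (x << w) % N
    return pmm_w


def split_words(k, t, w):
    """Split a t-bit k into d = ceil(t/w) words K_0..K_{d-1}, K_0 lowest."""
    assert 0 <= k < (1 << t)
    d = -(-t // w)  # ceiling division
    mask = (1 << w) - 1
    # Shifting past bit t-1 yields zeros, which is the left padding.
    return [(k >> (i * w)) & mask for i in range(d)]


def cascaded_pmm(p, k, t, w, pmm_w, add):
    """kP from d cascaded w-bit modules and d-1 adders (partial sums S_i)."""
    words = split_words(k, t, w)
    s, base = pmm_w(p, words[0])         # K_0 P_0 and P_1 = 2^w P_0
    for word in words[1:]:
        term, next_base = pmm_w(base, word)
        s = add(s, term)                 # S_i = S_{i-1} + P_i K_i
        base = next_base                 # P_{i+1} = 2^w P_i
    return s, base                       # kP and 2^{dw} P


if __name__ == "__main__":
    kp, carry = cascaded_pmm(1, 0xDEADBEEF, 32, 8, make_toy_pmm(8), toy_add)
    print(kp == 0xDEADBEEF % N, carry == pow(2, 32, N))
```

Any real point adder and w-bit multiplier module with the same interface could be passed in place of the stand-ins; in hardware the d modules operate in parallel, whereas this loop evaluates them one after another.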

Total Number of PD's and PA's in Cascaded PMM
Each $\mathrm{PMM}_w$ uses (w-1) PA's and 'w' PD's. There are 'd' $\mathrm{PMM}_w$'s and (d-1) additional PA's in Fig. 5. Therefore the total number of PD's is dw, which is equal to 't'. The number of PA's is d(w-1) + (d-1) = dw-1 = t-1. Since 't', the size of 'k', can go up to m (the size of p), the number of PD's can go up to m and the number of PA's up to (m-1).
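The resource count can be tabulated with a small helper. This is a minimal sketch; the function name `cascade_resources` is hypothetical.

```python
def cascade_resources(t, w):
    """Return (d, number of PDs, number of PAs) for d cascaded PMM_w's."""
    d = -(-t // w)                          # d = ceil(t / w)
    point_doublers = d * w                  # w PDs in each of the d modules
    point_adders = d * (w - 1) + (d - 1)    # internal PAs plus cascade PAs
    return d, point_doublers, point_adders


if __name__ == "__main__":
    for t in (32, 192, 256):
        print(t, cascade_resources(t, w=8))
```

When w does not divide t, the counts refer to the padded length dw rather than t, so they come out slightly larger than t and t-1.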

Timing Analysis of the Cascaded PMM
The timing of the cascaded PMM's is analyzed assuming that the multiples $2^{iw}P$, for i = 1, 2, …, are precomputed.

Precomputation
All inputs $K_0, K_1, \ldots, K_{d-1}$ are applied simultaneously. Consider the inputs $K_0P_0$ and $K_1P_1$ to PA$_1$ in Fig. 5. The delay due to $\mathrm{PMM}_w$(1), the $\mathrm{PMM}_w$ identified by 1 in Fig. 5, for the signal $K_0P_0$ is (w-1)A, as given by Equation (19).
Thus, Equations (30) and (31):
$$\text{delay}(K_0P_0) = (w-1)A \tag{30}$$
Similarly:
$$\text{delay}(K_1P_1) = (w-1)A \tag{31}$$
Therefore, both of them are available at the input of PA$_1$ after a delay of (w-1)A. Adding the delay of PA$_1$, the delay of $S_1$ is given by Equation (32):
$$\text{delay}(S_1) = (w-1)A + A = wA \tag{32}$$
Now, consider the inputs to PA$_2$, which are $S_1$ and $K_2P_2$. The delay of $K_2P_2$ due to $\mathrm{PMM}_w$(2) is Equation (33):
$$\text{delay}(K_2P_2) = (w-1)A \tag{33}$$
From Equations (32) and (33), both $S_1$ and $K_2P_2$ are available at the input of PA$_2$ after a delay of wA. To this, adding the delay in PA$_2$, we get Equation (34):
$$\text{delay}(S_2) = wA + A = (w+1)A \tag{34}$$
In this way, each PA in the adder chain adds a delay of A. Thus (d-1) PA's add a delay of (d-1)A. The initial delay at the input of PA$_1$ is (w-1)A. Hence the total delay of $S_{d-1}$ is:
$$\text{delay}(S_{d-1}) = (w-1)A + (d-1)A$$
Therefore the total delay of the signal kP is Equation (35):
$$\text{delay}(kP) = (w + d - 2)A \tag{35}$$
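The accumulation of delays along the cascade can be reproduced with a short calculation. The sketch below assumes, as in the text, that all words are applied at $T_0 = 0$ and that every $\mathrm{PMM}_w$ output is ready after (w-1)A; the function name `cascade_delay` is hypothetical.

```python
def cascade_delay(w, d, a):
    """Ready time of S_{d-1} when all d words are applied at T0 = 0."""
    module_ready = (w - 1) * a   # every K_i P_i is ready after (w-1)A
    s_ready = module_ready       # K_0 P_0, the first input of PA_1
    for _ in range(1, d):
        # Each cascade adder waits for both inputs, then adds its delay A.
        s_ready = max(s_ready, module_ready) + a
    return s_ready


if __name__ == "__main__":
    w, d, a = 8, 4, 1
    print(cascade_delay(w, d, a))  # (w + d - 2) * A
    print((w * d - 1) * a)         # single PMM_t with t = d*w, for comparison
```

For w = 8 and d = 4 the cascade needs 10A, against 31A for a single 32-bit adder chain, because the d modules work in parallel.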

Register Size Requirement for $\mathrm{PMM}_w$
In $\mathrm{PMM}_w$, let the size of P be N bits. Then, as in Equations (36) to (38):
$$\text{the size of } 2^{w}P \text{ will be } N + w \tag{36}$$
$$\text{the size of } 2^{2w}P \text{ will be } N + 2w \tag{37}$$
$$\ldots$$
$$\text{the size of } 2^{dw}P \text{ will be } N + dw \tag{38}$$
Here N is the NIST standard value for ECC. Thus, the sizes increase progressively for each succeeding stage and this should be taken care of during the realization of the modules. However, except for the register sizes, the modules are similar.

BASIC PRINCIPLE FOR AN 8-BIT 'k' WITHOUT CB (SECOND METHOD)
Let P be a given point on the elliptic curve E (F p ). Let 'k' be an 8 bit integer belonging to Z p . The objective is to generate kP as fast as possible.

Precomputation
Assuming that P is known in advance, we precompute 2P, 4P,…,128P using the Point Doublers. We also precompute the following additive terms using Point Adders as shown in Fig. 6. 3 P using P + 2P, 12P using 4P + 8P, 48P using 16P + 32P and 192P using 64P + 128P. In Fig. 6, PD is a Point Doubler and PA is a Point Adder. The hardware precomputation module uses 8 PD's and 4 PA's. These precomputed values are used later in the realization of the fast multiplier. The last PD in Fig. 6 generates 256P. This value is needed for further concatenation which will be described later.

Expression for kP
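The 8-bit integer k is written in binary as Equation (39):
$$k = (k_7\, k_6\, k_5\, k_4\, k_3\, k_2\, k_1\, k_0)_2 \tag{39}$$
The decimal value of k in terms of its binary digits is Equation (40):
$$k = 128k_7 + 64k_6 + 32k_5 + 16k_4 + 8k_3 + 4k_2 + 2k_1 + k_0 \tag{40}$$
Multiplying by P and grouping the terms in pairs, kP can be rewritten as Equation (47):
$$kP = Q_3 + Q_2 + Q_1 + Q_0 \tag{47}$$
where $Q_3 = 128Pk_7 + 64Pk_6$, $Q_2 = 32Pk_5 + 16Pk_4$, $Q_1 = 8Pk_3 + 4Pk_2$ and $Q_0 = 2Pk_1 + Pk_0$. Now let us consider $Q_0$ as given by Equation (45):
$$Q_0 = 2Pk_1 + Pk_0 \tag{45}$$
Its four possible values are given by Equation (48):
$$Q_0 = \begin{cases} 0 & \text{when } k_1 = 0 \text{ and } k_0 = 0 \\ P & \text{when } k_1 = 0 \text{ and } k_0 = 1 \\ 2P & \text{when } k_1 = 1 \text{ and } k_0 = 0 \\ 2P + P = 3P & \text{when } k_1 = 1 \text{ and } k_0 = 1 \end{cases} \tag{48}$$
This can be written in a tabular form as shown in Table 1.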
From Equation (48) and Table 1, we see that $Q_0$ can be realized as the output of a 4×1 multiplexer with inputs 0, P, 2P and 3P, as shown in Fig. 7. Here, $k_1$ and $k_0$ are the binary select inputs. 2P and 3P are the precomputed multiples of P shown in Fig. 6.
Similar to the multiplexer of Fig. 7, three more multiplexers are used to generate $Q_1$, $Q_2$ and $Q_3$, as shown in Fig. 8.
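The multiplexer-based selection followed by the adder tree can be modelled as below. This is a behavioural sketch only: the toy curve $y^2 = x^3 + 2x + 3$ over GF(97), the base point (3, 6), affine coordinates and the use of the point at infinity for the multiplexer's 0 input are assumptions made here, and the function names are hypothetical. Python 3.8 or later is needed for `pow(x, -1, p)`.

```python
# Toy curve y^2 = x^3 + 2x + 3 over GF(97); illustrative parameters only.
P_MOD, A_COEF = 97, 2
INF = None  # point at infinity, used for the constant 0 input of each mux


def point_add(p1, p2):
    """Affine point addition (PA); also covers doubling and infinity."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if p1 == p2:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def fpm8(p, k):
    """Behavioural model of the 8-bit FPM: returns (kP, 256P)."""
    assert 0 <= k < 256
    # Precomputation: 8 doublings give 2P, 4P, ..., 256P.
    pw = [p]
    for _ in range(8):
        pw.append(point_add(pw[-1], pw[-1]))
    q = []
    for j in range(4):
        low, high = pw[2 * j], pw[2 * j + 1]
        # Mux inputs 0, 2^(2j)P, 2^(2j+1)P and their precomputed sum.
        mux_inputs = [INF, low, high, point_add(low, high)]
        select = (k >> (2 * j)) & 0b11  # bit pair (k_{2j+1}, k_{2j})
        q.append(mux_inputs[select])
    # Two-level adder tree: (Q0 + Q1) + (Q2 + Q3).
    return point_add(point_add(q[0], q[1]), point_add(q[2], q[3])), pw[8]


def naive_multiply(p, k):
    """Reference result: add P to itself k times."""
    acc = INF
    for _ in range(k):
        acc = point_add(acc, p)
    return acc


if __name__ == "__main__":
    G = (3, 6)
    print(all(fpm8(G, k)[0] == naive_multiply(G, k) for k in range(256)))
```

Only three point additions remain on the path from k to kP; everything else is table lookup, which is what the multiplexers provide in hardware.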

Timing Analysis of the FPM
In the FPM circuit of Fig. 8, all the 8 bits of the multiplier k are applied simultaneously to the multiplexers at say T 0 = 0. It is presumed that all the input signals to the multiplexers are readily available before T 0 . Hence we have to calculate the time delay due to multiplexers and Point Adders. Compared to the running time of a Point Adder, the time delay in a multiplexer, which is a combinational circuit, is negligibly small. Therefore we neglect the delay in multiplexers and we assume that the outputs of the multiplexers Q 0 , Q 1 , Q 2 and Q 3 are available to the input of adders at T 0 = 0.
Let the input-to-output transition delay of the Point Adder PA$_1$ be A in appropriate time units. The Point Adders are similar in design and construction, and therefore their delays are also the same. That is, the input-to-output transition delay of each Point Adder is taken as A. For PA$_1$, the output ($Q_0 + Q_1$) would be available after a delay of A. This can be expressed as Equation (49):
$$\text{delay}(Q_0 + Q_1) = A \tag{49}$$
Similarly, Equation (50):
$$\text{delay}(Q_2 + Q_3) = A \tag{50}$$
Therefore the inputs ($Q_0 + Q_1$) and ($Q_2 + Q_3$) to PA$_3$ are simultaneously available with a delay of A. To this, we add the delay in PA$_3$ to get the final delay as Equation (51):
$$\text{delay}(kP) = A + A = 2A \tag{51}$$
Thus, for an 8-bit FPM unit, the delay is 2A.

Comparison with a Conventional Multiplier
Consider the right-to-left binary method of Point Multiplication (De Dormole and Quisquater, 2007). With precomputation, the doubling time is eliminated and the running time is nA, where n is the Hamming weight of k, that is, the number of 1's in k. When the size of k is 8 bits, the maximum value of n is 8. Therefore the worst-case delay of the conventional method is 8A and the average-case delay is 4A. In our method, the running time is 2A. Thus our method is twice as fast as the conventional method in the average case and four times as fast in the worst case.
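The worst-case and average figures quoted above can be confirmed by enumerating all 8-bit multipliers. The sketch below counts one point addition per 1-bit of k, as the text does; the function names are hypothetical.

```python
def conventional_additions(k):
    """Point additions counted as in the text: one per 1-bit of k."""
    return bin(k).count("1")


def compare_8bit():
    """Return (worst case, average case, FPM delay) in units of A."""
    weights = [conventional_additions(k) for k in range(256)]
    worst = max(weights)
    average = sum(weights) / len(weights)
    fpm_delay = 2  # fixed two-level adder tree of the 8-bit FPM
    return worst, average, fpm_delay


if __name__ == "__main__":
    worst, average, fpm_delay = compare_8bit()
    print(worst, average, fpm_delay)
```

The FPM delay does not depend on k, whereas the conventional count varies with the Hamming weight.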

Hardware/Complexity
The inputs to each multiplexer are 4 elliptic curve points. Each point has two coordinates (x, y). The maximum size of each coordinate is m bits, where m is given by Equation (52):
$$m = \lceil \log_2 p \rceil \tag{52}$$
Here p is the prime of the prime field $F_p$. Therefore the size of each point is 2m bits. Hence the total number of signals at the input of each multiplexer will be 4 × (2m) = 8m. The 0 input to the multiplexer of Fig. 7 can be eliminated because it is the constant zero. Hence, externally, only 3 inputs of size 2m each have to be considered, and the overall number of input signals would be 3 × (2m) = 6m. For a 160-bit p, the number of input signals to a multiplexer would be 6 × 160 = 960 and the size of the output would be 2 × 160 = 320.

Fast Point Multiplication Module
The Fast Point Multiplication circuit shown in Fig. 8, along with the precomputation hardware, can be represented by a modular block as shown in Fig. 9. The module is called $\mathrm{FPM}_8$, which stands for Fast Point Multiplier Module for an 8-bit 'k'; it gives the output kP from the inputs 'P' and 'k'. The additional output $2^{8}P$ is for concatenation.

CONCATENATION OF FAST POINT MULTIPLIER MODULES
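When the size of 'k' is large, the 8-bit $\mathrm{FPM}_8$'s can be concatenated to realize kP. In the realization shown in Fig. 10, the size of 'k' is 32 bits, which is expressed in base 256 format as Equation (53):
$$k = (K_3\, K_2\, K_1\, K_0)_{256} \tag{53}$$
Here, $K_3$, $K_2$, $K_1$ and $K_0$ are 8 bits each. $K_0$ is the least significant byte and $K_3$ is the most significant byte. The value of k is given by Equation (54):
$$k = 256^{3}K_3 + 256^{2}K_2 + 256K_1 + K_0 \tag{54}$$
Therefore kP is given by Equation (55):
$$kP = 256^{3}PK_3 + 256^{2}PK_2 + 256PK_1 + PK_0 \tag{55}$$
In our scheme, 256P, $256^{2}P$, $256^{3}P$ and $256^{4}P$ are precomputed and readily available as shown in Fig. 10. Let us designate P, 256P, $256^{2}P$ and $256^{3}P$ by the symbols $P_0$, $P_1$, $P_2$ and $P_3$, that is, $P_i = 256^{i}P$ for i = 0, 1, 2, 3, so that:
$$kP = P_3K_3 + P_2K_2 + P_1K_1 + P_0K_0$$
Thus, kP is realized as the sum of $S_1$ and $S_2$:
$$kP = S_2 + S_1$$
where:
$$S_1 = P_1K_1 + P_0K_0$$
and:
$$S_2 = P_3K_3 + P_2K_2$$
Equations (61) to (63) are realized using three point adders as shown in Fig. 10. In the circuit of Fig. 10, the signals $K_0$, $K_1$, $K_2$ and $K_3$ are applied simultaneously.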
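The base-256 decomposition and the three-adder combination can be sketched as follows. As a simplification, the elliptic curve group is replaced by a stand-in cyclic group (integers modulo a hypothetical prime order N, where x stands for the point xG), and `toy_fpm8`, `toy_add` and `fpm32` are illustrative names rather than parts of the original design.

```python
N = 1009  # order of the stand-in cyclic group (hypothetical value)


def toy_add(a, b):
    """Stand-in point adder: x*G + y*G is represented by (x + y) mod N."""
    return (a + b) % N


def toy_fpm8(x, word):
    """Stand-in for an 8-bit FPM: returns (word*x, 256*x)."""
    assert 0 <= word < 256
    return (word * x) % N, (256 * x) % N


def fpm32(p, k, fpm8=toy_fpm8, add=toy_add):
    """kP for a 32-bit k from four 8-bit modules and three adders."""
    assert 0 <= k < (1 << 32)
    words = [(k >> (8 * i)) & 0xFF for i in range(4)]  # K_0 .. K_3
    terms, base = [], p                                # P_0 = P
    for word in words:
        term, base = fpm8(base, word)  # P_i K_i and P_{i+1} = 256 P_i
        terms.append(term)
    s1 = add(terms[1], terms[0])       # S_1 = P_1 K_1 + P_0 K_0
    s2 = add(terms[3], terms[2])       # S_2 = P_3 K_3 + P_2 K_2
    return add(s2, s1), base           # kP and 256^4 P


if __name__ == "__main__":
    kp, carry = fpm32(1, 0xDEADBEEF)
    print(kp == 0xDEADBEEF % N, carry == pow(256, 4, N))
```

In hardware the bases $P_0$ to $P_3$ are precomputed, so the four modules run in parallel; the loop here derives each base from the previous module's extra output only for compactness.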

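Latency of Signal kP for $\mathrm{FPM}_8$ and $\mathrm{FPM}_{32}$
From Equation (51), we know that the latency of each $\mathrm{FPM}_8$ is 2A. Therefore the latencies of $P_1K_1$ and $P_0K_0$ are given by Equation (64):
$$\text{latency}(P_1K_1) = \text{latency}(P_0K_0) = 2A \tag{64}$$
To this, the latency of PA-1 is added to get Equation (65):
$$\text{latency}(S_1) = 2A + A = 3A \tag{65}$$
Similarly, the latency of $S_2$ is Equation (66):
$$\text{latency}(S_2) = 2A + A = 3A \tag{66}$$
Therefore the latency of kP = $S_2 + S_1$ is Equation (67):
$$\text{latency}(kP) = 3A + A = 4A \tag{67}$$
Thus, for a 32-bit Fast Point Multiplier ($\mathrm{FPM}_{32}$), the overall latency is 4A.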

Extension of FPM's to 128 Bits and 512 Bits
The hardware presented in Fig. 10 realizes kP for a 32-bit k. This circuit can be called $\mathrm{FPM}_{32}$. Similar to the circuit of Fig. 10, four $\mathrm{FPM}_{32}$'s can be concatenated to realize $\mathrm{FPM}_{128}$, which realizes kP when the size of 'k' is 128 bits. The latency of this would be 4A + 2A = 6A. Similarly, four such $\mathrm{FPM}_{128}$'s can be concatenated to get $\mathrm{FPM}_{512}$, which gives kP for a 512-bit k. Here, the latency would be 6A + 2A = 8A.
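The latency pattern can be written as a short recursion: each level of concatenation combines four modules through a two-level adder tree and therefore adds 2A. This is a minimal sketch; the function name `fpm_latency` is hypothetical.

```python
def fpm_latency(bits, a=1):
    """Latency of an FPM for 'bits'-bit k, built four modules per level."""
    if bits == 8:
        return 2 * a                      # 8-bit FPM, Equation (51)
    assert bits % 4 == 0 and bits > 8     # valid sizes are 8, 32, 128, 512, ...
    return fpm_latency(bits // 4, a) + 2 * a


if __name__ == "__main__":
    for bits in (8, 32, 128, 512):
        print(bits, fpm_latency(bits))
```

The latency grows by 2A each time the operand size is multiplied by four, so it is logarithmic in the size of k.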

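Register Size Requirement for $\mathrm{FPM}_8$ and $\mathrm{FPM}_{32}$
In $\mathrm{FPM}_8$, let the size of P be N bits. Then, as in Equations (68) to (71), the size of 256P will be N + 8 bits, and the sizes of the higher multiples $256^{2}P$, $256^{3}P$ and $256^{4}P$ follow the same pattern (N + 16, N + 24 and N + 32 bits respectively). Here N is the NIST standard value for ECC. Here also, the sizes increase progressively. In the case of $\mathrm{FPM}_{32}$, the register sizes are calculated similarly and implemented.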

COMPARISON WITH OTHER METHODS
In our proposed hardware realization, a large number of PD's and PA's are used. Since the PD modules used are identical in design and characteristics, it is easy to replicate and integrate them. Similarly, PA modules can be replicated and integrated, and the same holds for the Controlled Buffers. This makes the Field Programmable Gate Array (FPGA) implementation of elliptic curve point multiplication easy and efficient. This type of modular approach has not been attempted earlier. In our method, the numbers of PD's and PA's used are m and (m-1) respectively, which are relatively large. For example, the NIST standard for m specifies one of the values from the set {192, 224, 256, 384, 521}. In our method, all the bits of k are applied simultaneously. Thereby, shifting of the bits of k one at a time is avoided (Kumar, 2006; Schinianakis et al., 2009; Portilla et al., 2010; Jacob et al., 2013). This saves 't' clock cycles, where 't' is the size of 'k' in bits.

CONCLUSION
Two new theoretical hardware modules for Elliptic Curve Point Multiplication are described. $\mathrm{PMM}_w$'s and $\mathrm{FPM}_8$'s provide fast multiplication and they can be easily cascaded to realize point multiplication for larger values of k. $\mathrm{PMM}_w$ (w = 8) and $\mathrm{FPM}_8$ use available Point Addition and Point Doubling sub-modules. Our proposed methods are therefore faster than the conventional methods. Compared to the first method, $\mathrm{PMM}_8$, the second method, $\mathrm{FPM}_8$, is faster, even though it requires more register space. From these modules we can create a macro model for the realization of elliptic curve point multipliers for very large k. The techniques described can be modified for Point Multiplication over a binary field.
These methods require large register space for storing the precomputed multiples of P, as discussed in sections 4.3 and 7.3. But the modules are similar and can be easily replicated.
Our proposed scalar multiplication modules are easily scalable and can be used independently or as sub modules in an elliptic curve crypto system.

In future, Fast Elliptic Curve Point Multiplication using Balanced Ternary Representation and Precomputation over GF(p) can be investigated. The existing investigation can be extended to address varied design parameters like speed, power and area.

ACKNOWLEDGEMENT
The researcher would like to thank the Chairman Dr. R N Shetty, Director Dr. H N Shivashankar, Principal Dr. M K Venkatesha of RNS Institute of Technology for their constant support and encouragement and also to thank her professor N Bhaskara Rao, for his guidance and helpful comments in the study.