Optimizing {0, 1, 3}-NAF Recoding Algorithm Using Block-Method Technique in Elliptic Curve Cryptosystem

The most expensive and time-consuming operation in an elliptic curve cryptosystem (ECC) is scalar multiplication, so optimizing it substantially enhances ECC performance. Scalar multiplication can be improved by using an enhanced scalar recoding algorithm that decreases the number of operations in the scalar representation process. The objective of this research is to introduce an efficient design and implementation of the {0, 1, 3}-NAF scalar recoding algorithm by applying the block method technique. The base algorithm has a complex look-up table. When the block method is applied to the base algorithm, the complex look-up table is no longer needed; instead, a fixed look-up table is introduced that requires less computation for recoding. Big-O notation is used to measure the complexity, and the running time in microseconds (µs) is used to evaluate the base and proposed algorithms.


Introduction
The efficiency of the scalar multiplication operation directly affects the performance of ECC (Kodali et al., 2013). Scalar multiplication involves three levels of computation: scalar arithmetic, point arithmetic and field arithmetic, which are illustrated in Fig. 1.

Level 1: Scalar Arithmetic
This computational level involves scalar representation and the scalar recoding technique. A scalar recoding technique is required in order to recode the scalar into the selected number representation, and it should reduce the Hamming weight of the scalar k.
According to the literature, there are different bases in which to represent the scalar k. Fundamentally, base 2 is known as the natural representation, and the scalar k in binary and NAF form is represented in this base. Joye and Yen (2002) used a different base to represent the scalar k. Reducing the Hamming weight enhances scalar multiplication performance, since fewer addition and doubling operations are required (Shah et al., 2010). Scalar recoding is used to recode a scalar k into a different representation with a lower Hamming weight (Yasin et al., 2014). The result of this recoding can have the same magnitude as the scalar or smaller (Longa and Gebotys, 2010).

Level 2: Point Arithmetic
Points can be obtained from an elliptic curve. The Koblitz curve is a special family of curves on which point multiplication is considerably faster than on a generic curve (Sakthivel and Nedunchezhian, 2012).

Level 3: Field Arithmetic
Binary and prime fields have different procedures and costs. Point operations are executed by means of finite field operations, so the efficiency of this level is essential (Hitchcock et al., 2003; Morales-Sandoval and Feregrino-Uribe, 2006). Inversion is the most expensive field operation, followed by multiplication and then squaring.
Among all these operations, the scalar arithmetic level is the most expensive (Aranha et al., 2012), and according to the literature (Hankerson et al., 2003) an enhancement in this operation will significantly increase the efficiency of ECC.
Accordingly, improving the first two levels will lead to a significant increase in the efficiency of scalar multiplication. Scalar recoding can be improved by employing an enhanced scalar recoding algorithm that decreases the number of operations and requires less running time in the scalar representation process (Bafandehkar et al., 2015).
Thus, in this study we focus on optimizing the scalar arithmetic algorithm. This objective can be achieved by proposing a new algorithm with lower complexity.

{0, 1, 3}-NAF Algorithm
Md Yasin (2011) introduced a scalar representation algorithm that converts a binary expansion into {0, 1, 3}-NAF. This recoding algorithm is intended specifically for binary numbers that have adjacent nonzero digits in their representation. The scalar representation is in base 2 and uses the digits 0, 1 and 3, and it adopts the special NAF property. The recoding method works in left-to-right mode, based on the technique proposed by Joye and Yen (2000). A look-up table has been proposed to simplify the recoding technique. The non-adjacency property for each row of the look-up table is proven using the same technique as Joye and Yen (2000).
This method operates in real time. It takes a homogeneous approach to recoding, so the recoded scalar can be used straight away by the scalar multiplication algorithm. This is possible because the digits of the scalar are scanned in the same direction, left to right, for both recoding and scalar multiplication. According to the literature, this type of recoding promotes better memory usage and is mostly preferred for memory-constrained devices (Khabbazian et al., 2005). In the heterogeneous approach, by contrast, the recoded digits are saved before they are used in the scalar multiplication algorithm, because the scalar is scanned in different directions: right to left for recoding and left to right for scalar multiplication. Generally, this type of recoding needs an additional n-bit RAM for storage, where n is the bit size of the scalar. Avoiding this storage is one of the advantages of the {0, 1, 3}-NAF algorithm over recoding algorithms that scan and recode from right to left. Figure 2 shows a flowchart that illustrates the steps of the {0, 1, 3}-NAF algorithm.
Note: X = 0 or 1, and $\lfloor \cdot \rfloor$ denotes the floor function, which gives the largest integer less than or equal to $(b_{i+1} + r_i + r_{i-1})/2$.
Reitwiesner (1960) introduced the NAF representation with radix r = 2, where each digit $a_i \in \{-1, 0, 1\}$ must satisfy $a_i \cdot a_{i+1} = 0$ for all $i \ge 0$. The NAF can also be written as $(a_{n-1} \ldots a_0)_{\mathrm{NAF}}$. The NAF is unique, with an average Hamming weight of $l/3$, where $l$ is the bit length of the NAF representation. In the literature, the NAF is commonly used and efficient for elliptic curve scalar multiplication. The traditional NAF also uses a left-to-right scanning method, which reduces the complexity load of the algorithm.
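The NAF conditions stated above can be checked with a short sketch. It uses the common textbook procedure that processes k from its least significant bit, which is an assumption of this example and not necessarily the scanning order discussed in the text; the function names are illustrative.

```python
def naf(k: int) -> list[int]:
    """Return NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            # Choose +1 when k = 1 (mod 4) and -1 when k = 3 (mod 4),
            # so that the next digit is forced to be 0.
            d = 2 - (k % 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits


def hamming_weight(digits: list[int]) -> int:
    """Number of nonzero digits."""
    return sum(1 for d in digits if d != 0)


def is_non_adjacent(digits: list[int]) -> bool:
    """True when a_i * a_(i+1) == 0 for every pair of neighbours."""
    return all(digits[i] * digits[i + 1] == 0 for i in range(len(digits) - 1))


print(naf(7))  # [-1, 0, 0, 1], i.e. 7 = 8 - 1
```

For k = 7 the binary form 111 has Hamming weight 3, while the NAF has weight 2, which is the kind of reduction that recoding aims for.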

Fig. 3. Flowchart of block method algorithm
On the other hand, Pathak and Shanghi (2010) introduced a blocking technique that improves NAF conversion by using a fixed look-up table. The look-up table contains the equivalent value, in NAF representation, of each binary number. Since the given binary number is partitioned into blocks of 8 bits, the table requires $2^8 = 256$ entries, holding the equivalent values for the numbers 0 to 255. After the given binary number has been partitioned and each block has been replaced by its equivalent value from the table, the blocks must be combined to compute the final result. The authors conclude that this technique reduces the number of iterations compared with the traditional NAF. Figure 3 shows the flowchart that illustrates the steps of the block method algorithm.
In this method, the equivalent NAF value of a given $X = (1000111011010)_2$ is computed with the following steps:
• Partition the input binary number into N blocks of 8 bits, starting from the right side, and pad with '0' digits to complete the last 8-bit block
• Represent each binary block by its equivalent {-1, 0, 1}-NAF representation from Table 2
• Combine the blocks, performing the boundary addition between the Most Significant Bit (MSB) of the lower block and the Least Significant Bit (LSB) of the upper block, to obtain the final NAF
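The partitioning step can be sketched as follows. This is a minimal illustration that assumes the input is given as a string of '0' and '1' characters; the function name `partition_blocks` is hypothetical and does not come from the original work.

```python
def partition_blocks(bits: str, block_size: int = 8) -> list[str]:
    """Split a binary string into fixed-size blocks, starting from the right.

    The leftmost block is padded on its left with '0' digits so that
    every block has exactly block_size bits.
    """
    pad = (-len(bits)) % block_size          # zeros needed to fill the last block
    padded = "0" * pad + bits
    return [padded[i:i + block_size] for i in range(0, len(padded), block_size)]


blocks = partition_blocks("1000111011010")
print(blocks)  # ['00010001', '11011010']
```

The 13-bit example input yields two blocks with the decimal values 17 and 218, and 17 · 256 + 218 = 4570 recovers the original number.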

The Proposed Method
The proposed method begins by creating a look-up table that contains 256 numbers, starting from 0, with their equivalent values in {0, 1, 3}-NAF representation. The other properties of this table are similar to those of Table 2.
The focus of this research is to improve the performance of the {0, 1, 3}-NAF recoding algorithm (Md Yasin, 2011) by applying the blocking technique introduced by Pathak and Shanghi (2010). For this method, a new look-up table has been set up that contains the equivalent value of each binary number in {0, 1, 3}-NAF representation.
According to Fig. 4, the given binary number is partitioned into blocks of 8 bits. The table therefore needs $2^8 = 256$ entries, and the proposed look-up table contains the equivalent values for the decimal numbers 0 to 255.
Algorithm 1 works as follows. Line 1 initializes the variables i and j to 0, and the variable m holds the length of the input. Line 2 starts an iteration that loops through the bits of the input r. Line 3 defines another iteration that loops through 8 bits from left to right, and line 4 checks whether the index j is greater than or equal to the length of the input (that is, the index has passed through all the bits), in which case the iteration terminates. Line 7 defines a two-dimensional container that holds the n partitions of the input, each 8 bits in size. Line 9 defines a key look-up as a list $[r_k]$ that maps each partition to its equivalent value in the look-up table and stores it in $r'_k$. Line 10 returns r recoded into {0, 1, 3}-NAF representation.
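The flow of partitioning, table look-up and consecutive placement can be sketched as below. The contents of the authors' look-up table are not reproduced in this text, so `recode_byte` is a stand-in rule assumed only for illustration: it replaces each pair of adjacent 1 bits, scanned from the right, with the digits 3 and 0, which preserves the value because 3 · 2^i = 2^i + 2^(i+1). All names are hypothetical, and the stand-in does not address non-adjacency across block boundaries.

```python
def recode_byte(value: int) -> str:
    """Stand-in recoding of an 8-bit value into digits {0, 1, 3}, MSB first."""
    digits = []                      # collected least significant digit first
    i = 0
    while i < 8:
        bit = (value >> i) & 1
        nxt = (value >> (i + 1)) & 1 if i + 1 < 8 else 0
        if bit and nxt:
            digits += ["3", "0"]     # bits i and i+1 become the digit pair (0, 3)
            i += 2
        else:
            digits.append(str(bit))
            i += 1
    return "".join(reversed(digits))


# Fixed table built once: one 8-digit entry for each value 0..255.
TABLE = [recode_byte(v) for v in range(256)]


def recode(bits: str) -> str:
    """Partition into 8-bit blocks from the right, look each up, concatenate."""
    pad = (-len(bits)) % 8
    padded = "0" * pad + bits
    blocks = [padded[j:j + 8] for j in range(0, len(padded), 8)]
    return "".join(TABLE[int(block, 2)] for block in blocks)


def value_of(digits: str) -> int:
    """Evaluate a base-2 digit string whose digits may be 0, 1 or 3."""
    total = 0
    for d in digits:
        total = 2 * total + int(d)
    return total


recoded = recode("1000111011010")    # 4570 in decimal
print(recoded, value_of(recoded))    # 0001000103003010 4570
```

Because each block needs only one table access, the work per block is constant, and the total work grows with the number of blocks.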
To visualize the processes in the proposed algorithm, Fig. 5 presents a flowchart of the steps for converting binary input data into {0, 1, 3}-NAF representation. For example, for the given $X = 4570_{10} = (1000111011010)_2$, the proposed method takes the following steps:
Step 1: Partition the input binary number into N blocks of 8 bits, starting from the right side
Step 2: Represent each binary block by its equivalent {0, 1, 3}-NAF using look-up Table 3
Step 3: Place the converted blocks consecutively, in their original order
The value of the final answer in decimal can be computed as follows:

Performance Analysis
Algorithm analysis is defined as predicting the resources that an algorithm needs to perform its computations. Although computer resources include memory, communication bandwidth and computer hardware, the crucial concern is the computational time, which must be measured. Commonly, the most efficient algorithm for a problem is identified by analysing several candidate algorithms.
The running time of an algorithm for a certain input is defined as the number of operations or steps needed to perform a process. It is appropriate to define the notion of a step so that the analysis is as machine independent as possible. Executing each line of pseudocode requires a constant amount of time. One line might need a different amount of time than another, but the assumption is that each execution of the i-th line takes time $c_i$, where $c_i$ is a constant (Cormen et al., 2001).
The experimental results were used to evaluate and validate the performance of the base and proposed algorithms. Two metrics, complexity and running time, are explained in the following sections.

Complexity Analysis
Complexity is a characteristic of an algorithm that describes its run-time performance and memory usage and is expressed in Big-O notation. The efficiency of algorithms has been measured in terms of asymptotic complexity since 1973 (Gilberg and Forouzan, 2004).
The outcome of a complexity analysis is commonly the worst-case complexity of an algorithm, but this does not always correspond well to the running time. For example, a component of an algorithm may be executed many times, each time at a different cost.
Thus, in order to assess the efficiency of an algorithm in a real environment, the average running time needs to be measured (Foster, 1995).

Running Time Analysis
The performance of an algorithm depends heavily on the characteristics of the execution environment. Certainly, a machine with higher computational power can deliver better performance. Therefore, the experiments must be carried out in the same environment. As shown in Fig. 6, benchmarking has been used to compute the running time in microseconds. In this method, the start and end points of the code snippet are marked, and the difference between these two points is the running time of that operation.
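The start-mark and end-mark approach can be sketched as follows. Python's `time.perf_counter_ns` is used here as a stand-in timer, which is an assumption of this example and not the timing mechanism of the original experiment; the helper name `time_us` is hypothetical.

```python
import time


def time_us(func, *args, repeats: int = 1) -> float:
    """Average wall-clock time of func(*args) in microseconds."""
    start = time.perf_counter_ns()       # start mark
    for _ in range(repeats):
        func(*args)
    end = time.perf_counter_ns()         # end mark
    return (end - start) / repeats / 1_000


# Example: average time of summing 10,000 integers over 5 runs.
elapsed = time_us(sum, range(10_000), repeats=5)
print(f"{elapsed:.1f} µs")
```

Averaging over several repetitions reduces the influence of one-off delays on the reported figure.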

Implementation of the Proposed Method
This section discusses the implementation of the proposed algorithm for recoding binary numbers based on the {0, 1, 3}-NAF algorithm. The proposed algorithm has three stages:

Stage I: Look up Table Stage
A look-up table of 256 integer elements has been generated, covering the integers in the range 0-255. Each integer in this range corresponds to a look-up table element in {0, 1, 3}-NAF representation. Because the look-up table uses the same base as the {0, 1, 3}-NAF algorithm, the Hamming weight is the same as in the result of the base algorithm.

Stage II: Partitioning Stage
A block technique has been designed and developed to divide each binary input into n blocks of 8 bits. The partitioning starts from the rightmost side of the binary input. If the leftmost block is m bits shorter than 8 bits, it is padded on its left side with m zeros.

Stage III: Conversion Stage
Every 8-bit binary block has an equivalent value in {0, 1, 3}-NAF. This value has been calculated in advance and is stored in the look-up table. In this stage, each binary block is replaced by its {0, 1, 3}-NAF equivalent, and the resulting blocks are placed consecutively.

System Specification (Test Bed)
A machine with the following specification was used to carry out this experiment.

Results
This section presents two main analyses, covering both the base and the proposed algorithm:
• Algorithm complexity, to identify the more efficient algorithm
• Time performance, to compare the real-time performance of both algorithms in the same environment
As mentioned in Gilberg and Forouzan (2004), the most common way to compare the efficiency of two algorithms is to compute their Big O.
Moreover, according to Foster (1995), the behaviour of a program can be studied in order to obtain an execution profile of an implemented algorithm. Such an experiment helps to measure outside effects on the performance time, such as initialization time, idle time and the time required for each phase of the computation.
To obtain reliable results, the experiment must be performed on the same machine under similar conditions, and to increase the validity of the results it must be repeated several times. To address these concerns, the same machine, with the specifications given in Table 4, has been used. The system was kept in the same state, and the experiment was repeated several times to show the difference between the two algorithms for different numbers of runs.
The details of each analysis are presented in the following sub-sections.

Complexity Analysis
In this analysis, the complexity of the base and proposed algorithms has been computed and compared. The complexity of an algorithm M is a function f(n) that gives the running time required for input data of size n. If the algorithm contains no loop, f depends on the number of statements; otherwise, f depends on the number of elements processed in the loop (Gilberg and Forouzan, 2004). As explained in Gilberg and Forouzan (2004), different problems require different amounts of time depending on their complexity. Accordingly, the complexity of the base and proposed algorithms has been computed and is presented in Table 5.
According to Table 5, the growth rate of the computed function is proportional to the presented value of the function. Therefore, for the base algorithm, increasing the input size increases the number of required operations significantly, whereas for the proposed algorithm the number of required operations increases only linearly with the input size. This is because the proposed look-up table requires no condition checks and has no special cases. Moreover, to recode a binary number with the proposed algorithm, the result of each block is simply placed consecutively in order, and no further operation is required. The smaller number of operations is the reason for the lower complexity. Figure 7 compares the growth rate of operations in the base and proposed algorithms. It is clear that the number of operations in the base algorithm grows quadratically rather than linearly.
The Big O for the {0, 1, 3}-NAF algorithm and the proposed algorithm is presented in Table 6.
According to Table 6, and with respect to the order of magnitude of algorithm efficiency in Gilberg and Forouzan (2004), $O(n) < O(n^2)$, the efficiency gain of the proposed algorithm is significant. This reduction in complexity is expected to yield a significant improvement in running time.

Running Time
In this section, performance time is the run time computed using the benchmarking method. The Read Time Stamp Counter (RDTSC) instruction has been used to obtain the performance time. For applications that require accurate time stamps, this instruction counts the number of processor cycles elapsed (Intel Corporation, 1997). The value returned by RDTSC indicates the number of processor cycles executed, not the number of seconds elapsed. Thus, to obtain the number of seconds elapsed, the returned value needs to be divided by the processor frequency. This process has been applied to the base and proposed algorithms, and the resulting performance times are analysed in the following sections.
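The conversion from a cycle count to elapsed time is a single division, sketched below. The cycle count and the 2.0 GHz frequency are hypothetical values chosen only for illustration, not measurements from this experiment, and the function name is an assumption.

```python
def cycles_to_microseconds(cycles: int, frequency_hz: float) -> float:
    """Convert a processor cycle count into microseconds.

    seconds = cycles / frequency, so microseconds = cycles * 1e6 / frequency.
    """
    return cycles * 1_000_000 / frequency_hz


# Hypothetical example: 6,000,000 cycles on a 2.0 GHz processor.
print(cycles_to_microseconds(6_000_000, 2.0e9))  # 3000.0
```

The conversion is only meaningful when the counter ticks at a known, constant frequency for the duration of the measurement.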
Following Senne et al. (2000), both algorithms are run 1, 5, 10 and 20 times. This method has the advantage of demonstrating the difference between the two algorithms over different numbers of runs. Table 7 gives the average performance time of the base and proposed algorithms for 1, 5, 10 and 20 runs.
Overall, the performance time decreased dramatically for the proposed algorithm. For a single run, the proposed algorithm shows an 87.6% speed-up in comparison with the base algorithm. Following Foster (1995), the experiment was repeated several times in order to improve the accuracy of the performance time calculation.
This replication minimizes the effect of background activities in the operating system. For 5 runs, the proposed algorithm shows an 86.8% speed-up over the base algorithm. For 10 runs, it shows an 86.6% speed-up, and for 20 runs it performs 87.3% faster than the base algorithm.
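A percentage of this kind can be derived from two average times as sketched below. The sketch assumes that the reported speed-up means the relative reduction in running time, which the text does not state explicitly, and the timing values are hypothetical and are not taken from Table 7.

```python
def time_reduction_percent(base_us: float, proposed_us: float) -> float:
    """Relative reduction in running time, in percent of the base time."""
    return (base_us - proposed_us) / base_us * 100.0


def speedup_factor(base_us: float, proposed_us: float) -> float:
    """How many times faster the proposed algorithm runs."""
    return base_us / proposed_us


# Hypothetical averages in microseconds, for illustration only.
base, proposed = 1000.0, 125.0
print(time_reduction_percent(base, proposed))  # 87.5
print(speedup_factor(base, proposed))          # 8.0
```

Under this reading, a reduction of p percent corresponds to a factor of 100 / (100 − p), so a reduction of about 87% means running roughly 7.7 times faster.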
For the average of 5 runs, the performance time of the base method rose to 3276 µs compared with a single run. Although this appears on the graph as a gentle increase, it is in fact an increase of approximately 4.4%. The average of 10 runs increased by 0.7% compared with 5 runs. The largest change was in the average of 20 runs, where the performance time increased by approximately 8.1% compared with 10 runs. The performance time of the proposed method also increased between 1 run and 20 runs, but the increase was steady. Figure 8 shows the results for the {0, 1, 3}-NAF and proposed algorithms, computed from the average performance time for 1, 5, 10 and 20 runs. Overall, the chart shows that the proposed method takes ∼10 times less time to convert the same data than the base method.
The base algorithm has only a look-up table stage, in which all the conversion computations are performed. In contrast, the proposed algorithm consists of a partitioning function and a fixed look-up table. Therefore, the total performance time of the base algorithm is spent in its look-up table, whereas in the proposed algorithm different functions are defined and each requires a different amount of time. Table 8 shows the time spent in the different functions of the proposed algorithm. These results help to compute the processing time and the total execution time. As the details indicate, the sum of the performance times of both functions in every set of runs is less than 100% of the total performance time. The missing time is called elapsed time here, and it is in the range of 0.2% to 0.8%. Following Foster (1995), this elapsed time is attributed to the initialization of the algorithm or to background activity of the OS.
Based on the information in Table 8, more than 70% of the total performance time of the proposed algorithm is spent on the partitioning stage. From this it can be concluded that the proposed look-up table takes less than 30% of the total performance time in each execution.

Conclusion
Algorithmic optimization is important for accelerating cryptography, and pre-computation techniques such as blocking are also significant.
A performance time problem in the {0, 1, 3}-NAF recoding algorithm has been stated. A new fixed look-up table has been generated and the blocking method has been applied. As a result, the complexity of the base algorithm has been reduced and the performance time has been improved. For benchmarking, a dataset composed of 24-bit binary numbers covering all possible Hamming weights was used. As a future research direction, and to extend this work, the algorithm can be adapted with minor modification to handle larger input sizes. The effect of resizing the blocks and extending the fixed look-up table on the performance enhancement can also be studied.