Low Power Hardware Implementation of High Speed FFT Core

Applications based on the Fast Fourier Transform (FFT), such as signal and image processing, require high computational power. This paper proposes a radix-4 parallel-pipelined FFT processor that incorporates a low-power commutator and a butterfly with a multiplier-less architecture. The proposed parallel-pipelined architecture offers high throughput and low power consumption. The multiplier-less architecture uses shift and addition operations to realize complex multiplications.


INTRODUCTION
Fourier transforms play an important role in many digital signal processing applications, including speech, signal and image processing. However, direct computation of the Discrete Fourier Transform (DFT) requires on the order of $N^2$ operations, where $N$ is the transform size. The FFT algorithm, first explained by Cooley and Tukey [1], opened a new area in digital signal processing by reducing the complexity of the DFT from order $N^2$ to order $N \log_2 N$.
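To make the two orders of growth concrete, the short Python sketch below evaluates $N^2$ and $N \log_2 N$ for a few transform sizes. The function names are illustrative only, and the figures are orders of magnitude rather than exact operation counts for any particular FFT architecture.

```python
import math

def direct_dft_order(n):
    """Order of the operation count for a direct DFT: N^2."""
    return n * n

def fft_order(n):
    """Order of the operation count for an FFT: N * log2(N).

    Assumes n is a power of two, so log2(n) is an exact integer.
    """
    return n * int(math.log2(n))

# Compare the two orders of growth for a few power-of-two sizes.
# For N = 16 this gives 256 versus 64.
for n in (16, 64, 256, 1024):
    print(n, direct_dft_order(n), fft_order(n))
```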
Parallel-pipelined FFTs are preferred when both high throughput and low power consumption are required. In real-time applications the input data arrive as a sequential stream, so a commutator is needed to reorder the input data. The architecture proposed in this paper improves the power efficiency of the commutator architectures previously used in pipelined FFTs [2]. It is well known that switching power is mainly responsible for power consumption in CMOS circuits. This power, $P_{sw}$, is given by
$$P_{sw} = \frac{1}{2} k C_{load} V_{dd}^{2} f \qquad (1)$$
where $k$ is the average number of times the gate makes an active transition during one clock cycle, $f$ is the clock frequency, $V_{dd}$ is the supply voltage and $C_{load}$ is the load capacitance of the gate. Hence, to achieve low power, one or more of the parameters $C_{load}$, $V_{dd}$ and $k$ must be minimized. However, since $C_{load}$ and $V_{dd}$ are determined by the target technology, $k$ becomes the main point of improvement. The implementation in [2] only reduces the switching activity. This paper therefore focuses on implementing a commutator with no switching activity, which yields a significant power saving compared with previous commutator architectures [3].
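The dependence of switching power on the activity factor in Eq. (1) can be illustrated with a small Python sketch. The function name and all numerical values below are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
def switching_power(k, c_load, v_dd, f):
    """Switching power of Eq. (1): P_sw = 0.5 * k * C_load * V_dd^2 * f.

    k      : average active transitions per clock cycle (dimensionless)
    c_load : load capacitance in farads
    v_dd   : supply voltage in volts
    f      : clock frequency in hertz
    Returns the power in watts.
    """
    return 0.5 * k * c_load * v_dd ** 2 * f

# Hypothetical gate parameters, for illustration only.
C_LOAD = 10e-15   # 10 fF (assumed)
V_DD = 1.2        # 1.2 V (assumed)
F_CLK = 833.333e6 # clock frequency in Hz

# With C_load, V_dd and f fixed, power scales linearly with k,
# and a gate that never switches (k = 0) dissipates no switching power.
for k in (1.0, 0.5, 0.0):
    print(k, switching_power(k, C_LOAD, V_DD, F_CLK))
```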

16-point 2-parallel-pipelined FFT:
The DFT of $N$ complex data points $x(n)$ is defined by
$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1,$$
where $W_N = e^{-j 2\pi / N}$.

Table 1: Dragonfly (butterfly) equations
0: $(a_r + c_r) + (b_r + d_r) + j((a_i + c_i) + (b_i + d_i))$
1: $(a_r - c_r) + (b_i - d_i) + j((a_i - c_i) - (b_r - d_r))$
2: $(a_r + c_r) - (b_r + d_r) + j((a_i + c_i) - (b_i + d_i))$
3: $(a_r - c_r) - (b_i - d_i) + j((a_i - c_i) + (b_r - d_r))$

For k = 0, 1, 2, 3 we get 16 equations for computing 16 points [4]. The radix-4 FFT requires only 64 multiplications and 192 additions. The flow graph of the 16-point FFT is shown in Fig. 1. The numbers inside the open circles in Fig. 1 refer to the equations in Table 1 that are used for the computation. The architecture shown in Fig. 2 is a 2-parallel-pipelined FFT. It achieves double the throughput of the pipelined FFT at the same operating frequency. As shown in Fig. 2, the input data are separated into two streams, even and odd, and sent to two commutators in stage 1. The four outputs from the two commutators are fed into each simplified butterfly unit. The butterfly unit computes the four equations given in Table 1 in one clock cycle. The coefficients are divided into two corresponding sections, even and odd, and the two coefficient sections are fed into two complex multipliers, respectively.
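The four equations of Table 1 can be checked against a direct 4-point DFT. The Python sketch below is a behavioural illustration only, not the hardware description used in this work. It assumes that a, b, c and d are the four butterfly inputs in natural order and that the reference DFT uses the kernel $e^{-j 2\pi nk/4}$; the function names are hypothetical.

```python
import cmath

def radix4_butterfly(a, b, c, d):
    """Evaluate the four Table 1 equations on complex inputs a, b, c, d."""
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    cr, ci = c.real, c.imag
    dr, di = d.real, d.imag
    x0 = complex((ar + cr) + (br + dr), (ai + ci) + (bi + di))
    x1 = complex((ar - cr) + (bi - di), (ai - ci) - (br - dr))
    x2 = complex((ar + cr) - (br + dr), (ai + ci) - (bi + di))
    x3 = complex((ar - cr) - (bi - di), (ai - ci) + (br - dr))
    return [x0, x1, x2, x3]

def direct_dft(x):
    """Reference DFT computed straight from the definition."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]

# Arbitrary test inputs (illustrative values only).
samples = [1 + 2j, 3 - 1j, -2 + 0.5j, 0.25 + 4j]
butterfly_out = radix4_butterfly(*samples)
reference_out = direct_dft(samples)
# Largest deviation between the two results; expected to be at rounding level.
print(max(abs(p - q) for p, q in zip(butterfly_out, reference_out)))
```

Only additions and subtractions appear in the butterfly itself, which is why the twiddle multiplications can be handled in a separate unit.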
In [3], as shown in Fig. 2, a shuffle unit is needed in stage 2 to implement the inter-stage data shuffle. Here, registers are used instead to store the intermediate values, which are then fed to butterfly stage 2. The output is stored in the same commutators.
Low power techniques
DR commutator:
For the commutator, previous implementation approaches include the shift register architecture (SR) [5] and the conventional dual-port RAM architecture (DR) [2]. In this paper, a new architecture based on dual-port RAM (DR) with no switching activity (no shifting of data) is used. The number of dual-port RAMs is reduced from six, as in [2], to four, and no multiplexer is needed. The RTL block diagram of the commutator (even and odd) with its FSM is depicted in Fig. 3; it reduces power and area compared with [2,6].

Low power butterfly:
The butterfly operation is the heart of the FFT algorithm. It takes data words from memory and computes the FFT. Low Power Butterfly (LB) architecture is employed to replace the conventional butterfly based on adder/subtracters.
Because 2's complement arithmetic is used, no separate subtracter is needed for subtraction. In a real-time implementation the imaginary part of the input is zero. Substituting 0 for $a_i$, $b_i$, $c_i$ and $d_i$ in Table 1 gives the simplified equations in Table 2.

Table 2: Equations for butterfly stage 1
0: $(a_r + c_r) + (b_r + d_r) + j0$
1: $(a_r - c_r) - j(b_r - d_r)$
2: $(a_r + c_r) - (b_r + d_r) + j0$
3: $(a_r - c_r) + j(b_r - d_r)$

In [3], 8 clock cycles are needed for butterfly stage 1 because one equation is computed at a time. Since we compute four equations at a time, 2 clock cycles are enough, which saves 6 clock cycles and improves performance compared with [3]. The equations in Table 1 are used for computing butterfly stage 2. The block diagrams for butterfly stage 1 and stage 2 are shown in Fig. 4 and Fig. 5, respectively.
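The stage 1 simplification for real-valued inputs can be sketched in a few lines of Python. This is a behavioural illustration of the Table 2 equations only. The function name is hypothetical, and sharing the intermediate sums and differences is a choice made for this sketch rather than a statement about the circuit in Fig. 4.

```python
def stage1_butterfly_real(ar, br, cr, dr):
    """Evaluate the four Table 2 equations for real inputs.

    Returns a list of (real, imaginary) integer pairs for outputs 0 to 3.
    """
    s = ar + cr   # shared sum of the first and third inputs
    t = br + dr   # shared sum of the second and fourth inputs
    u = ar - cr   # shared difference of the first and third inputs
    v = br - dr   # shared difference of the second and fourth inputs
    return [
        (s + t, 0),   # equation 0
        (u, -v),      # equation 1
        (s - t, 0),   # equation 2
        (u, v),       # equation 3
    ]

# Real inputs 1, 2, 3, 4 give 10, -2 + 2j, -2 and -2 - 2j,
# which is the 4-point DFT of that sequence.
print(stage1_butterfly_real(1, 2, 3, 4))
```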

Multiplier-less unit:
In synthesizing DSP architectures, it is important to minimize the silicon area of the integrated circuit, which is achieved by reducing the number of functional units (such as adders and multipliers), registers, multiplexers and interconnection wires. In FFTs, the conventional complex multiplier consists of four real multipliers, one adder and one subtracter. However, since the coefficients for all stages can be pre-computed, the complex multiplications can be replaced by shift and addition operations with common subexpression sharing, which reduces both area and power [7].
For example, the number of coefficients for the first stage of a 16-point FFT is 16. These coefficients are shown in Table 3. The multiplier-less unit shown in Fig. 10 consists of shift and addition operations with common subexpression sharing that replace the complex multiplications. In each coefficient pair, the first entry corresponds to the cosine function (the real part, $W_r$) and the second to the sine function (the imaginary part, $W_i$). A close look reveals that seven coefficients, (7fff, 0000) and (0000, 8000), are trivial: they are the quantized representations of (1, 0) and (0, -1), respectively, in 16-bit two's complement format. For these trivial coefficients no complex multiplication is necessary. Data multiplied by (7fff, 0000) can pass directly through the multiplier unit. For data multiplied by (0000, 8000), only an additional unit is needed, which swaps the real and imaginary parts of the input data and inverts the imaginary part.

The rest of the coefficients can be represented by three constants (7641, 5a82 and 30fb). For example, a multiplication by the constant a57d can be realized by first multiplying the data by 5a82 and then two's complementing the result. The other two constants (89be and cf04) can be realized in a similar manner, using the constants 7641 and 30fb, respectively. The constant 5a82 is represented in two's complement format, while 7641 and 30fb are represented in Canonical Signed-Digit (CSD) format, where $\bar{1}$ denotes the digit -1: 5a82 ($0101101010000010$), 7641 ($1000\bar{1}0\bar{1}001000001$) and 30fb ($010\bar{1}000100000\bar{1}0\bar{1}$). Shifters and adders based on the three constants carry out the nontrivial complex multiplications as follows:
$$\mathtt{5a82} \cdot X = (5X \ll 12) + (5X \ll 9) + (65X \ll 1)$$
$$\mathtt{7641} \cdot X = (X \ll 15) + 65X - (5X \ll 9)$$
$$\mathtt{30fb} \cdot X = (65X \ll 8) - (X \ll 12) - 5X$$
where $X$ is the input data. The shift-and-addition modules for the constants 5a82, 7641 and 30fb are shown in Fig. 6-8, respectively. The common subexpressions for the three constants are 101 (5) and 1000001 (65). The operations required before and after the common subexpression block are shown in Tables 4 and 5. Figure 9 shows the shift-and-addition module for the three constants in the multiplier-less unit. In total, 11 adders are used to compose the shift-and-addition module. In the multiplier-less unit, 22 adders replace the four real multipliers of the complex multiplier unit.
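The shift-and-add decompositions of the three constants can be checked numerically. The Python sketch below is an illustration under stated assumptions and not the VHDL used in this work: the function name is hypothetical, Python integers are unbounded so the 16-bit word length and any output scaling of the hardware are not modelled, and the adder count of the sketch is not meant to match Fig. 9.

```python
def shift_add_products(x):
    """Multiply x by 0x5a82, 0x7641 and 0x30fb using shifts and adds only.

    The common subexpressions 5x (binary 101) and 65x (binary 1000001)
    are computed once and shared by all three products.
    """
    x5 = (x << 2) + x    # 5x  = 4x + x
    x65 = (x << 6) + x   # 65x = 64x + x
    p_5a82 = (x5 << 12) + (x5 << 9) + (x65 << 1)
    p_7641 = (x << 15) + x65 - (x5 << 9)
    p_30fb = (x65 << 8) - (x << 12) - x5
    return p_5a82, p_7641, p_30fb

# Compare against ordinary multiplication for a range of signed inputs.
for x in range(-2048, 2048):
    assert shift_add_products(x) == (0x5A82 * x, 0x7641 * x, 0x30FB * x)

print(shift_add_products(1))  # (23170, 30273, 12539)
```

Because 5X and 65X feed all three products, each is computed once and reused, which is the saving that common subexpression sharing provides.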

Simulation results using ModelSim tool:
The FFT blocks are described in VHDL and simulated using the ModelSim tool; the results are shown below. VHDL is a hardware description language designed and optimized for describing the behavior of digital systems. It has many features appropriate for describing the behavior of electronic components ranging from simple logic gates to complete microprocessors and custom chips. Features of VHDL allow electrical aspects of circuit behavior (such as rise and fall times of signals, delays through gates and functional operation) to be precisely described. The resulting VHDL simulation models can then be used as building blocks in larger circuits (using schematics, block diagrams or system-level VHDL descriptions) for the purpose of simulation.
The simulation result for the commutator block is shown in Fig. 11, and the simulation result for the FFT module is shown in Fig. 12. Only eight cycles are used for the transform calculation; including data loading, twenty-four cycles are used.

Timing analysis:
The proposed 16-point FFTs are synthesized with a 1.2 ns clock period to maximize timing  [3] as shown in Table 8.
FFT core details:
The FFT core computes 16 points at a speed of 833.333 MHz. Table 9 shows detailed information about the FFT architecture.
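As a quick consistency check, the reported speed follows directly from the 1.2 ns clock period used for synthesis (the symbols $f_{clk}$ and $T_{clk}$ are introduced here only for this check):
$$f_{clk} = \frac{1}{T_{clk}} = \frac{1}{1.2\ \text{ns}} \approx 833.33\ \text{MHz}$$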

Area report:
The area report is shown in

CONCLUSION
In this project, a parallel-pipelined architecture for a 16-point radix-4 DIF FFT in fixed-point representation is proposed and implemented. Several novel low-power techniques are implemented: a multiplier-less unit, a DR commutator and an LB butterfly. With these techniques combined, low power is achieved by avoiding data transfers between RAMs and by holding the unused RAM outputs at their previous values in the commutator block.
The commutator block in this scheme consumes only 5.828 mW, roughly half of the 13 mW reported in previous work. This commutator reduces the number of write operations to the memory blocks. A low-power FFT processor is implemented by using the multiplier-less (shift-and-add) approach for multiplication by the twiddle coefficients.
This project presents a novel multiplier-less parallel-pipelined FFT processor architecture suitable for shorter FFTs. The design approach can also be applied to the last stages of longer FFTs. The multiplier-less architecture employs the minimum number of shift and addition operations to realize the complex multiplications, which reduces the power consumption of the multiplier by half.
The impact of parameterization on power/speed performance has been compared. Power savings of up to 52% were achieved compared with 16-point R4SDC pipelined FFTs. Previous papers implemented 32-bit (one-word) complex data, whereas our scheme handles 64-bit complex data and consumes 30.32 mW. These IP cores can also achieve a sample rate of up to 833 MHz, 3.3 times that of [2], which runs at 250 MHz.