# FFT in Hardware and Software - University of Calgary FFT in Hardware and Software Background Core Algorithm Original Algorithm, the DFT, O(n2) complexity New Algorithm, the FFT (Fast Fourier Transform), O(nlog2(n)) depending on implementation. DFT Computation A summation over the whole input array for every single element in the output array. A VERY computationally inefficient algorithm to implement. X () DFT ( x[n]) x[n]e n jn  FFT Computation A much more computationally efficient algorithm Works using the divide and conquer principle. First developed by Cooley and Tukey in 1965! DFT vs. FFT (Number of Operations) Problem Size (N) Standard DFT FFT % of DFT (smaller is better) (smaller is better) (smaller is better)

2 4 1 25 4 16 4 25 8 64 12 19 16 256 32 13 32 1024 80 8 64 4096 192 5

128 16384 448 3 256 65536 1024 2 512 262144 2304 1 1024 1048576 5120 <1 DFT vs. FFT Thousands 1200 1000 Percent of DFT Computation Time (Smaller is Better) 800 600 30% 400

200 0 0 200 400 600 800 1000 1200 Problem Size Thousands Computations Required Nearly Linear Growth of FFT (Smaller is Better) 6 5 4 Percent of DFT Computation Time Computations Required Exponential Growth of DFT (Smaller is Better) 25% 20% 15% 10% 5% 0% 0 3 200 400

600 Problem Size 2 1 0 0 200 400 600 Problem Size 800 1000 800 1000 1200 FFT Butterfly Operations Butterfly arrangement of computations Repeated on successive pairs of input data Then half as many times on alternating pairs Then half again as many times on every fourth element The Butterfly Simple operations repeated many times xe[n] X[n] WnN xo[n] X[n+N/2] -WnN

W nk N e j 2nk N 8-point FFT Demonstration The Entire Calculation Output Input Array x + + X x + + X x + + X x + + X

x + + X x + + X x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x + + X

x + + X x + + X x + + X x + + X x + + X x + + X

x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x + + X x + + X x + + X x + +

X x + + X x + + X x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x + +

X x + + X x + + X x + + X x + + X x + + X x + +

X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x + + X x + + X x + + X x +

+ X x + + X x + + X x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x +

+ X x + + X x + + X x + + X x + + X x + + X x +

+ X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x + + X x + + X x + + X x

+ + X x + + X x + + X x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x

+ + X x + + X x + + X x + + X x + + X x + + X x

+ + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x + + X x + + X x + + X

x + + X x + + X x + + X x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output

x + + X x + + X x + + X x + + X x + + X x + + X

x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x + + X x + + X x + +

X x + + X x + + X x + + X x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array

Output x + + X x + + X x + + X x + + X x + + X x + +

X x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x + + X x + + X x +

+ X x + + X x + + X x + + X x + + X x + + X Multiplication by W factor + Addition

Why Hardware? Even more speed for FFT Extremely parallelizable A whole layer can be done in two FPGA clock cycles 1 multiply cycle 1 add cycle (Assuming sufficient multipliers) Hardware Problems Complexity Input speed Output speed If the FPGA takes 24.4ns but takes 20s to transfer the input data, what gain is there? i.e. 24.4ns + 20s + 20s = ~40s! Mitigation of Hardware Problems Use a faster bus AMD Opterons Hypertransport 20.8 GB/s (166.4 Gb/s) per Link (V. 3) Modules that fit into an AMD 64-bit Opteron Socket http:// www.drccomputer.com/pages/modules.html xilinx based module http://www.xtremedatainc.com/xd1000_brief.html - altera based module Mitigation of Hardware Problems Put the FPGA on the die with the DSP Need silicon vendor support FPGA can access memory on a very wide bus (i.e. 128 bits per cycle) Implement the entire project in FPGA Time consuming to program Possibly insufficient room on the FPGA 8-point FFT Demonstration In Hardware Input Array

Output x + + X x + + X x + + X x + + X x + + X x + + X

x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration In Hardware Input Array Output x + + X x + + X x + + X

x + + X x + + X x + + X x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration In Hardware Input Array

Output x + + X x + + X x + + X x + + X x + + X x + + X

x + + X x + + X Multiplication by W factor + Addition 8-point FFT Demonstration In Hardware Input Array Output x + + X x + + X x + + X

x + + X x + + X x + + X x + + X x + + X Multiplication by W factor + Addition Why Not Software? Each butterfly must be done sequentially Only slight parallelism enabled by a DSP

like the TigerSHARC Each Butterfly can be done in 2 cycles (after optimization). Results of Testing Linear Profiling of FFT Algorithm in C++ Stage Cycle count Time 8-point 32-point 256-point 8-point 32-point 256-point Initialization 21 25 25 35.07ns 41.75ns 41.75ns Computation 6922 1135 1.895 s 11.559 s

290.950 s Butterfly 91 174222 151.97ns Results of Testing Profiling of VHDL on FPGA Butterfly takes 24.377ns to execute 62% is computational, 38% is routing on FPGA Product Offerings Most DSP Vendors Many FPGA Vendors (IP Intellectual Property) Microcontroller Vendors (i.e. Blackfin) FFTW The Fastest Fourier Transform in the West AMD Math Core Library Intel Library Highly Optimized for the expected hardware Published Results The Radix 4 version delivers a 1 K points complex processing time of 25 microseconds at 200-MHz system speeds and uses only about 10 percent of the resources in a mid-range Stratix device. The Radix 2 is half the size of the Radix 4 and offers a 1 K points complex processing time of 50 microseconds at 200MHz system speeds. Additional versions of the new cores are under development.  FFT IP Core Published Results  FFT/IFFT length Texas Instruments C6713 Single 4DSP FFT core

(Smaller is Better) Quad 4DSP FFT core (Smaller is Better) 256 12.3s 3.68s 920ns 512 27.3s 6.24s 1.56s 1024 60.2s 11.4s 2.85s References  Signals Systems and Transforms  James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297301 (1965).  http://www.drccomputer.com/pages/modules.html - xilinx based module  http://www.xtremedatainc.com/xd1000_brief.html altera based module  http://www.amd.com/us-en/Processors/DevelopWithAMD /0,,30_2252_2353,00.html  http://www.us.design-reuse.com/news/news5650.html  http://www.4dsp.com/fft.htm