FFT in Hardware and Software - University of Calgary
Background
  Core Algorithm
  - Original algorithm: the DFT, O(n^2) complexity.
  - New algorithm: the FFT (Fast Fourier Transform), O(n log2(n)), depending on implementation.

DFT Computation
  - A summation over the whole input array for every single element in the output array.
  - A very computationally inefficient algorithm to implement.
  - $X(\Omega) = \mathrm{DFT}(x[n]) = \sum_{n} x[n]\, e^{-j\Omega n}$

FFT Computation
  - A much more computationally efficient algorithm.
  - Works using the divide-and-conquer principle.
  - Published by Cooley and Tukey in 1965.

DFT vs. FFT (Number of Operations; smaller is better)
  Problem Size (N)   Standard DFT   FFT     FFT as % of DFT
  128                16384          448     3
  256                65536          1024    2
  512                262144         2304    1
  1024               1048576        5120    <1

[Figure: DFT vs. FFT. Panels: "Percent of DFT Computation Time (Smaller is Better)"; "Computations Required: Nearly Linear Growth of FFT (Smaller is Better)"; "Computations Required: Quadratic Growth of DFT (Smaller is Better)". Horizontal axis: Problem Size. Vertical axes: computations required (thousands) and percent of DFT computation time.]
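The operation counts in the DFT vs. FFT comparison can be reproduced with a short sketch. It assumes the counts are complex multiplications, N^2 for the direct DFT and (N/2) log2(N) for a radix-2 FFT; the slides do not state the formula, but these expressions match the tabulated values. The function names are illustrative, and the direct DFT evaluates the sum at the frequencies 2*pi*k/N, which is also an assumption about the intended sampling.

```python
import cmath


def naive_dft(x):
    """Direct O(N^2) DFT: one full pass over the input per output element."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def dft_multiplies(n_points):
    """Complex multiplications in the direct DFT: N per output, N outputs."""
    return n_points * n_points


def fft_multiplies(n_points):
    """Complex multiplications in a radix-2 FFT (assumed counting rule):
    N/2 butterflies per stage, log2(N) stages. N must be a power of two."""
    stages = n_points.bit_length() - 1  # log2(N) for a power of two
    return (n_points // 2) * stages


if __name__ == "__main__":
    for n_points in (128, 256, 512, 1024):
        d = dft_multiplies(n_points)
        f = fft_multiplies(n_points)
        print(f"N={n_points:5d}  DFT={d:8d}  FFT={f:5d}  ratio={100 * f / d:.2f}%")
```

The ratio column shrinks as N grows because the FFT count grows as N log2(N) while the direct count grows as N^2.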
FFT Butterfly Operations
  - Butterfly arrangement of computations.
  - Repeated on successive pairs of input data,
  - then half as many times on alternating pairs,
  - then half again as many times on every fourth element.

The Butterfly
  - Simple operations repeated many times.
  - Inputs $x_e[n]$ and $x_o[n]$; outputs:
    $X[n] = x_e[n] + W_N^{n}\, x_o[n]$
    $X[n + N/2] = x_e[n] - W_N^{n}\, x_o[n]$
  - Twiddle factor: $W_N^{nk} = e^{-j 2\pi nk / N}$

8-point FFT Demonstration
  The Entire Calculation
  [Diagram: the entire calculation, from input array to output.]
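The butterfly can be sketched in software as follows. This is a minimal recursive radix-2 decimation-in-time FFT, not the implementation profiled in the slides; the function names are illustrative, and the negative exponent in the twiddle factor is the usual convention, assumed here.

```python
import cmath


def twiddle(n, n_points):
    """W_N^n = exp(-j*2*pi*n/N), using the usual negative-exponent convention."""
    return cmath.exp(-2j * cmath.pi * n / n_points)


def butterfly(x_even, x_odd, w):
    """One butterfly: one multiply by W, then one add and one subtract.
    Returns the pair (X[n], X[n + N/2])."""
    t = w * x_odd
    return x_even + t, x_even - t


def fft_radix2(x):
    """Recursive radix-2 FFT. len(x) must be a power of two."""
    n_points = len(x)
    if n_points < 1 or n_points & (n_points - 1):
        raise ValueError("length must be a power of two")
    if n_points == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # transform of even-indexed samples
    odd = fft_radix2(x[1::2])   # transform of odd-indexed samples
    half = n_points // 2
    out = [0j] * n_points
    for n in range(half):
        out[n], out[n + half] = butterfly(even[n], odd[n], twiddle(n, n_points))
    return out
```

For an 8-point input the recursion produces three layers of four butterflies each, twelve butterflies in total, and the butterflies within a layer do not depend on one another.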
Why Hardware?
  - Even more speed for the FFT.
  - Extremely parallelizable.
  - A whole layer can be done in two FPGA clock cycles (assuming sufficient multipliers):
    - 1 multiply cycle
    - 1 add cycle

Hardware Problems
  - Complexity
  - Input speed
  - Output speed
  - If the FPGA computes in 24.4 ns but takes 20 µs to transfer the input data, what gain is there? That is, 24.4 ns + 20 µs + 20 µs = ~40 µs.

Mitigation of Hardware Problems
  - Use a faster bus.
    - AMD Opteron's HyperTransport: 20.8 GB/s (166.4 Gb/s) per link (version 3).
    - Modules that fit into an AMD 64-bit Opteron socket:
      - http://www.drccomputer.com/pages/modules.html (Xilinx-based module)
      - http://www.xtremedatainc.com/xd1000_brief.html (Altera-based module)
  - Put the FPGA on the die with the DSP.
    - Needs silicon vendor support.
    - The FPGA can access memory on a very wide bus (e.g. 128 bits per cycle).
  - Implement the entire project in the FPGA.
    - Time-consuming to program.
    - Possibly insufficient room on the FPGA.

8-point FFT Demonstration in Hardware
  [Diagram: input array feeding the butterfly stages.]
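The transfer-time concern raised under Hardware Problems can be checked with a few lines of arithmetic. This sketch assumes the transfer figure is 20 microseconds in each direction and that input transfer, computation, and output transfer run strictly back to back with no overlap; the function names are illustrative.

```python
NS = 1e-9  # one nanosecond, in seconds
US = 1e-6  # one microsecond, in seconds


def end_to_end_time(compute_s, input_s, output_s):
    """Total latency when transfer in, compute, and transfer out do not overlap."""
    return input_s + compute_s + output_s


def compute_fraction(compute_s, input_s, output_s):
    """Share of the total latency that is actual computation."""
    return compute_s / end_to_end_time(compute_s, input_s, output_s)


if __name__ == "__main__":
    total = end_to_end_time(24.4 * NS, 20 * US, 20 * US)
    share = compute_fraction(24.4 * NS, 20 * US, 20 * US)
    print(f"total = {total / US:.4f} us, compute share = {100 * share:.3f}%")
```

Under these assumptions the computation is well under one percent of the total, which is why the mitigations focus on the bus rather than on the butterfly itself.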
  In the hardware diagram, X marks multiplication by a W factor and + marks addition.

Why Not Software?
  - Each butterfly must be done sequentially.
  - Only slight parallelism is enabled by a DSP like the TigerSHARC.
  - Each butterfly can be done in 2 cycles (after optimization).

Results of Testing: Linear Profiling of FFT Algorithm in C++
  Stage            Cycle count                         Time
                   8-point   32-point   256-point      8-point    32-point    256-point
  Initialization   21        25         25             35.07 ns   41.75 ns    41.75 ns
  Computation      1135      6922       174222         1.895 µs   11.559 µs   290.950 µs
  Single butterfly: 91 cycles, 151.97 ns

Results of Testing: Profiling of VHDL on FPGA
  - A butterfly takes 24.377 ns to execute.
  - 62% is computation; 38% is routing on the FPGA.

Product Offerings
  - Most DSP vendors
  - Many FPGA vendors (IP: intellectual property)
  - Microcontroller vendors (e.g. Blackfin)
  - FFTW: "the Fastest Fourier Transform in the West"
  - AMD Core Math Library
  - Intel library
  - Highly optimized for the expected hardware

Published Results
  "The Radix 4 version delivers a 1 K points complex processing time of 25 microseconds at 200-MHz system speeds and uses only about 10 percent of the resources in a mid-range Stratix device. The Radix 2 is half the size of the Radix 4 and offers a 1 K points complex processing time of 50 microseconds at 200-MHz system speeds. Additional versions of the new cores are under development." (FFT IP core)

Published Results (smaller is better)
  FFT/IFFT length   Texas Instruments C6713   Single 4DSP FFT core   Quad 4DSP FFT core
  256               12.3 µs                   3.68 µs                920 ns
  512               27.3 µs                   6.24 µs                1.56 µs
  1024              60.2 µs                   11.4 µs                2.85 µs

References
  - Signals, Systems, and Transforms
  - James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297-301 (1965).
  - http://www.drccomputer.com/pages/modules.html (Xilinx-based module)
  - http://www.xtremedatainc.com/xd1000_brief.html (Altera-based module)
  - http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_2353,00.html
  - http://www.us.design-reuse.com/news/news5650.html
  - http://www.4dsp.com/fft.htm