Custom Code Generation for Soft Processors
Martin Labrecque, Peter Yiannacouras, Gregory Steffan
ECE Dept., University of Toronto
Presented at RAAW 2006, Orlando, FL

Soft Processor: Processor in FPGA
[Figure: an FPGA whose programmable logic implements a processor datapath (PC, instruction memory, register array, ALU, data memory)]
- Compelling solution: software programmable
- Soft processors are end-user customizable
- Different application realm than hard ASIC processors
- Can add more features: trade area for performance
- Well-known approach: add custom instructions (e.g., A*B+C)
- Techniques orthogonal to custom instructions

Application-Specific Code Generation
[Figure: application, compiler, processor]
- Use default gcc, ISA?
- Interested in app-specific optimizations
- Processor customized for: area, power, wall-clock time, frequency requirements

Infrastructure

SPREE System [CASES 05, FPGA 06]
(Soft Processor Rapid Exploration Environment)
- Input: processor description (ISA and datapath)
- Datapath made of hand-coded components
- SPREE RTL generation:
  1. Verify ISA against datapath
  2. Datapath instantiation
  3. Control generation
- Handles multi-cycle/variable-cycle FUs, multiplexer select signals, interlocking, branch handling
- Output: synthesizable Verilog

Back-End Infrastructure
- 20 benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
- Modelsim RTL simulator:
  1. Cycle count
- Quartus II 5.0 CAD software, Stratix 1S40C5:
  2. Area
  3. Clock frequency
  4. Power
- We can measure area/performance/energy accurately

Area efficiency
A combined metric: MIPS per 1000 LEs

\[
\frac{\text{MIPS}}{1000\ \text{LEs}}
= \frac{\#\,\text{Million Instr.}}{\text{Wall-clock time} \times \text{Area}}
= \frac{\#\,\text{Million Instr.} \times \text{Frequency}}{\#\,\text{Cycles} \times \text{Area}}
\]

- 4 criteria trading off (power not included)
- Want app-specific (rather than average) improvement

Representative Processors
- Serial: F/D/R/EX/WB
- Pipe3: F/D, R/EX/M, WB
- Pipe5: F, D, R/EX1, EX2/M, WB
- Pipe7: F, D, R, EX1, EX2/M, EX3/WB1, WB2
- F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
- Range: from <900 LEs, <70 MHz to >1500 LEs, >100 MHz

SPREE vs Nios II
[Figure: geomean wall-clock time (us) of the SPREE processors compared with Altera Nios II/e, Nios II/s, and Nios II/f; lower is faster]
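The combined metric from the "Area efficiency" slide (MIPS per 1000 LEs) can be computed directly from the four measurements the back-end produces. The following is a minimal sketch; the function name `area_efficiency`, its parameter names, and the sample numbers in the usage note are illustrative assumptions, not values from the talk.

```python
def area_efficiency(million_instr, cycles, freq_mhz, area_les):
    """Return MIPS per 1000 LEs.

    million_instr: dynamic instruction count, in millions
    cycles:        total clock cycles taken by the benchmark
    freq_mhz:      clock frequency in MHz
    area_les:      processor area in logic elements (LEs)
    """
    # Wall-clock time in seconds = cycles / frequency (Hz).
    wallclock_s = cycles / (freq_mhz * 1e6)
    # MIPS = millions of instructions per second.
    mips = million_instr / wallclock_s
    # Normalize by area expressed in thousands of LEs.
    return mips / (area_les / 1000.0)
```

For example, with made-up inputs of 50 million instructions, 100 million cycles, 70 MHz, and 900 LEs, `area_efficiency(50, 100e6, 70, 900)` evaluates to about 38.9 MIPS per 1000 LEs. Raising frequency or lowering cycle count or area each improves the metric, which is why the slides treat it as a single figure of merit.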
  Reducing hardware shift support
  Removing hazard detection logic
B. Impact of unique ISA features
  Removing delay slots
  Hi/Lo registers vs 3-operand multiplies
  Using unaligned memory loads and stores
C. Application-specific register management
  Operand scheduling and forwarding lines
  Limiting the use of architected registers
D. Combining these into app-specific optimizations

Reducing Hardware Shift Support
- Best performance per area: using hard multiplier for shifting
- Multiplications and shifts: both in software?
- Software shifting using additions & subtractions
- Impact of removing the dedicated LUT-based shifter?
- Costs ~250 LEs, 30% of smallest soft processor
- Can we have partial hardware support for shifting?

Area for Various Shift Strategies (Pipe3)
[Figure: equivalent LEs for each shift strategy: multiplier-based shifter, shifter, 2 fixed shifters, 1 fixed shifter, no hardware support; annotated differences of 343 LEs and 48 LEs]
- 2 fixed-amount shifters are cheap!

Dynamic Instructions Containing Shifts
[Figure: percentage of dynamic instructions containing shifts, split into variable and fixed shift amounts, for bubble_sort, crc, des, fft, fir, quant, iquant, turbo, vlc, bitcnts, CRC32, qsort, sha, stringsearch, FFT_MI, dijkstra, patricia, gol, dct, dhry, and the mean]
- Some benchmarks have very few shifts
- Less than 2% of shift amounts are variable

How to get rid of the shifter
- Software-only shifts require an order of magnitude more cycles to compute
- Measure the cost in cycles for each shift operation
- Replace shifts by hard shifts and/or software shifts, e.g. for Srl 16:
    Srl 8, Srl 8
    or Srl 4, Srl 4, Srl 4, Srl 4
    or Srl 3, Srl 3, Srl 3, Srl 3, Srl 3, Shift_left(1) ...
    or ...
- Evaluate cost in cycles for all combinations of shifters available

Impact of up to 2 Fixed-Amount Shifters (pipe3)
[Figure: area efficiency (MIPS/1000 LEs) per benchmark, default vs fixed-shifters]
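The search described under "How to get rid of the shifter" (evaluate the cost in cycles for all combinations of available shifters) can be sketched as a small dynamic program. This is an illustration only, not the authors' tool: the function name `cheapest_shift`, the 1-cycle cost per fixed-shifter use, and the 10-cycle cost per software 1-bit shift are assumptions chosen to reflect the "order of magnitude" gap the slide mentions.

```python
def cheapest_shift(amount, fixed_shifters, hw_cost=1, sw_cost_per_bit=10):
    """Return (cycles, plan) for the cheapest way to shift by `amount` bits.

    fixed_shifters:  shift amounts available in hardware, e.g. [3, 8]
    hw_cost:         assumed cycles per use of a fixed-amount shifter
    sw_cost_per_bit: assumed cycles per 1-bit shift done in software
    """
    # best[k] holds the cheapest (cycles, plan) found for a k-bit shift.
    best = [(0, [])]
    for k in range(1, amount + 1):
        # Fallback: take the plan for k-1 bits and add one software 1-bit shift.
        cycles, plan = best[k - 1]
        candidate = (cycles + sw_cost_per_bit, plan + ["sw1"])
        # Try finishing with each hardware shifter that fits.
        for s in fixed_shifters:
            if s <= k:
                cycles, plan = best[k - s]
                option = (cycles + hw_cost, plan + [f"hw{s}"])
                if option[0] < candidate[0]:
                    candidate = option
        best.append(candidate)
    return best[amount]
```

Under these assumed costs, `cheapest_shift(16, [8])` picks two 8-bit hardware shifts, while `cheapest_shift(16, [3])` uses five 3-bit hardware shifts plus one software 1-bit shift, mirroring the decompositions listed on the slide. Running the same search over every shift amount a benchmark actually executes would indicate which one or two fixed shifters are worth keeping for that application.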
- Up to 2 fixed-amount shifters can improve area efficiency by up to 65%
- Beneficial for certain applications only

Removing Delay Slots
[Figure: pipeline timing diagrams (F/D, R/EX/M, WB over time): an instruction in the load delay slot avoids the load hazard; an instruction in the branch/jump delay slot avoids the PC hazard]
- Default MIPS has branch and load delay slots
- Under what conditions are they worth it?
- Load delay slots need no additional hardware support
  - Because of hazard detection in the processor
- Branch delay slots require hardware support
  - We only have predict-not-taken so far
  - Are working on better branch prediction

Removing Load Delay Slots (serial)
[Figure: normalized wall-clock time per benchmark (FFT_MI, dijkstra, gol, dct, dhry, mean)]
- 3% better performance for Serial, 2% for Pipe3

Removing Branch Delay Slots
[Figure: normalized area (LEs), frequency (MHz), and energy per instruction (nJ/instr) for pipe3, pipe5, pipe7]
- pipe3: 7% performance hit
- pipe7 improvements: 13% frequency, 8% performance

3-Operand Multiplies vs Hi/Lo Registers
[Figure: register file, multiplier, Hi/Lo registers, and MUX]
- Default MIPS has Hi/Lo registers
  - Motivated by multi-cycle multiplies
  - Hold multiplication results (Hi and Lo each 32 bits)
  - Two special instructions to access Hi/Lo
- Which to choose?
- 3-operand multiplies (NIOS II and Microblaze)
  - Two instructions compute high and low parts
  - Result is stored in register file

Impact of 3-Operand Multiplies
[Figure: normalized area, frequency, and energy per instruction for the 32 LSB and 32 LSB / 32 MSB variants; annotation: 8% slower clock]
- Saves area, reduces frequency, increases power

Impact of 3-Operand Multiplies
[Figure: normalized wall-clock time for serial, pipe3, pipe5, pipe7]
- Only pipe3 benefits from cycle savings
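The two multiply conventions compared above differ in where the product lands and how many instructions are needed to use it. The sketch below models both on a plain Python list standing in for the register file. The function names, the unsigned-only arithmetic, and the returned instruction counts are simplifying assumptions for illustration; they are not taken from SPREE.

```python
MASK32 = 0xFFFFFFFF

def hilo_multiply_low(regs, rd, rs, rt):
    """Hi/Lo style: multiply into Hi/Lo, then copy Lo to rd."""
    product = (regs[rs] & MASK32) * (regs[rt] & MASK32)
    hi, lo = product >> 32, product & MASK32  # models: multu rs, rt
    regs[rd] = lo                             # models: mflo rd
    return 2  # instructions issued to get the low word into a register

def three_operand_multiply_low(regs, rd, rs, rt):
    """3-operand style: one instruction writes the low 32 bits to rd."""
    product = (regs[rs] & MASK32) * (regs[rt] & MASK32)
    regs[rd] = product & MASK32
    return 1

def three_operand_multiply_high(regs, rd, rs, rt):
    """3-operand style: a second instruction yields the high 32 bits."""
    product = (regs[rs] & MASK32) * (regs[rt] & MASK32)
    regs[rd] = product >> 32
    return 1
```

When only the low 32 bits are needed, the 3-operand form saves one instruction per multiply in this model, which is one plausible source of the cycle savings noted above; whether that outweighs the slower clock depends on the pipeline, as the slides report.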
Forwarding Lines and Code Generation
- Necessary to forward both operands (A and B)?
  - Simultaneous dependences:
      r3 = r1 + r2
      r4 = r3 + r3
  - Non-commutative operations:
      r3 = r1 + r2        r3 = r1 + r2
      r4 = r5 - r3        r4 = r3 - r5
- Compiler can reorder commutative operands of instructions
- Can the compiler compensate when there is only one forwarding line?
- Save ~30 LEs for a forwarding line and incur more stall cycles?
- Added 1-2% cycle improvement with 1 forwarding line
- 3-4% short of the performance with 2 forwarding lines
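The operand reordering mentioned above can be sketched as a simple compiler pass. The model here is an assumption made for illustration: only operand A has a forwarding line, a dependence on the immediately preceding instruction through operand B costs one stall, and instructions are tuples of the hypothetical form `(op, dest, src_a, src_b)`. It is not the authors' gcc modification.

```python
COMMUTATIVE = {"add", "and", "or", "xor", "mul"}

def schedule_operands(instrs):
    """Swap commutative operands so dependences use the forwarded port A.

    Returns (rewritten_instrs, stalls) under the one-forwarding-line model.
    """
    out, stalls = [], 0
    prev_dest = None
    for op, dest, a, b in instrs:
        # If only operand B depends on the previous result, move it to A.
        if b == prev_dest and a != prev_dest and op in COMMUTATIVE:
            a, b = b, a
        # A remaining dependence through B cannot be forwarded: stall.
        if b == prev_dest:
            stalls += 1
        out.append((op, dest, a, b))
        prev_dest = dest
    return out, stalls
```

With the slide's examples, `r4 = r5 + r3` after `r3 = r1 + r2` is rewritten to `r4 = r3 + r5` with no stall, whereas `r4 = r3 + r3` (simultaneous dependence) and `r4 = r5 - r3` (non-commutative) each still stall once in this model.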
Soft Processor Customization Techniques
1. Best overall (general purpose) processor
2. Best per application (application-tuned)
3. Reduce processor by reducing the ISA (Subset)
   SPREE automatically removes:
   - Unused connections
   - Unused components
   - Unused parts of the ISA
4. Apply optimization techniques (Opt)

Average Combined Improvements (pipe3)
[Figure: efficiency (MIPS/1000 LEs)]
- App-specific: +11%
- Subsetting: +8%
- Opts: +12%
- Subsetting & Opts: +25%
- Opt: 2 fixed shifts, no delay slots, 3-operand multiplies, operand scheduling
- Overall 36% improvement in efficiency!

Summary                                          Conclusion
1. Software-only and custom shifters             App-specific
2. Load delay slots                              Useless with hazard detection
3. Branch delay slots                            Useful with poor branch prediction
4. 3-operand multiply                            Processor-specific
5. Operand scheduling to save a forwarding line  App-specific

- 12% area efficiency over app-specific processor
- 17% area efficiency over subsetted app-specific processor
- Without adding complexity!

Future Research
- Integrating branch prediction in SPREE
- Research on memory hierarchy
- Automatic selection of app-specific SP features

Thank you

Architectural Parameters Used in SPREE
- Multiplication support: hardware FU or software routine
- Shifter implementation: flipflops, multiplier, or LUTs
- Pipelining: depth (2-7 stages), forwarding lines
- We focus on core microarchitecture (for now)

Related work
- Custard [Dimond, Mencer, Luk]
  - Customizable forwarding lines
  - Optional delay slots
- NIOS II [Altera]
  - 3-operand multiply
  - No delay slots
- Microblaze [Xilinx]
  - 3-operand multiply
  - Branches with and without delay slots
- No specific evaluation of studied features in SP

Removing some/all hazard detection logic
[Figure: Pipe3 timing diagram (F/D, R/EX/M, WB over time) showing a potential hazard and the resulting F/D stall]
- Can the compiler compensate with scheduling?
  - E.g., in the worst case use no-ops to ensure correctness
- Challenge: variable, multi-cycle instructions
- What is the cost/benefit of doing so?

Impact of Removing Hazard Detection Logic
[Figure: normalized area (LEs) and frequency (MHz) for pipe3, pipe5, pipe7]
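The worst-case fallback mentioned above (padding with no-ops when the hardware no longer detects hazards) can be sketched as follows. This is a simplified model under stated assumptions: every result becomes usable a fixed `min_distance` issue slots after its producer, and instructions are hypothetical tuples `(op, dest, src_a, src_b)`. Variable-latency, multi-cycle instructions, which the slide names as the real challenge, are deliberately not handled.

```python
NOP = ("nop", None, None, None)

def insert_nops(instrs, min_distance=2):
    """Pad with no-ops so each consumer issues at least `min_distance`
    slots after the instruction that produced its source register."""
    out = []
    last_write = {}  # register name -> index in `out` of its last writer
    for op, dest, a, b in instrs:
        for src in (a, b):
            if src in last_write:
                # Keep padding until the producer is far enough behind.
                while len(out) - last_write[src] < min_distance:
                    out.append(NOP)
        out.append((op, dest, a, b))
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out
```

A scheduler would normally try to fill those slots with independent instructions first and only fall back to no-ops; comparing the added cycles against the area and frequency gained by dropping the detection logic is the cost/benefit question the slide raises.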