Reconfigurable Microprocessors
Lih Wen Koh
05s1 COMP4211 presentation
18 May 2005

Presentation Overview
  Current Research Direction
  Related Work
  Experiments
  What Next?

Current Research Direction
  Wide superscalar, out-of-order execution processor core
    Exploits ILP
    But true data dependencies are inherent in application programs
  MIPS R10k, NetBurst, AMD etc. use a bypass network to forward just-computed results
    Allows back-to-back issue of dependent instructions
  Complexity of the bypass network grows quadratically w.r.t. issue width

  [Figure: Hardware Components of MIPS R10000 [Yeager96]]
    Pipeline stages: Fetch, Decode, Issue, Execute, Write
    Fetch components: Instruction Predecode, Branch History Table, Instruction Cache, Instruction TLB
    Decode components: Active List (32 entries), Instruction Decode, Register Map Tables (1 for Int, 1 for FP), Free Register Lists (1 for Int, 1 for FP)
    Queues: Integer Queue (16 entries), Mem Queue (16 entries), FP Queue (16 entries)
    Register files: Integer Registers / Bypass (64 x 64 bits), FP Registers / Bypass (64 x 64 bits)
    Execution units: Address + Data TLB, ALU1, ALU2, FP +, FP *, Data Cache
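
The quadratic growth of the bypass network noted above can be illustrated with a small counting model. This is a sketch under assumptions that are not in the slides: a fully connected network in which every functional unit can forward a result from each of a fixed number of pipeline stages to every operand input of every unit. The function name bypass_paths and its parameters are hypothetical.

```python
def bypass_paths(issue_width: int,
                 stages_after_result: int = 1,
                 operands_per_unit: int = 2) -> int:
    """Count forwarding paths in an assumed fully connected bypass network."""
    # Every unit can produce a result in each forwarding stage ...
    producers = issue_width * stages_after_result
    # ... and every operand input of every unit may need that result.
    consumers = issue_width * operands_per_unit
    return producers * consumers


if __name__ == "__main__":
    for width in (2, 4, 8, 16):
        print(f"issue width {width:2d}: {bypass_paths(width)} paths")
```

Under this model, doubling the issue width quadruples the number of paths, which is the scaling problem the slide refers to.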

Current Research Direction
  Observation 1: Multi-cycle broadcast
    Wire delays are accounted for in Intel NetBurst
    Allows a higher processor clock frequency at the cost of reduced IPC
  Observation 2: The FP execution unit is idle most of the time, even in FP-intensive applications (5-10%) [Sassone04]

  [Figure: Proportion of Functional Unit Type Requested, 0% to 100%, in two panels: SPEC2000 Applications and MediaBench Applications. Categories: Rd/Wr Ports, FP_MULT/DIV, FP_ALU, Int_MULT/DIV, Int_ALU]

Literature Survey
  [Epalza04]: Dynamic Reallocation of Functional Units in Superscalar Processors
    Switches the execution mode of idle floating-point units to four additional integer ALUs
    The added bypass networks add 1 cycle of latency to the modified FPU
    19% speedup for SPECint2000
    3.5% speedup for SPECfp2000
    Issues: need to improve control for mode switching

Plans
  Other patterns:
  [Figure: dependence patterns among nodes x (x1, x2, x3), y (y1, y2, y3) and z (z1, z2). Each node is annotated with the possible sources of its inputs: the register file, or an earlier node in the pattern. For example, the 2nd input of node y may come from the register file or from node x; the 2nd input of node z may come from the register file, node x, or node y.]

Related Work
  [Palacharla97] Dependence-based (FIFO queues + clustered execution units)

Related Work
  Extension to rePLay framework [Yehia04]

Experiment: Chaining pairs of dependent instructions
  [Intel01] Double-speed ALUs
  [Vassiliadis96] 3-1 Interlock Collapsing ALUs

  [Figure: a normal integer ALU and a 3-1 interlock collapsing ALU, fed from the Register File]
    Normal Integer ALU: 4 stages Carry Lookahead Adder; 1 stage Logic Operations; produces the result of the first instruction in the dependent sequence
    3-1 Interlock Collapsing ALU: Control; 1 stage mux; 4 stages Carry-Save Adder; Logic Operations; Carry-Lookahead Adder + Logic Operations; produces the result of the second instruction in the dependent sequence

Experiment: Chaining pairs of dependent instructions
  Modifications to sim-outorder for SimpleScalar PISA.

  [Figure: sim-outorder pipeline with an added Chained ALU]
    ruu_fetch(): Instruction Fetch Queue (IFQ)
    ruu_dispatch(): Register Update Unit (RUU); Load/Store Queue (LSQ) for F_MEM instructions
    Ready Queue: entered when Operands ready (RUU) or EA ready (LSQ)
    ruu_issue(): issue if the requested functional unit is not busy
      Functional units: IntALUs, Int Mult/Div, Rd/Wr Ports, FP Adders, FP Mult/Div, Chained ALU
      Modification:
        if the requested functional unit is IntALU
        && the list of in-flight instructions waiting only on the result of this instruction is non-empty
        && the chained ALU is not busy
        => schedule this instruction and the first obtained dependent instruction to the chained ALU
    Event Queue
    ruu_writeback(): Instruction WriteBack (Broadcast/Bypass Logic); Branch misprediction? If so, recover
    ruu_commit(): Instruction Commit

Experiment: Chaining pairs of dependent instructions
  [Figure: Speedup on IPC for MediaBench Applications (fetch-decode-issue-commit width = 8, ruu:size = 32, #ialu = 8, #cialu = 2). y-axis: 0% to 25%; one series each for broadcast_delay = 1, 2, 3, 4; x-axis: MediaBench Applications]
  2 CIALUs sufficient
  IPC improvement of ~8%, solely due to savings of broadcast cycles
  Reduces utilization of IALUs by ~50%
  Reduces the number of queue entries waiting for a result by up to 45%
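
The modified ruu_issue() rule from the sim-outorder experiment above can be sketched as follows. This is an illustrative Python model, not SimpleScalar code: Instr, choose_issue, the "ChainedALU" label and the unit_busy dictionary are hypothetical names introduced here, and the model only covers the decision stated on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    unit: str  # requested functional unit class, e.g. "IntALU"
    # In-flight instructions waiting ONLY on this instruction's result.
    sole_dependents: list = field(default_factory=list)


def choose_issue(instr, chained_alu_busy, unit_busy):
    """Return (unit, instructions issued together), or None if it must wait."""
    if instr.unit == "IntALU" and instr.sole_dependents and not chained_alu_busy:
        # Pair the producer with its first waiting dependent on the chained ALU.
        return ("ChainedALU", [instr, instr.sole_dependents[0]])
    if not unit_busy.get(instr.unit, False):
        # Baseline rule: issue if the requested functional unit is not busy.
        return (instr.unit, [instr])
    return None


if __name__ == "__main__":
    consumer = Instr("sub", "IntALU")
    producer = Instr("add", "IntALU", [consumer])
    unit, group = choose_issue(producer, chained_alu_busy=False, unit_busy={})
    print(unit, [i.name for i in group])
```

Because the pair is issued together, the dependent instruction does not have to wait for the producer's result to be broadcast, which is where the slide attributes the saved cycles.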

What Next?
  Chaining sequences of 3 dependent instructions, other patterns out of the 80.
  Architectural impact of adding chained units: complexity of local bypass network etc.
  Replace chained units by xALUs converted from the CSA trees in a FP multiply/divide unit
    Need to explore the hardware circuits of FP multiply/divide
  Develop an adaptive configuration scheme to best match the interconnections of the swappable xALUs to the patterns of in-flight instructions.
    Need to determine the most frequent subset of patterns

References
  [Vassiliadis96] James Phillips and Stamatis Vassiliadis. High-Performance 3-1 Interlock Collapsing ALUs.
  [Yeager96] Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 1996.
  [Palacharla97] Subbarao Palacharla, Norman P. Jouppi, J.E. Smith. Complexity-Effective Superscalar Processors. ISCA 1997.
  [Intel01] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Patrice Roussel. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, Q1 2001.
  [Epalza04] Marc Epalza, Paolo Ienne, Daniel Mlynek. Dynamic Reallocation of Functional Units in Superscalar Processors. 9th Asia-Pacific Computer Systems Architecture Conference (ACSAC), 2004.
  [Yehia04] Sami Yehia and Olivier Temam. From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance without ILP or Speculation.
  [Sassone04] Peter G. Sassone and D. Scott Wills. Multicycle Broadcast Bypass: Too Readily Overlooked. Proceedings of the Workshop on Complexity-Effective Design (WCED), May 2004.

Thank You

Overview of Research Topic
  Goal of this research: investigate the feasibility and potential benefit of effective, automated runtime compilation and execution of software binaries on reconfigurable microprocessors

  [Figure: execution flow between the Software Binaries and the Reconfigurable Microprocessor (superscalar processor + reconfigurable logic)]
    Software binaries executing only on superscalar processor
    Monitor the execution of the coupled system
    Profile committed instructions to identify critical code regions
    Identify and extract suitable instructions from critical code regions for collapsing into complex, atomic instructions
    Assembly-to-hardware mapping of collapsed instructions
    On the next execution of the transformed critical region, load configuration for the reconfigurable logic
    Transfer execution from superscalar processor to the reconfigurable unit
    Continue normal execution of binary code following the transformed critical code region.

Motivations
  Improved execution performance by exploiting parallelism and redundancy in hardware.
  Adaptation of hardware resources based on the dynamic behaviour of programs.
  Availability of runtime profile allows exploitation of runtime optimizations otherwise difficult to exploit at compile time.
  Compilation at the binary level allows execution of legacy software binaries.
  Runtime compilation allows transparent migration of software code to hardware.
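
The step "profile committed instructions to identify critical code regions" in the research overview above could be prototyped along the following lines. This is a minimal sketch, not the method used in this research: it assumes a region is a fixed-size, aligned block of addresses and that a region is critical when it accounts for a chosen fraction of all committed instructions. The name find_critical_regions and the parameters region_bytes and hot_fraction are hypothetical.

```python
from collections import Counter


def find_critical_regions(committed_pcs, region_bytes=64, hot_fraction=0.25):
    """Return start addresses of regions holding >= hot_fraction of commits."""
    # Map each committed PC to the start of its aligned region and count.
    counts = Counter(pc - pc % region_bytes for pc in committed_pcs)
    total = sum(counts.values())
    return sorted(start for start, n in counts.items()
                  if n / total >= hot_fraction)


if __name__ == "__main__":
    # Assumed trace: a small loop executed many times plus two cold instructions.
    trace = [0x400000 + 4 * (i % 8) for i in range(80)] + [0x500000, 0x600000]
    print([hex(r) for r in find_critical_regions(trace)])
```

A runtime system would feed this from the commit stage and hand the selected regions to the instruction-collapsing step; both thresholds are tuning knobs rather than values taken from the slides.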