Coarse Grain Reconfigurable Architectures

Coarse Grain Reconfigurable Architectures

PATMOS 2015, the 25th International Workshop on Power and Timing Modeling, Optimization and Simulation; Salvador, Bahia, Brazil, Sept 1-5, 2015 Reiner Hartenstein TU Kaiserslautern How to cope with the Power Wall >> Outline << TU Kaiserslautern The Power Wall Dataflow Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions http://www.uni-kl.de 2015, [email protected] 2 http://hartenstein.de The Workshop Series TU Kaiserslautern spin-off from the PATMOS project Project .l eader:

Reiner Hartenstein partner . leader: Antonio Nez partner . leader: Francis Jutand . . Oldest conference series on power efficiency http://hartenstein.de/PATMOS/ Power efficiency is going to become an industry-wide issue Some incremental improvements are on track, at all abstraction levels - not only in hardware design and software development however, there is still a lot to be done 2015, [email protected] 3 http://hartenstein.de Power Efficiency of Programming Languages

(an example) 2015, [email protected] 4 http://hartenstein.de ICT infrastructures Already the internets 2007 carbon footprint was higher than that of worldwide air traffic TU Kaiserslautern Power consumption by internet: x30 til 2030 if trends continue Columbia riv G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 11 Sep 2008 er Its more than the entire worlds total power 2015, [email protected] to-day !!! 5 consumption New York Times Data Center at Dallas 5

IDC predicts: the total number of data centers http://hartenstein.de around the world will peak at >> Outline << TU Kaiserslautern The Power Wall Dataflow Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions http://www.uni-kl.de 2015, [email protected] 6 http://hartenstein.de Terminology Problems TU Kaiserslautern Searching a more power-efficient machine paradigm Stressing differences to Control-Flow Computers an area called Dataflow Computers was started mid 70ies at MIT Xputer area people are forced to sidestep by using terms like data-driven or data streams Tagged Token Flow although the Dataflow scene is IStructure-centered. http://hartenstein.de 2015, [email protected]

7 A Second Opinion TU Kaiserslautern D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982 the subtitle: "... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future." ( Still active workshops , e. g.: 5th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA ) However, the power efficiency break-thru did not happen here 2015, [email protected] 8 http://hartenstein.de Computing Paradigms term term CPU CPU Xputer RPU DFC* DFC DPU program counter

rDPU data counters DPUs I- progra m counter execution triggered by yes instructio n fetch (von Neumann) no data arrival** data-driven or data-streambased complicat ed Istructure handling Dataflow Computer no

structure *) based on tagged token I-Structure 2015, [email protected] paradigm instructionstream-based Reconfigurable Computin 9 **) transport-triggered http://hartenstein.de >> Outline << TU Kaiserslautern The Power Wall Dataflow Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions http://www.uni-kl.de 2015, [email protected] 10 http://hartenstein.de Speedup -Factor 10 TU Kaiserslautern

Speedup-Factor + Pre-FPGA solutions 6 Sav ing P o wer processing, Xputer Image Pattern matching, 28514 10 DES breaking 343 Multimedia 9 5 DSP and DNA & protein Design Rule Check real-time face sequencing accelerator PISA detection wireless 15000 (fair 77 6000 comparizon) no FPGA:

DPLA on MoM by TU-KL* 1984 1984: 1 DPLA replaces 256 FPGAs fabricated by E.I.S. Multi University Project Chip http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html Speed-ups by vN Software to FPGA Migrations 0 10 1985 1990 3000 9 imaging 1000 ~3 00 730 1000 pattern Viterbi

900 400 recognition Smith-Waterman Decoding 288 pattern matching SPIHT wavelet-based image compression 457 100 FFT BLAST 52 40 88 molecular protein dynamics identification simulation p 10 103 av w in er ~ : g sp 1 0 ee % d- o u f

3 8723 CT video-rate MAC Crypto stereo vision As tr op x2 hy /y si Po cs r s >15 years earlier Reed-Solomon Decoding 2400 Bioinformatics 20 20 GRAPE 0 10 1995 2015, [email protected] *) TU Kaiserslautern 11 http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf

2000 2005 2010 http://hartenstein.de The Reconfigurable Computing Paradox although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of: wiring overhead reconfigurability overhead routing congestion von Neumann: an extremely powerinefficient paradigm von Neumann Syndrome Reinvent Computing 2015, [email protected] C.V. RamamoorthyVon Neumann Syndrome 12 http://hartenstein.de Obstacles to widespread FPGA adoption go well beyond the required skill set - Workshop at FPL_2015 http://reconfigurablecomputing4themasses.net/ 2015, [email protected] 13

http://hartenstein.de >> Outline << TU Kaiserslautern The Power Wall Dataflow Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions http://www.uni-kl.de 2015, [email protected] 14 http://hartenstein.de TU Kaiserslautern Dual paradigm mind set: an old hat but was ignored time to space mapping: procedural to structural: loop to pipe mapping is s i h T o ! RTMs: s ple isionPDP-16 token bit s i m n el V

1971 Tun evoke FF why did it take 25 years to find out? FF FF 1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc. C. G. Bell et al: The Description and Use of RegisterTransfer Modules (RTM's); IEEE Trans-C21/5, May 1972 2006, [email protected] 15 http://hartenstein.de 1980 The Systolic Arrays (1) no instruction streams needed time (pipe network) DPA x x x DataPath Array (array of DPUs) time define: ... which data item at

which time at which port x x x | x x x input data stream execution transporttriggered | | port # time x x x x x x - - - - x x x - - - - x x x x x x - - - - -

- - x x x port # 2015, [email protected] | | | | | | | | | | x x x time 16 x x x | Design of Special-Purpose

VLSI Chips ... IEEE 7th ISCA, La Baule, France, May 6-8, 1980 | | data M. J. Foster,streams H. T. Kung: The port # output data streams | x x x H. T. Kung http://hartenstein.de TU Kaiserslautern l a i c e p s . y l ? n

e o s t o s ju purp M. J. Foster and H. T. Kung: The Design of SpecialPurpose VLSI Chips ... It is not sufficient to invent something. You need to recognize, that you have invented something. 2015, [email protected] 17 (Tunnel Vision) http://karl-steinbuch.org Systolic Arrays (2) Karl Steinbuc 17 http://hartenstein.de What Synthesis Method? H. T. Kung: of course algebraic! (linear projection) only linear pipes supports only very special applications with strictly regular data dependencies http://kressarray.de/

The supers generaliza ystolic array: tio a ion of the s y stolic array now a gen : eral purpo se method ology ! My student Rainer Kress replaced it by simulated annealing: this supports also any irregular & wild shap pipe networks 2015, [email protected] *) KressArray [ASP-DAC-1995] 18 http://hartenstein.de H. T. Kung: Its not our job TU Kaiserslautern Who the d genera tes* atas trea m s? another Tunnel Vision Symptom without a sequencer: missed to invent a

new machine paradigm (the Xputer) 2015, [email protected] *) or receives 19 http://hartenstein.de >> Outline << TU Kaiserslautern The Power Wall Dataflow Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions http://www.uni-kl.de 2015, [email protected] 20 http://hartenstein.de The Xputer machine Paradigm TU Kaiserslautern is the TU-KLs Symbiosis of (TU-KL) Time to Space Mapping and Xputer literature Reconfigurable Computing! ASM obtained by adding auto-sequencing memory (ASM)

With data counters instead of a program counter ASM ASM GAG GAG: Generic Address Generstor 2015, [email protected] 21 ASM RAM data counter http://hartenstein.de state register(s): MoPL program counter: data counter(s): TU Kaiserslautern But there is an Asymmetry Asymmetry 2006, [email protected] 22 rn Flowware Languages read next data item goto (data address) jump to (data address) data loop

data loop nesting data loop escape data stream branching yes: internally parallel loops more simple: no ALU tasks Xputer literature lea data eas -proce y t dural o con tr ol - p ro c edu ral Software Languages read next instruction goto (instruction address) jump to (instruction address) instruction loop instruction loop nesting instruction loop escape instruction stream branching no: no internally parallel loops Xputer pages Duality of procedural Languages

http://hartenstein.de Compilation: Software vs. Configware u. Flowware TU Kaiserslautern Xputer literature Software Engineeri ngprogram source Configwar C, etc. time to e space Engineerin source mapping g program mapper software compiler procedural: time domain) configware compiler space domain datascheduler time domain software code configware code flowware code 2015, [email protected]

23 http://hartenstein.de Heterogeneous: Co-Compilation TU Kaiserslautern Xputer pages C, or other high level language automatic SW / CW partitioner Software / Configware software Co-Compiler compiler mapper configware compiler datascheduler software code configware code flowware code 2015, [email protected] 24 http://hartenstein.de >> Outline << TU Kaiserslautern The Power Wall Dataflow Computing Reconfigurable Computing Time to Space Mapping

The Xputer Paradigm Conclusions http://www.uni-kl.de 2015, [email protected] 25 http://hartenstein.de Illustrating the Paradigm Trap TU Kaiserslautern The von Neumann The Memor y Wall the watering can model [Hartenstein] (1) The von Neumann Manycore Approach ( many single core Approach crippled watering cans ) ( crippled watering can ) 2015, [email protected] 26

many von Neuman n bottlenecks http://hartenstein.de Illustrating the Paradigm Trap (2) TU Kaiserslautern The Dataflow Computer extremely complicate d: no watering can model ! (a power efficiency break-thru did not happen here) 2015, [email protected] 27 http://hartenstein.de Xputer: the only massively power-efficient Paradigm TU Kaiserslautern the watering can model [Hartenstein] The Xputer Paradigm has no

von Neuman n bottleneck 2015, [email protected] 28 http://hartenstein.de We need a Seismic Shift TU Kaiserslautern to avoid unaffordable ICT power consumption cost in the future This is an extremely massive challenge, since more than a half century of software sits squarely on top The Aristotelian model with the von Neumann CPU as the center of the world has become obsolete To cope with power wall and manycore we have to accept a Copernican world of Heterogeneous Systems For many more years we need a heterogeneous triple-paradigm mind set: Configware, Flowware, and still even Software 2015, [email protected] 29 http://hartenstein.de thank you ! 2015, [email protected] 30 http://hartenstein.de

END 2015, [email protected].de 31 http://hartenstein.de Backup for discussion: 2015, [email protected] 32 http://hartenstein.de TU Kaiserslautern First FPGA available 1984 from Xilinx LUT LUT LUT LUT LUT LUT Table Table LUT 2015, [email protected] http://hartenstein.de

Transformations since the 70ies (time to time/space mapping) Loop Transformations: rich methodology published: [survey: Diss. Karin Schmidt, 1994, Shaker Verlag] time domain: procedure domain program loop space domain: structure domain Strip Mining Transformation Pipeline n x k time steps, 1 CPU time algorithm 2015, [email protected] k time steps, n DPUs space/time algorithmus 34 http://hartenstein.de The Reconfigurable Computing Paradox although the effective integration density of FPGAs is by 4 orders of magnitude behind the

Gordon Moore curve, because of: wiring overhead reconfigurability overhead routing congestion etc. Reinvent Computing Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing. http://hartenstein.de 2015, [email protected] 35 ASM | | | use data counters, no program counter - - - x x x ASM - - - - x x x ASM x x x - -

- - - - - x x x ASM | | | | | | | | | | x x x 36 x x x ASM x x x

ASM ASM: AutoXputer pages Sequencing Memory 2015, [email protected] | ASM a dat lso m a r or ou e i tes rre sup gula po r rt e d | | Xputer machine paradigm x x x x x x x x x x x x -

| implemented ASM by distributed ASM on-chipmemory ASM x x x ASM ASM general purpose reconfigurabl e (pipe network) rDPA ASM Data stream generator usage GAG RAM data counter GAG: Generic Address Generators (reconfigurable: to avoid memory-cyclehungry address http://hartenstein.de computation) Paradigm Shift Consequences Xputer literature

TU Kaiserslautern von Neumann: Software Engineering resources: CPU 1 software fixed algorithm: variable programming Program Counter (PC) source needed Xputer: Configware Engineering configware resources: variable 2 programming flowware algorithm: variable Configuratio n Code (CC) sources Data Counters (DCs) sequencing code (e. g. see MoPL language) Xputer and Heterogenous vN: needed Engineering all 3 programming sources

needed 2015, [email protected] 37 http://hartenstein.de Pipelining through DPU Arrays: the TU-KL Xputer principle no memory wall input data streams | | x x x x x x - - - - x x x - - - - x x x x x x - - - - - - - x x x no instruction streams no message passing nor thru common memory 2015, [email protected]

| 38 | | | | | | | | | | | x x x x x x | DPU operation is transport-triggered DPA

| massively avoiding memory cycles x x x x x x x x x output data streams | x x x DPA = DPU array DPU = Data Path Unit http://hartenstein.de IMIT would Token Dataflow Tagged Token Flow Architecture* call itTagged I-Structure Storage

no PC no updateable global store PE: I-Structure Storage RU to/from the Communicatio n Network ? instruction is executed SU *) source: Jurij Silc Form Token Unit even if some of its operands are not yet available 2015, [email protected] Communication Network Token Queue PE PE Wait-Match Unit & Waiting Token Store

Instruction Fetch Unit ALU & Form Tag 39 Program Store & Constant Store http://hartenstein.de Solution: I-Structure concept* TU Kaiserslautern Problems with Dataflow Architectures very complex data structures ! read request deferrend but write operation is allowed at least one read request has been deferred ? can be read but not written ? each update consumes the structure and the value produces a new data structure [source: Jurij Silc] awkward or

= Tagged Token Structure even impossible to *) I-Structure Flow implement instead of Dataflow 40 http://hartenstein.de 2015, [email protected] I-Structures (I = incremental) - part 1 41 Jurij Silc: Dataflow Architectures http://csd.ijs.si/courses/dataflow/ 26 28 28 2015, [email protected] http://hartenstein.de 27 I-Structures - part 2 Jurij Silc: Dataflow Architectures http://csd.ijs.si/courses/dataflow/ MIT Tagged-Token Dataflow Architecture (PE)

42 31 29 I-structure select 2015, [email protected] Istructure assign 30 http://hartenstein.de 20 Reconfigurable Computing (RC): the intensive Impact TU Kaiserslautern Speed-ups by von Neumann to RC Migrations Tarek ElGhazawi [Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008] SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster Application DNA and Protein sequencing . DES breaking Speed-up

factor 8723 28514 much less memory and bandwidth needed 2015, [email protected] Savings divisor Power Cost Size 43 779 3439 22 96 253 1116 much less massively equipment needed saving energy http://hartenstein.de Taxonomy TU Kaiserslautern Flynns taxonomy: von Neumann only Dianas taxonomy: Reconfigurable Computing Reiners taxonomy:

heterogeneous systems Reiners 2nd taxonomy: Xputers only noI 2015, [email protected] 44 http://hartenstein.de Programming Language Popularity (IEEE) source: iStockphoto not yet determined by power efficiency 2015, [email protected] 45 http://hartenstein.de TU Kaiserslautern Von Neumann Syndrome Lambert M. Surhone, Mariam T. Tennoe, Susan F. Hennessow (ed.): Von Neumann Syndrome; 2015, [email protected] etascript publishing 2011

e h t o t g n i t n f i o v Sh f o n e a c n d a e domin n-only caus f n o a e m g u a Ne m a d d

e t a l u f o s cum n o i l l i r t t s a e l not at f i , s r a l l . Do s n o i l

l i r d qua 46 http://hartenstein.de

Recently Viewed Presentations

  • ECVO exam Slide sample questions

    ECVO exam Slide sample questions

    3. Turtle, 5years old . This animal, housed in an aquarium and fed an insect/ meat diet, presented with bilaterally swollen and closed eyelids, and poor growth.
  • The Law and Economics of Tort and Criminal Law

    The Law and Economics of Tort and Criminal Law

    The law will promote efficiency if it From the standpoint of positive law and economics, one would predict that the principal defense to trespass on land would be The statement that best describes the difference between protecting entitlements by property...
  • Settlement in the West

    Settlement in the West

    Hudson's Bay Company vs North West Company. Major Trading Posts & Routes. The Metis . The Metis were a mixture of mostly French and First Nations people. Often members of the NWC. The practice of First Nations women and European...
  •  All energy on Earth depends on the Sun.

    All energy on Earth depends on the Sun.

    Bioaccumulation or Biomagnification is a problem in ecosystems because of energy flow. Lower level consumers get amounts of a pollutant that is not toxic, but top consumers get very toxic effects. Bioaccumulation: Arctic Food Chain Bioaccumulation: Great Lakes Food Chain...
  • Mining Heterogeneous Information Networks

    Mining Heterogeneous Information Networks

    1-dimensional ideal point model (Poole and Rosenthal, 1985; Gerrish and Blei, 2011) High-dimensional ideal point model (Poole and Rosenthal, 1997) Issue-adjusted ideal point model (Gerrish and Blei, 2012) (3) First train a global ideal point like 1. When talked about...
  • 2MW PM Machine Design for DirectDriven Wind Turbine

    2MW PM Machine Design for DirectDriven Wind Turbine

    The Ohio State University. April, 2010. Contents. Introduction. Major Wind Power System Configurations. Challenges to Remain in Power Grid. Why PM Direct-Driven WTG Getting Popular. Initial Design and Performance Analysis. Specifications and Sizing.
  • A Review of Mapping Wilderness Character for National Park ...

    A Review of Mapping Wilderness Character for National Park ...

    Habituated bears. 1. Physical Resources. Soil erosion - recreation/maintenance related loss of soil. 6. Air quality - sulphur. 5. Air quality - nitrogen. 5. Dammedwater. 2. Soil erosion - dam water levels affecting shoreline. 1. Biophysical processes. Departure from natural...
  • USC Clinical Trials Office (CTO) Answers to CTOs

    USC Clinical Trials Office (CTO) Answers to CTOs

    Clinical Trial Initiation In True 2.0. PI/Research Coordinator responds to all questions and provides mandatory documents (i.e. study protocol, proposed Clinical Trial Agreement if provided, and sponsors first proposed budget) for CTO's review (please refer to True 2.0 training slides).