Coarse Grain Reconfigurable Architectures

Prof. Ahmed Hemani, Systems Architecture and Methodology, Electronic, Computer and Software Systems, School of Information and Communication Technology, KTH Royal Institute of Technology, Stockholm, 11 September 2009

Reiner Hartenstein, TU Kaiserslautern: Programmer Education for the Multicore Era: the Twin-Paradigm Approach. http://hartenstein.de

Outline: The single-paradigm dictatorship / Von Neumann vs. FPGA / The Datastream Machine Model / Avoiding address computation overhead / The twin-paradigm approach / Conclusions

The Spirit of the Mainframe Age

For decades we have trained programmers to think sequentially, breaking complex parallelism down into atomic instruction steps, finally tending to code sizes of astronomic dimensions. Even in hardware courses (the unloved child of CS curricula) we mostly teach von Neumann machine design, deepening this

tunnel view. 1951: hardware design also goes von Neumann (microprogramming).

Program Performance

"Multicore computers shift the burden of software performance from chip designers to programmers." [J. Larus: Spending Moore's Dividend; Communications of the ACM, May 2009] ... performance drops and other problems in moving from single-core to multicore ... Not the law of Moore, but the "law of More": massively decreasing programmer productivity. Supercomputing: a multi-disciplinary, multi-location team works for several years; when the software is ready, the hardware is obsolete. Programmer population and methodology are missing: a scenario like before the Mead-&-Conway revolution.

Why We Went Multicore

Four walls:
- instruction-level parallelism
- memory

- power
- complexity

Multicores promised to remove the walls. Thousands of cores: boy, so many challenges!

More Cores Instead of Faster Cores

Can we avoid the decline from growth industry to replacement business? No, not without a redefinition of the field, and not under the single-paradigm dictatorship:
- the sequential-only mind set dominates
- parallel algorithms are mostly missing
- very difficult to program
- useful abstractions are mostly missing

Very Difficult to Program

Programming skills are needed that go far beyond sequential programming:
- massive synchronization overhead
- race conditions: a source of bugs
- non-determinism: new types of bugs
- language and tool support are missing

Useful Abstractions Mostly Missing

Parallelism models are machine-specific and low-level: shared memory or message passing (hardware features). Parallel programming happens at assembly-language level: multi-threading, semaphores, locking (compare-and-swap). Performance models are machine-specific, too. The result: problems in portability, investment reuse, and economies of scale.

Which Programming Model to Use?

Stubborn consensus on the von Neumann paradigm: its enforced, monopoly-like dominance is the key problem, and it is incredibly inefficient (the von Neumann syndrome). There is no consensus on a parallelism model: data parallelism, message passing, or (unstructured) multi-threading; many applications use all three or even more. Language and tool support is needed to integrate these models. The programmer population is unqualified: an education reform is needed.

Program Performance (continued)

"Multicore computers shift the burden of software performance from chip designers to programmers." [J. Larus: Spending Moore's Dividend; Communications of the ACM, May 2009] ... performance drops and other problems in moving from single-core to multicore ... Since people have to write code differently anyway, we need a Software Education Revolution: the chance to move RC* from niche to mainstream. Programmer population and methodology are missing: a scenario like before the Mead-&-Conway revolution.

*) RC = Reconfigurable Computing

Power Consumption of Computers

... has become an industry-wide issue: incremental improvements are on track (plain "green computing"), but we may ultimately need revolutionary new solutions. [Horst Simon, LBNL, Berkeley]

Twin-paradigm green computing would be more effective by orders of magnitude. Energy cost may overtake IT equipment cost in the near future; current trends will lead to unaffordable operation cost of our cyber infrastructure. [Albert Zomaya]

For a Booming Multicore Era

[Chart: relative performance vs. year, 1994-2030. At the end of the single-core era overall performance growth slows down; von-Neumann-only parallelism (x86, GPGPU) grows much more slowly than Moore's law.]

Von-Neumann-only is not the silver bullet. Reconfigurable Computing is.

From CPU to RPU: Machine Model

RPU = Reconfigurable Processing Unit. Right now:

machine model    | resources: property / programming source       | sequencer: property / programming source
ASIC accelerator | hardwired / -                                  | hardwired / -
CPU              | hardwired / -                                  | programmable / Software (instruction streams)
RPU accelerator  | programmable / Configware (configuration code) | programmable / Flowware (data streams)

Now accelerators are programmable! The RPU is non-von-Neumann:

its state register is not a program counter but data counters, and we need two more program sources (Configware and Flowware).

A Multicore Submarine Model?

Mapping parallelism just into the time domain, abstracting away the space domain, is fatal. C is not the silver bullet: it is inherently serial. But nobody wants to learn a new language, and there is no easy way to program in parallel. The programmer needs to understand how data flows through cores, accelerators, interconnect and peripherals. The datastream model of the twin-paradigm approach helps to understand the space domain and parallelism. The programmer* needs system visualization in the space domain to understand performance under parallelism.

*) and, especially, the student

Outline: The single-paradigm dictatorship / Von Neumann vs. FPGA / The Datastream Machine Model / Avoiding address computation overhead / The twin-paradigm approach / Conclusions

The First Reconfigurable Computer

Prototyped in 1884 by Herman Hollerith, a century before the introduction of the FPGA, and data-stream-based. 60 years later the von Neumann (vN) model took over: instruction-stream-based. The FPGA's basic element: the LUT (lookup table).

Widening the Semantic Gap [Harold "Bud" Lawson]

Unnecessary complexity inside:
- Burroughs B5000/5500: a language-friendly stack machine
- IBM 360/370 & Intel x86: highly complex instruction sets
- MULTICS (GE, Honeywell): well manageable (implemented in PL/1)
- UNIX: complexity problems, compatibility problems
- Pascal killed by C, coming as an infection along with UNIX
- KARL killed by VHDL, an infection coming along with Ada

Languages Turned into Religions

"Java is a religion, not a language." [Yale Patt] We teach students the tunnel view of language designers, falling in love with the subtleties of formalisms instead of meeting the needs of the user.

"It is alarming: it appeals to people who do not know what they are doing." [Fred Brooks] Mastering even small complexity creates a deep feeling of satisfaction without solving the real problem. The transition from machine level to higher-level languages led to the biggest productivity gain ever made. It is alarming that today's megabytes of code are compiled from languages at low abstraction levels (C, C++, Java).

The Catastrophe Gap [Harold "Bud" Lawson]

[Chart: over the years, unnecessary complexity grows faster than our ability to master complexity; the widening difference is the catastrophe gap.]

Migration to FPGAs: the silver bullet? One of the reasons: the von Neumann syndrome.

Speed-up Factors by GPGPUs (1)

[Michael Garland, NVIDIA Research: Parallel Computing on Manycore GPUs; IPDPS, Rome, Italy, June 25-29, 2009]

The power efficiency is disputable.

GPUs can be used only in certain ways (speed-ups up to ~150x*):
- such speed-ups by GPGPUs only for embarrassingly parallel applications
- effective only at problems that can be solved using stream processing
- the programmer has to learn irrelevant graphics concepts
- data copy from main memory to video memory is slow

[Chart: speed-up factors, Jan 2007 to Jan 2010, for Numerics, Imaging, Bioinformatics and Video; typical values range from ~18x to ~149x.]

*) migration from x86 single-core

Speed-up Factors by GPGPUs (2)

CUDA ZONE pages [NVIDIA Corp.], http://www.nvidia.co.uk/object/cuda_home_uk.html:

NVIDIA GeForce GTX | stream processor cores | minimum power supply recommended
275                | 240                    | 650-680 watt
295                | 480                    | 650-680 watt

The Compute Unified Device Architecture (CUDA) accelerates BLAS libraries (Basic Linear Algebra Subroutines).

Less flexible than FPGAs. Speed-ups up to ~600x* (EDA; GPGPU tool development started years earlier than for x86), from reviewed CUDA user submissions. For comparison: Intel Core2 Quad (desktop PCs): 4 cores; Intel Xeon "Nehalem-EX" for servers: 8 cores.

[Chart: reviewed CUDA-user-submitted speed-up factors, Jan 2007 to Jan 2010, by application area: CFD (Computational Fluid Dynamics), Cryptography, oil & gas, DCC (Digital Content Creation), DSP (Digital Signal Processing), Astrophysics, Bioinformatics, Graphics, Imaging, Numerics, Video & Audio. Most values lie between ~2x and ~150x; outliers reach ~270x, ~340x, ~470x and ~675x.]

*) migration from x86 single-core

Speed-up Factors Obtained by FPGA

Up to ~30,000x*, versus up to ~200x by GPU: almost 50x more, i.e. about two orders of magnitude.

[Chart: speed-up factors by software-to-configware migration, 1995-2010, roughly doubling every year (x10 every ~6 years). Examples: DES breaking 28,514x; DNA & protein sequencing 8,723x. Further application areas: real-time face detection, Reed-Solomon decoding, CT imaging, video-rate MAC, stereo vision, pattern recognition, SPIHT wavelet-based image compression, BLAST, FFT, protein identification, Viterbi decoding, Smith-Waterman pattern matching, molecular dynamics simulation, multimedia DSP, wireless, cryptography, astrophysics (GRAPE).]

"... design techniques will evolve, by necessity, to satisfy the demands of reconfigurable hardware and software programmability." [J. R. Rattner, DAC 2008]

Intel supports direct front-side-bus access by FPGAs.

Software vs. FPGA (2)

[Same chart as before, growing at ~x2 per year; additionally: massive energy-saving factors, roughly ~10% of the speed-up factor.]

RC*: Demonstrating the Intense Impact [Tarek El-Ghazawi]

[Tarek El-Ghazawi et al.: IEEE COMPUTER, February 2008] SGI Altix 4700 with RC100 RASC, compared to a Beowulf cluster:

application                | speed-up factor | power savings | cost savings | size savings
DNA and protein sequencing | 8,723           | 779           | 22           | 253
DES breaking               | 28,514          | 3,439         | 96           | 1,116

Much less memory and bandwidth needed, massively saving energy; much less equipment needed.

*) RC = Reconfigurable Computing

Why Such Speed-up Factors ...

... with FPGAs, a much worse technology: massive wiring overhead + massive reconfigurability overhead + routing congestion growing with FPGA size. This is the Reconfigurable Computing Paradox. Main reason: no von Neumann syndrome! More recently also: more platform FPGAs.

Outline: The single-paradigm dictatorship / Von Neumann vs. FPGA / The Datastream Machine Model / Avoiding address computation overhead / The twin-paradigm approach / Conclusions

Reconfigurability Per Se Is Not the Key

It is the paradigm that comes along with it. Note: no instruction fetch at run time! Data streams instead of instruction streams. Enabling technology for data sequencers brings further performance improvements. A non-reconfigurable example is the BEE

project (Bob Brodersen et al., UC Berkeley).

Data Stream: An Ambiguous Definition

Reconfigurable Computing is not instruction-stream-based; it is data-stream-based. This is different from the operation of the (indeterministic) dataflow machine, and other definitions also exist in the multimedia area. A usable definition comes from the systolic array area [H. T. Kung et al., 1979, 1980].

Introducing Data Streams

[Figure: a DPA (datapath array, a pipe network) with input data streams entering and output data streams leaving at its ports; each stream is a schedule of data items over time and port number. Execution is transport-triggered. No memory wall inside.]

Flowware defines which data item appears at which time at which port.

Classic Systolic Array Synthesis

Algebraic methods (i.e., linear projections) yield only uniform arrays with linear pipes, and only for applications with regular data dependencies.

The Counterpart of the von Neumann Machine: the Datastream Machine

[Figure: a coarse-grained (r)DPA fed from all sides by ASMs. Data counters, instead of a program counter, are located at the distributed ASM memory, not at the datapath. Each ASM contains a GAG, a RAM block and a data counter. ASM: Auto-Sequencing Memory.]

Who Generates the Data Streams?

Why the systolic array scene missed inventing the new machine paradigm: the reductionist

approach. "It's not our job: without a data sequencer it's not a machine!!" (and a data sequencer is not algebraic).

Algebraic Synthesis Methods

array                              | systolic array                           | super-systolic rDPA*
applications                       | regular data dependencies only           | no restrictions
pipeline properties: shape         | linear only                              | no restrictions
pipeline properties: resources     | uniform only                             | no restrictions
mapping                            | linear projection or algebraic synthesis | simulated annealing or P&R algorithm
scheduling (data stream formation) | algebraic synthesis                      | (e.g. force-directed) scheduling algorithm

*) KressArray [1995]

Generalization of the Systolic Array [Rainer Kress]

Discard algebraic synthesis methods; use optimization algorithms instead, for example simulated annealing. Flowware history: 1980: data streams (Kung, Leiserson); 1995: super-systolic rDPA

(Rainer Kress). The achievement: also nonlinear and non-uniform pipes, and even wilder pipe structures, become possible. 1996+: SCCC (LANL), SCORE, ASPRC, BEE (UCB). Now reconfigurability really makes sense.

KressArray Principles

Take the systolic array principles, but replace classical synthesis by simulated annealing. This yields the super-systolic array, a generalization of the systolic array no longer restricted to regular data dependencies: now reconfigurability makes sense.

Super-systolic Synthesis

The key is the mapping method, rather than the architecture: simulated annealing or a P&R algorithm instead of linear projection, and a (e.g. force-directed) scheduling algorithm instead of algebraic synthesis (see the table above).

Coarse-Grained Reconfigurable Array Example

Coming close to the programmer's mind set (much closer than an FPGA). Image processing: an SNN filter (mainly a pipe network) on a mesh-connected array with fast 3x3 on-chip RAM; exceptions to the mesh: see the ASMs. Note: a kind of software perspective, but without instruction streams.

[Figure legend: rDPU not used; rDPU used for routing only; operator and routing; port location marker; backbus connect, 32 bits wide. Array size: 10 x 16 = 160 rDPUs. Compiled by Nageldinger's KressArray Xplorer (Juergen Becker's CoDe-X inside).]

A typical question after the talk: "But with data streams and pipelining you cannot implement decisions!"

Hypothetical Branching Example to Illustrate Software-to-Configware Migration

S = R + (if C then A else B endif);

On the rDPU, a section of a major pipe network computes S directly: the condition C drives a multiplexer selecting A or B as input to the adder. No memory cycles; one clock cycle at 200 MHz = 5 ns. Speed-up factor: 100.

Simple conservative CPU example (one memory cycle = 100 ns):

step                                     | memory cycles | nanoseconds
read instruction                         | 1             | 100
instruction decoding                     | -             | -
if C then read A, else read B (operand*) | 1             | 100
operate & register transfers             | -             | -
read instruction                         | 1             | 100
instruction decoding                     | -             | -
read instruction                         | 1             | 100
instruction decoding                     | -             | -
add & store: operate & reg. transfers    | -             | -
store result                             | 1             | 100
total                                    | 5             | 500

*) if no intermediate storage in the register file

The Wrong Mind Set ...

S = R + (if C then A else B endif); "but you can't implement decisions" in a section of a very large pipe network? The embarrassing remark, not knowing this solution, is a symptom of the

hardware/software chasm and of the configware/software chasm.

Introducing Hardware Description Languages (in the mid-seventies)

The decision box becomes a (de)multiplexer. This is so simple: why did it take decades to find out? The wrong mind set, the wrong road map!

Xplorer Plot: SNN Filter Example

[13] http://kressarray.de

[Figure: rDPUs with 2 horizontal and 3 vertical nearest-neighbour ports, 32 bit each; route-thru-only rDPUs; operator, operand and result markers; route-thru and backbus connects.]
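The branching example S = R + (if C then A else B endif) and its cycle accounting can be sketched in a few lines. This is a minimal sketch: the constants (100 ns memory cycle, 5 ns pipe clock) are the slides' assumptions, and mux() is a hypothetical helper standing in for the (de)multiplexer that replaces the decision box.

```python
# Sketch of the software-to-configware migration example.
# Cycle constants are the slides' assumptions; mux() is illustrative.

MEM_CYCLE_NS = 100   # slide assumption: one memory cycle = 100 ns
PIPE_CLOCK_NS = 5    # slide assumption: rDPA clocked at 200 MHz

def mux(select, a, b):
    # the decision box becomes a (de)multiplexer
    return a if select else b

def pipe_network(R, A, B, C):
    # one pipe-network section: a mux feeding an adder,
    # delivering one result per clock, with no memory cycles
    return R + mux(C, A, B)

# CPU accounting from the slide: 5 memory cycles per result
cpu_ns = 5 * MEM_CYCLE_NS         # 500 ns
speedup = cpu_ns / PIPE_CLOCK_NS  # the slide's speed-up factor of 100
```

With the slide's numbers, the pipe network delivers one result every 5 ns clock while the conservative CPU needs 500 ns, hence the speed-up factor of 100.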

Programming Language Paradigms

aspect                       | computer languages                                                                                                                                 | languages for the anti machine
language category            | procedural                                                                                                                                         | procedural
operation sequence driven by | program counter (state register)                                                                                                                   | data counter(s) (state registers)
sequencing                   | read next instruction; goto / jump (to instr. addr.); instruction loop, loop nesting; no parallel loops; escapes; instruction stream branching     | read next data item; goto / jump (to data addr.); data loop, loop nesting; parallel loops; escapes; data stream branching
address computation          | massive memory-cycle overhead                                                                                                                      | overhead avoided
instruction fetch            | memory-cycle overhead                                                                                                                              | overhead avoided
parallel memory bank access  | interleaving only                                                                                                                                  | multiple GAGs: no restrictions
language features            | control flow + data manipulation                                                                                                                   | data streams only (no data manipulation)

Both categories are deterministic (traceable, checkpointable); the anti-machine languages are much simpler, much more powerful, and very easy to learn.

Double Dichotomy

1) Paradigm dichotomy: von Neumann machine: instruction stream (Software domain); Datastream machine: data stream (Flowware domain).

2) Relativity dichotomy: time: procedure (Software domain); space: structure (Configware domain).

Relativity Dichotomy

Time domain = procedure domain; space domain = structure domain. The von Neumann machine has 2 phases: 1) programming instruction streams, 2) run time. The Datastream machine has 3 phases: 1) reconfiguration of structures, 2) programming data streams, 3) run time.

From Time-Iterative to Space-Iterative

A time-to-space mapping turns n time steps on 1 CPU into 1 time step on n DPUs. Often the space dimension is limited: a time-to-space/time mapping turns n*k time steps on 1 CPU into

n time steps on k DPUs. Loop transformation methodology: from the 1970s and later, e.g. strip mining [D. Loveman, J-ACM, 1977]. Example: bubble sort migration.

Time-to-Space Mapping

Time domain (procedure domain): a time algorithm, i.e. a program loop: Bubble Sort, n x k time steps on 1 conditional-swap unit. Space domain (structure domain): a space algorithm, i.e. a pipeline: 1 time step on n DPUs. In between, a space/time algorithm: Shuffle Sort, k time steps on n conditional-swap units.
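The bubble-sort-to-shuffle-sort migration can be simulated in a few lines. A sketch under an assumption: "shuffle sort" is taken here to mean odd-even transposition sort, the classic parallel variant of bubble sort; the function names are illustrative, not from the slides.

```python
def conditional_swap(a, b):
    # one conditional-swap unit: outputs (smaller, larger)
    return (a, b) if a <= b else (b, a)

def shuffle_sort(xs):
    """Odd-even transposition sort: in each of n time steps, all
    disjoint neighbour pairs are swapped in parallel by the
    conditional-swap units; the parallel step is simulated
    sequentially here."""
    xs = list(xs)
    n = len(xs)
    for step in range(n):
        # pairs starting at even or odd index alternate per time step;
        # all swaps within one step are independent (one "time step")
        for i in range(step % 2, n - 1, 2):
            xs[i], xs[i + 1] = conditional_swap(xs[i], xs[i + 1])
    return xs
```

One conditional-swap unit needs on the order of n*k sequential steps; n units in space need only n time steps, which is exactly the time-to-space/time trade the slide describes.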

Loop Transformation Examples (resource-parameter-driven co-compilation)

Sequential process:      Loop unrolling:             Strip mining:
  loop 1..16               loop 1..8    loop 9..16     host:            reconf. array:
    body                     body         body           loop 1..8        fork
  endloop                  endloop      endloop            trigger          body  body
                                                         endloop          join

(Strip mining variants: the host loop runs 1..8, 1..4 or 1..2 trigger steps, depending on how many body instances, 2, 4 or 8, fit on the reconfigurable array.)
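The two transformations above can be sketched as index-set manipulations; unroll and strip_mine are hypothetical helper names, not from the slides, and the 0-based ranges stand in for the 1-based loop bounds of the figure.

```python
def unroll(total, ways):
    """Loop unrolling: split iterations 0..total-1 into `ways` chunks
    that can run as parallel loop bodies (e.g. 1..16 -> 1..8 || 9..16)."""
    size = total // ways
    return [list(range(k * size, (k + 1) * size)) for k in range(ways)]

def strip_mine(total, width):
    """Strip mining [Loveman 1977]: the host loops over strips and
    triggers the array; inside each strip, `width` body instances
    run in parallel on the reconfigurable array."""
    return [list(range(lo, min(lo + width, total)))
            for lo in range(0, total, width)]
```

For a 16-iteration loop on an array holding 8 bodies, strip_mine(16, 8) gives two strips: the host triggers twice, and each trigger runs 8 bodies in parallel.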

MPSoC Programming Model: Flowware [Pierre Paulin]

[Figure, source: MIT StreamIt: FMDemod -> Split -> LPF1 / LPF2 / LPF3 -> HPF1 / HPF2 / HPF3 -> Gather/Adder -> Speaker]

Pros of streaming:
- streamlined, low-overhead communication
- (more) deterministic behaviour
- good match for many simple media-rich applications

Cons:
- control-dominated applications (the shunt-yard problem)

We have to find out which application types and programming models students should exercise for the flowware approach.
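The split/gather structure of such a stream graph can be modelled with plain functions. A sketch only: the filter names and the moving-average "low-pass" are placeholders, not the StreamIt API.

```python
def low_pass(xs, k=3):
    # toy low-pass filter: moving average over the last k samples
    return [sum(xs[max(0, i - k + 1):i + 1]) / min(i + 1, k)
            for i in range(len(xs))]

def split_join(xs, branches):
    """Split duplicates the input stream into every branch;
    Gather/Adder recombines the branch outputs element-wise."""
    outs = [branch(xs) for branch in branches]
    return [sum(vals) for vals in zip(*outs)]
```

The whole graph is then a composition: the demodulated stream enters split_join with the LPF/HPF chains as branches, and the summed stream feeds the speaker. The deterministic, low-overhead behaviour comes from the fixed wiring of the graph, exactly the "pro" listed above.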

The New Paradigm: How the Data Are Traveling

No, not by instruction execution. Transport-triggered, an old hat:
- pipeline, or chaining
- asynchronous (via handshake)
- systolic array
- wavefront array

How the Data Are Moved

- DMA; the von Neumann "move processor" [Jack Lipovski, EUROMICRO, Nice, 1975]
- ASMs use the GAG, a generic address generator [TU-KL publications: Tokyo 1989 + NH journal]; by the way, a GAG was later patented by TI [TI patent, 1995]
- Henk Corporaal coined the term "transport-triggered"
- MoM: GAG-based storage scheme methodology [Herz*]
- application-specific distributed memory [Catthoor et al.]

*) see Michael Herz et al.: IEEE ICECS 2002 (Dubrovnik)

The Paradigm Shift to Data-Stream-Based

The method of communication and data transport: by software, the von Neumann syndrome; by configware, a complex pipe network on the rDPA.

Illustrating the von Neumann Paradigm Trap: the Watering Pot Model [Hartenstein]

The instruction-stream-based approach fills many watering pots through one von Neumann bottleneck. The data-stream-based approach has no von Neumann bottleneck.

Flowware Language Example (MoPL): Programming the Data Stream

The JPEG zigzag scan pattern over an 8 x 8 pixel map (an animation on the slide), traversed by two data counters. Main program: goto PixMap[1,1]; HalfZigZag; uturn (HalfZigZag).

*> Declarations:

EastScan is
  step by [1,0]
end EastScan;

SouthScan is
  step by [0,1]
end SouthScan;

SouthWestScan is
  loop 7 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;

NorthEastScan is
  loop 6 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;

HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan SouthScan NorthEastScan EastScan
  endloop
end HalfZigZag;
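The data-counter trajectory that such a flowware program describes can be generated in a few lines. A sketch, not a transcription of the MoPL declarations: it derives the full JPEG zigzag order directly from the anti-diagonal structure, using 0-based (x, y) coordinates.

```python
def zigzag(n=8):
    """Generate the JPEG zigzag scan as a stream of (x, y)
    data-counter values over an n x n pixel map (0-based)."""
    order = []
    for s in range(2 * n - 1):              # s = x + y indexes the anti-diagonals
        diag = [(x, s - x) for x in range(n) if 0 <= s - x < n]
        if s % 2 == 1:                      # every other diagonal runs reversed
            diag.reverse()
        order.extend(diag)
    return order
```

The first steps (0,0), (1,0), (0,1), (0,2), (1,1), (2,0) correspond to the EastScan, SouthWestScan, SouthScan and NorthEastScan moves of the MoPL program, shifted to 0-based coordinates.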

Outline: The single-paradigm dictatorship / Von Neumann vs. FPGA / The Datastream Machine Model / Avoiding address computation overhead / The twin-paradigm approach / Conclusions

Significance of Address Generators

Address generators have the potential to reduce computation time significantly. In a grid-based design rule check, a speed-up of more than 2000 was achieved compared to a VAX-11/750; dedicated address generators contributed a factor of 10 by avoiding memory cycles for address computation overhead.

Generic Address Generator (GAG)

A generalization of the DMA. Acceleration factors come from address computation without memory

cycles, from storage scheme optimization methodology, etc. GAG & enabling technology published in 1989; survey: [M. Herz et al.: IEEE ICECS 2002, Dubrovnik]; patented by TI in 1995.

ASM: Auto-Sequencing Memory

Also a generalization of the DMA. Acceleration factor: the generic address generator (GAG) performs address computation without memory cycles, partly explaining the RC paradox.

[Figure: an ASM = GAG + RAM + data counter; several ASMs shown. Data counters instead of a program counter. Further acceleration by storage scheme optimization methodology, etc.]

Migration Benefit by On-Chip RAM

Some RC chips have hundreds of on-chip RAM blocks, orders of magnitude faster than off-chip RAM, so that the drastic code size reduction by software-to-configware migration can beat the memory wall. Multiple on-chip RAM blocks are the enabling technology for ultra-fast anti-machine solutions. GAGs inside ASMs generate the data streams. GAG = generic address generator;

rDPA = rDPU array, i.e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter).

[Figure: an rDPA of 3 x 3 rDPUs surrounded by ASMs (Auto-Sequencing Memories); each ASM = GAG + RAM + data counter.]

Acceleration Mechanisms

- parallelism by multi-bank memory architecture
- auxiliary hardware for address calculation
- address calculation before run time
- avoiding multiple accesses to the same data
- avoiding memory cycles for address computation
- optimization by storage scheme transformations
- optimization by memory architecture transformations

Configware Compilation (Configware Engineering)

Configware compilation is fundamentally different from software compilation. A C, FORTRAN or MATLAB source program goes into the configware compiler, which consists of a mapper (placement & routing) and a data scheduler (programming the data counters). The outputs are configware code (the pipe network on the rDPA) and flowware code (the data streams for the ASMs).

Generic Sequence Examples (published in 1990)

[Figure: the GAG (Limit Slider L0, Base Slider B0, Address Stepper A) generates generic scan sequences: a) atomic scan, b) linear scan, c) video scan and rotated variants (-90 rotated, -45 rotated (mirx (v scan)), sheared), d) non-rectangular ("until") video scan, e) zigzag video scan, f) spiral scan, g) feed-back-driven scans, e.g. perfect shuffle.]

GAG Slider Model

[Figure: a scan pattern example illustrating the slider model over a 9 x 9 address space: a) the total address, composed from b) the x-address sliders and c) the y-address sliders; each dimension is generated by a Limit Slider, a Base Slider and an Address Stepper inside the GAG (Generic Address Generator).]

Outline: The single-paradigm dictatorship / Von Neumann vs. FPGA / The Datastream Machine Model / Avoiding address computation overhead / The twin-paradigm approach / Conclusions

Dual-Paradigm Mind Set: An Old Hat (mapping from the procedural to the structural domain)

In the software mind set (instruction-stream-based), a flow chart token bit evokes control actions; mapped into the hardware mind set, an action box becomes a flip-flop and a decision box a (de)multiplexer.

1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.
1972: C. G. Bell et al.: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans. C-21/5, May 1972

Nick Tredennick's Paradigm Shifts Explain the Differences

Software engineering: resources (the CPU): fixed; algorithm: variable (software); 1 programming source needed. Configware engineering (configware + flowware):

resources: variable; algorithm: variable; 2 programming sources needed.

Our Contemporary Computer Machine Model

machine model    | resources: property / programming source       | sequencer: property / programming source
ASIC accelerator | hardwired / -                                  | hardwired / -
CPU              | hardwired / -                                  | programmable / Software (instruction streams)
RPU accelerator  | programmable / Configware (configuration code) | programmable / Flowware (data streams)

The program counter sits in the CPU; the data counters sit in the RAM: data counters of reconfigurable address generators in ASM (auto-sequencing) data memory blocks. The twin-paradigm dichotomy uses the same language primitives!

Compilation: Software vs. Configware

Software engineering: a source program goes into the software compiler, yielding software code. Configware engineering: a C, FORTRAN or MATLAB source program goes into the configware compiler, consisting of a

mapper software compiler configware compiler datascheduler software code configware code flowware code 2009, [email protected] 68 http://hartenstein.de Co-Compilation http://hartenstein.de [email protected] C, FORTRAN, MATHLAB automatic SW / CW partitioner Software / Configware software Co-Compiler compiler mapper configware compiler datascheduler software code configware code flowware code 2009, [email protected] 69 http://hartenstein.de http://hartenstein.de [email protected]

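The automatic SW/CW partitioner at the front of such a co-compiler can be caricatured as follows. This is a hypothetical sketch (the block representation, the `weight` threshold, and all names are illustrative assumptions, not CoDe-X internals): loop-dominated kernels that map well onto a reconfigurable datapath array go to the configware side, everything else stays software.

```python
# Hypothetical sketch of automatic SW/CW partitioning: route regular,
# execution-time-dominant loop nests to the configware compiler and the
# rest to the software compiler. Thresholds and fields are illustrative.

def partition(blocks):
    """Split program blocks into a software part and a configware part.

    Each block is a dict with an estimated fraction of execution time
    ('weight') and a flag ('loop_nest') marking regular loop nests that
    are good candidates for a reconfigurable datapath array.
    """
    software, configware = [], []
    for b in blocks:
        if b["loop_nest"] and b["weight"] >= 0.1:
            configware.append(b["name"])   # -> configware + flowware code
        else:
            software.append(b["name"])     # -> instruction-stream code
    return software, configware

sw, cw = partition([
    {"name": "init",   "weight": 0.05, "loop_nest": False},
    {"name": "filter", "weight": 0.80, "loop_nest": True},
    {"name": "report", "weight": 0.15, "loop_nest": False},
])
```

Real partitioners use profiling and cost models rather than a fixed threshold; the sketch only shows the division of labor the slides describe.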
Co-Compiler for the Hardwired Anti Machine [e.g. Brodersen]
source -> automatic SW / CW partitioner of the Software / Flowware Co-Compiler
-> software compiler -> software code
-> flowware compiler (data scheduler) -> flowware code

A Heliocentric CS Model: the Generalization of Software Engineering
PE Program Engineering, with the time-to-space mapping issue at its center:
SE Software Engineering — the instruction-stream subdomain of PE
FE Flowware Engineering — the data-stream* subdomain (the other subdomain)
*) do not confuse with dataflow!
A Twin-Paradigm, Dual-Dichotomy Approach.

Time to Space Mapping
The machine-model dichotomy once more: the CPU (hardwired resources, programmable by Software as instruction streams, sequenced by a program counter) vs. the RPU accelerator (resources programmable by Configware as configuration code, data streams programmable by Flowware, sequenced by data counters). Time-to-space mapping: a loop turns into a pipeline — the Relativity Dichotomy of program counter vs. data counters.
David Parnas: "The biggest payoff will come from putting old ideas into practice and teaching people how to apply them properly."

Dual Paradigm Application Development
Road map to the Personal Supercomputer: a high-level-language software/configware co-compiler (Juergen Becker's CoDe-X, 1996):
C language source -> Partitioner
-> SW compiler -> software code (instruction-stream-based) for the hardwired CPU
-> CW compiler -> configware code (datastream-based) for the reconfigurable accelerator, an array of rDPUs

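The "a loop turns into a pipeline" idea of time-to-space mapping can be illustrated with a small, hypothetical example (the 3-tap running sum and both function names are mine, not from the slides): the same computation is executed in time, by one processing element iterating over the taps, and in space, with one stage per former loop turn.

```python
# Illustrative time-to-space mapping: the same 3-tap weighted running
# sum, once as a sequential loop (time) and once as a set of stages,
# one per former loop iteration (space).

def in_time(samples, taps):
    """Sequential loop: one processing element iterates over all taps."""
    out = []
    for i in range(len(taps) - 1, len(samples)):
        acc = 0
        for j, t in enumerate(taps):       # the loop, executed in time
            acc += t * samples[i - j]
        out.append(acc)
    return out

def in_space(samples, taps):
    """Each former loop turn becomes its own stage (spatial replication)."""
    stages = [lambda s, i, j=j, t=t: t * s[i - j]   # one stage per tap
              for j, t in enumerate(taps)]
    return [sum(stage(samples, i) for stage in stages)
            for i in range(len(taps) - 1, len(samples))]

x = [1, 2, 3, 4, 5]
assert in_time(x, [1, 1, 1]) == in_space(x, [1, 1, 1])  # same results
```

In hardware the spatial stages would run concurrently on streaming data; here Python only demonstrates that the two formulations compute the same thing.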
>> Outline <<
The single-paradigm dictatorship
Von Neumann vs. FPGA
The Datastream Machine Model
Avoiding address computation overhead
The twin Paradigm approach
Conclusions

Ways to implement an Algorithm
hardwired: Hardware
RAM-based: Software — the von Neumann machine: single-core, multicore, manycore
RAM-based: Configware + Flowware — the datastream machine: manycore per se
mixed

Flowware
Flowware means parallelism resulting from time-to-space migration.
Flowware: scheduling data streams — a generalization of the systolic array. It supports any wildly free form of pipe network: spiral, zigzag, fork and join, and even wilder; unidirectional and fully or partially bidirectional; FIFOs, stacks, registers, register files, RAM blocks...

Software Education (R)evolution
Step by step, not by overthrowing the SE scene: simultaneous dual-domain co-education — traditional qualification in the time domain + lean qualification in the space domain. Lean? Hardware-modeling qualification at a higher level of abstraction, so as not to scare away undergraduate students. A viable methodology for dual-rail education (only a few % of curricula need to be changed).

RC versus Multicore (RC = Reconfigurable Computing)
RC: speed-up often higher by orders of magnitude — sure!
RC: energy efficiency often much higher, or higher by orders of magnitude — sure!
Is multicore the silver bullet? We need both, Multicore and RC: legacy software, control-intensive applications, etc.

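The flowware-style pipe networks mentioned earlier (fork and join, streams moving through a fixed network rather than being driven by an instruction stream) can be toy-modeled with generators. This is a deliberately simple sketch; all names are hypothetical and Python generators only stand in for hardware pipes.

```python
# Toy model of flowware-style datastreams (not dataflow!): generators
# stand in for streams moving through a small pipe network with one
# fork and one join. The network's shape determines the computation.

def source(n):
    yield from range(n)              # the data stream entering the network

def fork(stream):
    data = list(stream)              # duplicate the stream onto two pipes
    return iter(data), iter(data)

def scale(stream, k):
    for x in stream:                 # a datapath stage on one pipe
        yield k * x

def join(a, b):
    for x, y in zip(a, b):           # merge the two pipes pairwise
        yield x + y

left, right = fork(source(4))
result = list(join(scale(left, 10), scale(right, 1)))
```

In a reconfigurable array the fork, stages, and join would all operate concurrently as data streams through them; the sketch only shows the network-shaped, instruction-counter-free style of description.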
We need new courses
We need undergraduate lab courses with HW / CW / SW partitioning.
We need new courses with an extended scope on parallelism and algorithmic cleverness for HW / CW / SW co-design.
We urgently need a Mead-&-Conway-like text book [R. H., Dagstuhl Seminar 03301, Germany, 2003] — here its forerunner (2007), but not yet twin paradigm.

SERUM-RC*
We urgently need a Mead-&-Conway-style new mass-movement community: the Software Education Revolution for Using Multicore and RC* (SERUM-RC*).
*) RC = Reconfigurable Computing
We urgently need a Mead-&-Conway-dimension text book on twin-paradigm programming education.

thank you

END
