Computer Classes: Why they form, and what's new 'this' time ...

All the chips outside and around the PC: what new platforms? Apps? Challenges, what's interesting, and what needs doing?
Gordon Bell, Bay Area Research Center, Microsoft Corporation
Copyright Gordon Bell & Jim Gray ISCA2000

Architecture changes when everyone and everything is mobile!
  Power, security, RF, WWW, display, data-types (e.g. video & voice): it's the application of architecture!

The architecture problem
  The apps
    Data-types: video, voice, RF, etc.
    Environment: power, speed, cost
  The material: clock, transistors
  Performance: it's about parallelism
    Program & programming environment
    Network, e.g. WWW and Grid
    Clusters
    Multiprocessors
    Storage, cluster, and network interconnect
    Processor and special processing
    Multi-threading and multiple processors per chip
    Instruction Level Parallelism vs vector processors

IP On Everything

poochi

Sony Playstation export limits

PC At An Inflection Point?
  It needs to continue to be upward.
  These scalable systems provide the highest technical (Flops) and commercial (TPC) performance.
  They drive microprocessor competition!
  [Figure labels: PCs; Non-PC devices and Internet]

The Dawn Of The PC-Plus Era, Not The Post-PC Era
  Devices aggregate via PCs!!!
  [Figure labels: TV/AV; Mobile Companions; Consumer PCs; Communications; Automation & Security; Household Management]

PC will prevail for the next decade as a dominant platform, 2nd to smart, mobile devices
  Moore's Law increases performance, and alternatively reduces prices
  PC server clusters with low cost OS beat proprietary switches, smPs, and DSMs
  Home entertainment & control
  Very large disks (1TB by 2005) to store everything
  Screens to enhance use
  Mobile devices, etc. dominate WWW >2003!
  Voice and video become important apps!
  C = Commercial; C = Consumer

Where's the action? Problems?
  Constraints: speech, video, mobility, RF, GPS, security
  Moore's Law, including network speed
  Scalability and high performance processing
    Building them: clusters vs DSM
    Structure: where's the processing, memory, and switches (disk and ip/tcp processing)
    Micros: getting the most from the nodes
  Not ISAs: change can delay the Moore's Law effect and wipe out software investment! Please, please, just interpret my object code!
  System on a chip alternatives: apps drive
    Data-types (e.g. video, voice, RF)
    Performance, portability/power, and cost

High Performance Computing
A 60+ year view

High performance architecture/program timeline
[Figure: timeline, 1950-2000. Technology: Vtubes, Trans., MSI (mini), Micro, RISC, nMicr. Programming: sequential programming (single execution stream). Other labels: in situ resources; 100x in //sm; geographically dispersed; Grid; NOW; VLSCC.]

Computer types
[Figure: computer types arranged by connectivity (WAN/LAN, SAN, DSM, SM). Labels: networked supers, GRID, Legion, Condor, Beowulf, NOW, NT clusters, SP2 (mP), T3E, SGI DSM, clusters & multis, WSs, PCs, mainframes, VPP uni, NEC super, NEC mP, Cray XT (all mPv). Groupings: clusters, vector, micros.]

Technical computer types
[Figure: same layout, split into "Old World (one program stream)" and "New world: Clustered Computing (multiple program streams)". Labels: networked supers, GRID, Legion, Condor, Beowulf, NOW, SP2 (mP), SGI DSM, clusters & multis, WSs, PCs, mainframes, VPP uni, NEC super, NEC mP, Cray XT, T series (all mPv).]

Dead Supercomputer Society

ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics

SCI Research c1985-1995
  35 university and corporate R&D projects
  2 or 3 successes
  All the rest failed to work or be successful

How to build scalables?
  To cluster or not to cluster: don't we need a single, shared memory?

Application Taxonomy
  Technical
    General purpose, nonparallelizable codes (PCs have it!)
    Vectorizable
    Vectorizable & //able (Supers & small DSMs)
    Hand tuned, one-of
    MPP coarse grain
    MPP embarrassingly // (Clusters of PCs...)
  Commercial
    Database
    Database/TP
    Web Host
    Stream Audio/Video
    If central control & rich then IBM or large SMPs, else PC Clusters

SNAP c1995: Scalable Network And Platforms
A View of Computing in 2000+
We all missed the impact of WWW!
Gordon Bell, Jim Gray
[Figure labels: Network; Platform]

Computing SNAP built entirely from PCs
[Figure labels: person servers (PCs); portables; mobile nets; wide-area global network; wide & local area networks for terminal, PC, workstation, & servers; ??? TC=TV+PC home ... (CATV or ATM or satellite); legacy mainframe & minicomputer servers & terminals; centralized & departmental uni- & mP servers (UNIX & NT); centralized & departmental servers built from PCs; scalable computers built from PCs]
A space, time (bandwidth), & generation scalable environment

Bell Prize and Future Peak Tflops (t)
[Figure: peak Tflops (log scale, 0.0001 to 1000) vs. year, 1985-2010. Labeled points: XMP, NCube, CM2, NEC. *IBM Petaflops study target.]

Top 10 tpc-c
  Top two Compaq systems are:
    1.1 & 1.5X faster than IBM SPs
    1/3 price of IBM
    1/5 price of SUN

Five Scalabilities
Courtesy of Dr. Thomas Sterling, Caltech
  Size scalable -- designed from a few components, with no bottlenecks
  Generation scaling -- no rewrite/recompile or user effort to run across generations of an architecture
  Reliability scaling -- choose any level
  Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
  Problem x machine scalability -- ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer
  Problem x machine space => run time: problem scale, machine scale (#p), and run time imply speedup and efficiency

Why I gave up on large smPs & DSMs
  Economics: Perf/Cost is lower unless a commodity
  Economics: Longer design time & life. Complex. => Poorer tech tracking & end of life performance.
  Economics: Higher, uncompetitive costs for processor & switching. Sole sourcing of the complete system.
  DSMs are NUMA! Latency matters. Compiler, run-time, O/S locate the programs anyway.
  Aren't scalable. Reliability requires clusters. Start there.
  They aren't needed for most apps; hence, a small market unless one can find a way to lock in a user base. Important, as in the case of IBM Token Rings vs Ethernet.
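The last of the five scalabilities ties problem scale, machine scale (#p), and run time to speedup and efficiency. A minimal Python sketch of the standard definitions is given here; the function names and the timing figures are hypothetical, chosen only for illustration, and do not come from the slides.

```python
def speedup(serial_time: float, parallel_time: float) -> float:
    """Ratio of one-processor run time to p-processor run time."""
    return serial_time / parallel_time


def efficiency(serial_time: float, parallel_time: float, processors: int) -> float:
    """Speedup per processor; 1.0 would mean perfect scaling."""
    return speedup(serial_time, parallel_time) / processors


# Hypothetical timings: 100 s on one processor, 12.5 s on 16 processors.
s = speedup(100.0, 12.5)
e = efficiency(100.0, 12.5, 16)
print(f"speedup = {s:.1f}, efficiency = {e:.2f}")
```

Holding the problem fixed and raising #p usually lowers efficiency, which is why the slide treats problem scale and machine scale as a joint space.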

FVCORE Performance
Finite Volume Community Climate Model, joint code development: NASA, LLNL and NCAR
[Figure: GFlops (0-50) vs. number of SGI processors (0-600). Series: MPI on SGI, MLP on SGI. Reference levels: SX-5, SX-4, Max C90-16, Max T3E.]

Architectural Contrasts: Vector vs Microprocessor
                      Vector System                 Microprocessor System
  Clock               500 MHz                       600 MHz
  Results per clock   Two                           Two (will be 4 in next-gen SGI)
  Vector lengths      Fixed                         Arbitrary
  Vector feed         High speed                    Low speed
  Between CPU and     Vector registers, 8 KBytes    1st & 2nd level caches, 8 MBytes
  memory

Cache based systems are nothing more than vector processors with a highly programmable vector register set (the caches). These caches are 1000x larger than the vector registers on a Cray vector system, and provide the opportunity to execute vector work at a very high sustained rate. In particular, note that 512-CPU Origins contain 4 GBytes of cache. This is larger than most problems of interest, and offers a tremendous opportunity for high performance across a large number of CPUs. This has been borne out in fact at NASA Ames.

Evolution of scalable multiprocessors, multicomputers, & workstations to shared memory computers
[Figure: evolution diagram; recoverable labels follow.]
  Limited scalability: mP, uniform memory access. mPs continue to be the main line.
    mP mainframe, super: Convex, Cray, Fujitsu, IBM, Hitachi, NEC mainframes & supers
    mP bus-based multi (mini, W/S, micros): DEC, Encore, Sequent, Stratus, SGI, SUN, etc.
    mP ring-based multi
  Scalable mP: smP, non-uniform memory access
    Experimental: Cm* ('75), Butterfly ('85), Cedar ('88)
    1st smP, 0 cache; smP DSM, some cache: DASH, Convex, Cray T3D, SCI
    smP all-cache architecture: KSR Allcache
    Next-gen smP research, e.g. DDM, DASH+
  Scalable multicomputer: smC, non-uniform memory access
    1st smC (hypercube, grid): Cosmic Cube, iPSC 1, NCUBE, Transputer-based
    smC med-coarse grain: Fujitsu, Intel, Meiko, NCUBE, TMC; 1985-1994
    smC fine-grain: Mosaic-C, J-machine
    smC next gen.: DSM => smP
  Networked workstations: smC, very coarse grain
    Apollo, SUN, HP, etc.; high bandwidth switch, comm. protocols e.g. ATM
    smC coarse-grain clusters: WSs, clusters via special switches 1994 & ATM 1995
  Convergence, 1995?: natural evolution to one architecture (WS micros, fast switch, cache for locality); DSM??
  Note, only two structures: 1. shared memory mP with uniform & non-uniform memory access; and 2. networked workstations, shared nothing

Jim, what are the architectural challenges for clusters?
  WANs (and even LANs) faster than backplanes at 40 Gbps
  End of busses (fc=100 MBps) except on a chip
  What are the building blocks or combinations of processing, memory, & storage?
  Infiniband starts at OC48, but it may not go far or fast enough if it ever exists. OC192 is being deployed.

What is the basic structure of these scalable systems?

  Overall
  Disk connection, especially with respect to fiber channel
  SAN, especially with fast WANs & LANs

Modern scalable switches also hide a supercomputer
  Scale from <1 to 120 Tbps of switch capacity
  1 Gbps ethernet switches scale to 10s of Gbps
  SP2 scales from 1.2 Gbps

GB plumbing from the baroque: evolving from the 2 dance-hall model
[Diagram: Mp - S - Pc, with S.fc - Ms, S.Cluster, and S.WAN, evolving to MpPcMs - S.Lan/Cluster/Wan]

SNAP Architecture

ISTORE Hardware Vision
  System-on-a-chip enables computer, memory, without significantly increasing size of disk
  5-7 year target:
    MicroDrive: 1.7 x 1.4 x 0.2
      1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
      2006: 9 GB, 50 MB/s ? (1.6X/yr capacity, 1.4X/yr BW)
    Integrated IRAM processor
      2x height
      Connected via crossbar switch growing like Moore's law
      16 Mbytes; 1.6 Gflops; 6.4 Gops
    10,000+ nodes in one rack!
    100/board = 1 TB; 0.16 Tflops

The Disk Farm? or a System On a Card?
  The 500GB disc card: an array of discs (14")
  Can be used as:
    100 discs
    1 striped disc
    50 FT discs
    ....etc
  LOTS of accesses/second of bandwidth
  A few disks are replaced by 10s of Gbytes of RAM and a processor to run Apps!!

Map of Gray Bell Prize results
  Single-thread, single-stream tcp/ip, desktop-to-desktop, Win 2K out of the box performance*
  [Map labels: Redmond/Seattle, WA; New York; Arlington, VA; San Francisco, CA; via 7 hops; 5626 km, 10 hops]
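The ISTORE slide's 2006 MicroDrive targets (9 GB, 50 MB/s) follow from compounding the 1999 figures at the stated yearly rates. The sketch below checks that arithmetic; the helper name `project` is hypothetical, and decimal units (1 GB = 1000 MB) are an assumption.

```python
def project(value: float, yearly_factor: float, years: int) -> float:
    """Compound a starting value by a fixed yearly growth factor."""
    return value * yearly_factor ** years


YEARS = 2006 - 1999  # seven years between the two data points on the slide

# 1999 MicroDrive: 340 MB and 5 MB/s; growth of 1.6X/yr capacity, 1.4X/yr bandwidth.
capacity_gb = project(340.0, 1.6, YEARS) / 1000.0  # MB -> GB, decimal units assumed
bandwidth_mb_s = project(5.0, 1.4, YEARS)

print(f"2006 capacity ~ {capacity_gb:.1f} GB, bandwidth ~ {bandwidth_mb_s:.1f} MB/s")
```

Both projections land within rounding of the slide's 9 GB and 50 MB/s, so the question mark on the 2006 line concerns whether the growth rates hold, not the arithmetic.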

Ubiquitous 10 GBps SANs in 5 years
  1 Gbps Ethernet is a reality now. Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM, ...
  10 Gbps x4 WDM deployed now (OC192)
  3 Tbps WDM working in lab
  In 5 years, expect 10x, wow!!
  [Figure labels: 1 GBps; 120 MBps (1 Gbps); 80 MBps; 40 MBps; 20 MBps; 5 MBps]

The Promise of SAN/VIA: 10x in 2 years
  Yesterday:
    10 MBps (100 Mbps Ethernet)
    ~20 MBps tcp/ip saturates 2 cpus
    round-trip latency ~250 µs
  Now:
    Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, ...
    Fast user-level communication
      tcp/ip ~100 MBps, 10% cpu
      round-trip latency is 15 µs
    1.6 Gbps demoed on a WAN
  [Figure: time (µs, 0-250) to send 1 KB over 100 Mbps, Gbps, and SAN, broken down into transmit, receiver cpu, and sender cpu.]

Processor improvements: 90% of ISCA's focus

We get more of everything

Mainframes, minis, micros, and RISC
[Figure: Performance vs Time for Several Computers. Performance (VAX 780s), log scale 0.1-100, vs. year, 1980-1990. Trend lines: RISC 60%/yr; CMOS CISC 38%/yr; ECL 15%/yr. Labeled machines: 780 (TTL, 5 MHz), 8600, 9000, MV10K, 68K, uVAX, uVAX 6K (CMOS), MIPS (8 MHz), MIPS 25 MHz, 4K, MIPS (65 MHz).]
  Will RISC continue on a 60%/yr (x4 / 3 years) Moore's speed law?

Computer ops/sec x word length / $
[Figure: log scale 1.E-06 to 1.E+09 vs. year, 1880-2000. Three regimes: doubles every 7.5, doubles every 2.3, doubles every 1.0. Fitted curves: 1.565^(t-1959.4) and y = 1E-248 e^(0.2918x).]

Growth of microprocessor performance
[Figure: performance in Mflop/s (log scale, 0.01 to 10000) vs. year, 1980-1998. Supers: Cray 1S, Cray X-MP, Cray 2, Cray Y-MP, Cray C90, Cray T90. Micros: 8087, 80287, 6881, 80387, R2000, i860, RS6000/540, Alpha, RS6000/590, Alpha.]

Albert Yu predictions 96
  When     Clock (MHz)   MTransistors   Mops     Die (sq. in.)
  2000     900           40             2400     1.1
  2006     4000          350            20,000   1.4
  Ratio    4.4x          8.75x          8.3x     1.3x

Processor Limit: DRAM Gap
[Figure: performance (log scale, 1-1000) vs. year, 1980-2000. CPU ("Moore's Law"): 60%/yr. DRAM: 7%/yr. Processor-memory performance gap grows 50%/year.]
  Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks x 4, or 432 instructions
  Caches in Pentium Pro: 64% area, 88% transistors
  *Taken from Patterson-Keeton talk to SIGMOD

The memory gap
  Multiple (e.g. 4) processors/chip in order to increase the ops/chip while waiting for the inevitable access delays
  Or alternatively, multi-threading (MTA)
  Vector processors with a supporting memory system
  System-on-a-chip to reduce chip boundary crossings
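The DRAM-gap figures above can be checked with two lines of arithmetic: the gap's growth rate is the ratio of the two improvement rates, and the cost of a full cache miss is the miss time in clocks times the issue width. The Python sketch below is illustrative only; the function names are hypothetical, and the 600 MHz clock used to reproduce the slide's 108 clocks is an assumption (the slide states only the rounded 1.7 ns cycle time).

```python
def gap_growth(cpu_rate: float, dram_rate: float) -> float:
    """Yearly growth factor of the processor/DRAM performance ratio."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate)


def miss_cost(miss_ns: float, cycle_ns: float, issue_width: int) -> tuple[float, float]:
    """Return (clocks lost, instruction slots lost) for one full cache miss."""
    clocks = miss_ns / cycle_ns
    return clocks, clocks * issue_width


# CPU at 60%/yr vs DRAM at 7%/yr: the gap grows by roughly 50% per year.
growth = gap_growth(0.60, 0.07)

# With the slide's rounded 1.7 ns cycle the quotient is about 106 clocks.
clocks_rounded, slots_rounded = miss_cost(180.0, 1.7, 4)

# Assumption: a 600 MHz clock (cycle of about 1.667 ns) gives the slide's 108 and 432.
clocks_600, slots_600 = miss_cost(180.0, 1000.0 / 600.0, 4)

print(f"gap grows {100 * (growth - 1):.1f}%/yr")
print(f"1.7 ns cycle: {clocks_rounded:.1f} clocks, {slots_rounded:.0f} slots")
print(f"600 MHz:      {clocks_600:.1f} clocks, {slots_600:.0f} slots")
```

Either way, a single miss to DRAM costs several hundred instruction slots, which is the motivation for the multi-processor, multi-threading, vector, and system-on-a-chip options listed on the memory gap slide.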

If system-on-a-chip is the answer, what is the problem?
  Small, high volume products
    Phones, PDAs
    Toys & games (to sell batteries)
    Cars
    Home appliances
    TV & video
  Communication infrastructure
  Plain old computers and portables

SOC Alternatives, not including C/C++ CAD Tools
  The blank sheet of paper: FPGA
  Auto design of a basic system: Tensilica
  Standardized, committee designed components*, cells, and custom IP
  Standard components including more application specific processors*, IP add-ons and custom
  One chip does it all: SMOP
  *Processors, Memory, Communication & Memory Links

Xilinx 10Mg, 500Mt, .12 mic

Free 32 bit processor core

System-on-a-chip alternatives
  Approach           Description                                        Vendors
  FPGA               Sea of un-committed gate arrays                    Xilinx, Altera
  Compile a system   Unique processor for every app                     Tensilica
  Systolic | array   Many pipelined or parallel processors + custom
  DSP | VLIW         Spec. purpose processors cores + custom            TI
  Pc & Mp. ASICS     Gen. purpose cores. Specialized by I/O, etc.       IBM, Intel, Lucent
  Universal Micro    Multiprocessor array, programmable I/O             Cradle

Cradle: Universal Microsystem trading Verilog & hardware for C/C++
  UMS : VLSI = microprocessor : special systems
  Software : Hardware
  Single part for all apps
  App [email protected] run time using FPGA & ROM
  5 quad mPs at 3 Gflops/quad = 15 Gflops
  Single shared memory space, caches
  Programmable periphery including: PCI, 100 baseT, firewire
  1 GB/s; 2.5 Gips
  $4 per flops; 150 mW/Gflops
  [Diagram labels: PROG I/O; M, S, P blocks (x4); DRAM]
  Memory bandwidth scales with processing
  Scalable processing, software, I/O
  Each app runs on its own pool of processors
  Enables durable, portable intellectual property

Recapping the challenges
  Scalable systems
    Latency in a distributed memory
    Structure of the system and nodes
    Network performance for OC192 (10 Gbps)
    Processing nodes and legacy software
  Mobile systems: power, RF, voice, I/O
  Design time!

The End
