CPSC 367: Parallel Computing


Chapter 2 Parallel Architectures Outline Some chapter references Brief review of complexity Terminology for comparisons Interconnection networks Processor arrays Multiprocessors Multicomputers

Flynn's Taxonomy moved to Chpt 1 2 Some Chapter References Selim Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989 (earlier textbook). G. C. Fox, What Have We Learnt from Using Real Parallel Machines to Solve Real Problems? Technical Report C3P-522, Cal Tech, December 1989. (Included in part in more recent books co-authored by Fox.) A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition, Addison Wesley, 2003 (first edition 1994). Harry Jordan, Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003, Ch 1, 3-5. 3

References - continued Gregory Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, 2nd Edition, Ch 2. (Discusses details of some serious problems that MIMDs incur.) Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2004 (current textbook), Chapter 2. Michael Quinn, Parallel Computing: Theory and Practice, McGraw-Hill, 1994, Ch. 1-2. Sayed H. Roosta, Parallel Processing & Parallel Algorithms: Theory and Computation, Springer Verlag, 2000, Chpt 1. Wilkinson & Allen, Parallel Programming: Techniques and Applications, Prentice Hall, 2nd Edition, 2005, Ch 1-2. 4 Brief Review Complexity Concepts Needed for Comparisons

Whenever we define a counting function, we usually characterize the growth rate of that function in terms of complexity classes. Definition: We say a function f(n) is in O(g(n)) if (and only if) there are positive constants c and n0 such that 0 ≤ f(n) ≤ c·g(n) for n ≥ n0. O(n) is read as "big-oh of n". This notation can be used to separate counts into complexity classes that characterize the size of the count. We can use it for any kind of counting function, such as timings, bisection widths, etc. 5 Big-Oh and Asymptotic Growth Rate The big-Oh notation gives an upper bound on the (asymptotic) growth rate of a function The statement f(n) is O(g(n)) means that the growth

rate of f(n) is no more than the growth rate of g(n) We can use the big-Oh notation to rank functions according to their growth rate:

                     f(n) is O(g(n))   g(n) is O(f(n))
g(n) grows faster    Yes               No
f(n) grows faster    No                Yes
Same growth          Yes               Yes
6

Relatives of Big-Oh big-Omega f(n) is Ω(g(n)) if there is a constant c > 0 and an integer constant n0 ≥ 1 such that f(n) ≥ c·g(n) for n ≥ n0 Intuitively, this says that, up to a constant factor, f(n) asymptotically is greater than or equal to g(n) big-Theta

f(n) is Θ(g(n)) if there are constants c′ > 0 and c″ > 0 and an integer constant n0 ≥ 1 such that 0 ≤ c′·g(n) ≤ f(n) ≤ c″·g(n) for n ≥ n0 Intuitively, this says that, up to a constant factor, f(n) and g(n) are asymptotically the same. Note: These concepts are covered in algorithm courses 7

Relatives of Big-Oh little-oh f(n) is o(g(n)) if, for any constant c > 0, there is an integer constant n0 ≥ 0 such that 0 ≤ f(n) < c·g(n) for n ≥ n0 Intuitively, this says f(n) is, up to a constant, asymptotically strictly less than g(n), so f(n) ∉ Θ(g(n)). little-omega f(n) is ω(g(n)) if, for any constant c > 0, there is an integer constant n0 ≥ 0 such that f(n) > c·g(n) ≥ 0 for n ≥ n0 Intuitively, this says f(n) is, up to a constant, asymptotically strictly greater than g(n), so f(n) ∉ Θ(g(n)). These are not used as much as the earlier definitions, but they round out the picture. 8

Summary for Intuition for Asymptotic Notation big-Oh f(n) is O(g(n)) if f(n) is asymptotically less than or equal to g(n) big-Omega f(n) is Ω(g(n)) if f(n) is asymptotically greater than or equal to g(n) big-Theta f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n) little-oh

f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n) little-omega f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n) 9 A CALCULUS DEFINITION OF O, Θ (often easier to use) Definition: Let f and g be functions defined on the positive integers with nonnegative values. We say g is in O(f) if and only if lim_{n→∞} g(n)/f(n) = c for some nonnegative real number c, i.e. the limit exists and is not infinite. Definition: We say f is in Θ(g) if and only if

f is in O(g) and g is in O(f) Note: We often use L'Hopital's Rule to calculate the limits needed. 10 Why Asymptotic Behavior is Important 1) Allows us to compare counts on large sets. 2) Helps us understand the maximum size of input that can be handled in a given time, provided we know the environment in which we are running. 3) Stresses the fact that even dramatic speedups in hardware do not overcome the handicap of an asymptotically slow algorithm. 11
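Point 3) can be checked numerically. A minimal sketch, using the CRAY/TRS-80 running-time functions from the Baase example that appears on the next slides: a fast machine running a Θ(n³) algorithm eventually loses to a slow machine running a Θ(n) algorithm.

```python
def cray_time(n):       # fast hardware, cubic algorithm
    return 3 * n**3

def trs80_time(n):      # slow hardware, linear algorithm
    return 19_500_000 * n

# Small inputs favor the fast machine...
assert cray_time(100) < trs80_time(100)
# ...but for large enough n, order wins out.
assert cray_time(10_000) > trs80_time(10_000)

# Crossover: 3n^3 = 19,500,000n  =>  n = sqrt(6.5e6), about 2550.
crossover = next(n for n in range(1, 10_000) if cray_time(n) > trs80_time(n))
print(crossover)
```

This is why the table on the next slide shows the two machines tied near n = 2500.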

Recall: ORDER WINS OUT (Example from Baase's Algorithms Text) The TRS-80 Main language support: BASIC - typically a slow running interpreted language For more details on TRS-80 see: http://mate.kjsl.com/trs80/ The CRAY-YMP Language used in example: FORTRAN - a fast running language For more details on CRAY-YMP see: http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3 12

Running times (microsecond, abbr μsec: one-millionth of a second; millisecond, abbr msec: one-thousandth of a second):

n        | CRAY-YMP with FORTRAN, 3n^3 | TRS-80 with BASIC, 19,500,000n
10       | 3 microsec                  | 200 millisec
100      | 3 millisec                  | 2 sec
1000     | 3 sec                       | 20 sec
2500     | 50 sec                      | 50 sec
10000    | 49 min                      | 3.2 min
1000000  | 95 years                    | 5.4 hours
13

Interconnection Networks Uses of interconnection networks Connect processors to shared memory Connect processors to each other Interconnection media types Shared medium Switched medium

Different interconnection networks define different parallel machines. The interconnection network's properties influence the type of algorithm used for various machines, as they affect how data is routed. 14 Shared versus Switched Media 15 Shared Medium Allows only one message at a time

Messages are broadcast Each processor listens to every message Before sending a message, a processor listens until the medium is unused Collisions require resending of messages Ethernet is an example 16 Switched Medium Supports point-to-point messages between pairs of processors Each processor is connected to one switch Advantages over shared media Allows multiple messages to be sent simultaneously Allows scaling of the network to accommodate an increase in processors

17 Switch Network Topologies View switched network as a graph Vertices = processors or switches Edges = communication paths Two kinds of topologies Direct Indirect 18 Direct Topology Ratio of switch nodes to processor nodes is 1:1 Every switch node is connected to 1 processor node

At least 1 other switch node Indirect Topology Ratio of switch nodes to processor nodes is greater than 1:1 Some switches simply connect to other switches 19 Terminology for Evaluating Switch Topologies We need to evaluate 4 characteristics of a network in order to help us understand their effectiveness in implementing efficient parallel algorithms on a machine with a given network. These are

The diameter The bisection width The edges per node The constant edge length We'll define these and see how they affect algorithm choice. Then we will investigate several different topologies and see how these characteristics are evaluated. 20 Terminology for Evaluating Switch Topologies Diameter Largest distance between two

switch nodes. Low diameter is good It puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes. 21 Terminology for Evaluating Switch Topologies Bisection width The minimum number of edges between switch nodes that must be removed in order to divide the network into two halves (within 1 node, if the number of processors is odd). High bisection width is good. In algorithms requiring large amounts of data

movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the algorithm. Actually proving what the bisection width of a network is can be quite difficult. 22 Terminology for Evaluating Switch Topologies Number of edges / node It is best if the number of edges per node is a constant independent of network size, as that allows the system to scale to a larger number of nodes. Degree is the maximum number of edges per node. Constant edge length? (yes/no) Again, for scalability, it is best if the nodes and edges can be laid out in 3-D space so that the maximum

edge length is a constant independent of network size. 23 Evaluating Switch Topologies Many have been proposed and analyzed. We will consider several well known ones: 2-D mesh linear network binary tree

hypertree butterfly hypercube shuffle-exchange Those in yellow have been used in commercial parallel computers. 24 2-D Meshes Note: Circles represent switches and squares represent processors in all these slides. 25 2-D Mesh Network Direct topology

Switches arranged into a 2-D lattice or grid Communication allowed only between neighboring switches Torus: Variant that includes wraparound connections between switches on the edge of the mesh 26 Evaluating 2-D Meshes (assumes the mesh is square) n = number of processors Diameter: Θ(n^(1/2)) Places a lower bound on algorithms that require processing with arbitrary nodes sharing data. Bisection width: Θ(n^(1/2))

Places a lower bound on algorithms that require distribution of data to all nodes. Max number of edges per switch: 4 (note: this is the degree) Constant edge length? Yes Does this scale well? Yes 27 Linear Network Switches arranged into a 1-D mesh Corresponds to a row or column of a 2-D mesh Ring: A variant that adds a wraparound connection between the switches on the ends. The linear and ring networks have many applications

Essentially supports a pipeline in both directions Although these networks are very simple, they support many optimal algorithms. 28 Evaluating Linear and Ring Networks Diameter Linear: n − 1, or Θ(n) Ring: n/2, or Θ(n) Bisection width: Linear: 1, or Θ(1)

Ring: 2, or Θ(1) Degree for switches: 2 Constant edge length? Yes Does this scale well? Yes 29 Binary Tree Network Indirect topology n = 2^d processor nodes and 2n − 1 switches, where d = 0, 1, ... is the number of levels, e.g. 2³ = 8 processors on the bottom and 2(8) − 1 = 15 switches

30 Evaluating Binary Tree Network Diameter: 2 log n Note: this is small Bisection width: 1, the lowest possible number Degree: 3 Constant edge length? No Does this scale well? No 31 Hypertree Network (of degree 4 and

depth 2) (a) Front view: 4-ary tree of height 2 (b) Side view: upside-down binary tree of height d (c) Complete network 32 Hypertree Network Indirect topology Note: the degree k and the depth d must be specified. This gives from the front a k-ary tree of height d. From the side, the same network looks like an upside-down binary tree of height d. Joining the front and side views yields the complete network.

33 Evaluating 4-ary Hypertree with n = 16 processors Diameter: log n, sharing the low diameter of the binary tree Bisection width: n/2 Large value - much better than the binary tree Edges / node: 6 Constant edge length? No
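The tree comparison above can be tabulated in a minimal sketch. The formulas below are the slides' stated values (not derivations), encoded for n = 16 processors:

```python
import math

# Metrics as claimed on the slides (n = number of processors).
def binary_tree(n):
    return {"diameter": 2 * math.log2(n), "bisection_width": 1, "max_edges_per_node": 3}

def hypertree_4ary(n):
    return {"diameter": math.log2(n), "bisection_width": n // 2, "max_edges_per_node": 6}

bt, ht = binary_tree(16), hypertree_4ary(16)
assert bt["diameter"] == 8 and bt["bisection_width"] == 1
assert ht["diameter"] == 4 and ht["bisection_width"] == 8   # much better than the tree
```

The hypertree keeps a tree-like diameter while fixing the binary tree's worst weakness, its bisection width of 1.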

34 Butterfly Network Indirect topology n = 2^d processor nodes connected by n(log n + 1) switching nodes As complicated as this switching network appears to be, it is really quite simple, as it admits a very nice routing algorithm! Note: The bottom row of switches is normally identical with the top row. A 2³ = 8 processor

butterfly network with 8 × 4 = 32 switching nodes. [Figure: processors 0-7 under ranks 0-3 of switches, labeled (rank, column) from (0,0) through (3,7).]

35 The rows are called ranks. Building the 2³ Butterfly Network There are 8 processors. Have 4 ranks (i.e. rows) with 8 switches per rank. Connections: Node(i,j), for i > 0, is connected to two nodes on rank i−1, namely node(i−1, j) and node(i−1, m), where m is the integer found by inverting the i-th most significant bit in the binary d-bit representation of j. For example, suppose i = 2 and j = 3. Then node(2,3) is connected to node(1,3). To get the other connection, 3 = 011 in binary; flipping the 2nd most significant bit gives 001, so node(2,3) is also connected to node(1,1). NOTE: There is an error on pg 32 on this example.
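The connection rule above can be sketched with a hypothetical helper (`up_links` is illustrative, not from the text): flipping the i-th most significant of j's d bits is an XOR with the bit at position d − i.

```python
# node(i, j), i > 0, connects upward to node(i-1, j) and node(i-1, m),
# where m flips the i-th most significant bit of j's d-bit representation.
def up_links(i, j, d):
    m = j ^ (1 << (d - i))   # flip the i-th most significant bit
    return (i - 1, j), (i - 1, m)

# The slide's example: node(2, 3) with d = 3; 3 = 011, and flipping the
# 2nd most significant bit gives 001 = 1.
assert up_links(2, 3, 3) == ((1, 3), (1, 1))
```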

36 Why It Is Called a Butterfly Network Walk cycles such as node(i,j), node(i-1,j), node(i,m), node(i-1,m), node(i,j) where m is determined by the bit flipping as shown and you see a butterfly: 37 Butterfly Network Routing Send message from processor 2 to processor 5. Algorithm: 0 means ship left; 1 means ship right. 1) 5 = 101. Pluck off

leftmost bit 1 and send "01" + msg to the right. 2) Pluck off leftmost bit 0 and send "1" + msg to the left. 3) Pluck off leftmost bit 1 and send msg to the right. 38

Evaluating the Butterfly Network Diameter: log n Bisection width: n/2 Edges per node: 4 (even for d ≥ 3) Constant edge length? No; the edge lengths grow exponentially as the rank number decreases. [Figure: the 8-processor butterfly network with switch labels (rank, column).] 39

Hypercube (or binary n-cube) n = 2^d processors and n switch nodes. A hypercube is a butterfly with the columns of switch nodes collapsed into a single node. [Figure: a 4-dimensional hypercube with 4-bit binary node addresses.] 40

Hypercube (or binary n-cube) n = 2^d processors and n switch nodes Direct topology 2 × 2 × … × 2 mesh Number of nodes is a power of 2 Node addresses 0, 1, …, 2^k − 1 Node i is connected to the k nodes whose addresses differ from i in exactly one bit position. Example: node 0111 is connected to 1111, 0011, 0101, and 0110. 41

Growing a Hypercube Note: For d = 4, it is a 4-dimensional cube. 42
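The hypercube neighbor rule can be sketched directly (a minimal sketch; `hypercube_neighbors` is an illustrative name):

```python
# In a hypercube with 2**k nodes, node i's neighbors are the k addresses
# that differ from i in exactly one bit position.
def hypercube_neighbors(i, k):
    return sorted(i ^ (1 << b) for b in range(k))

# The slides' example: node 0111 in a 4-cube.
assert hypercube_neighbors(0b0111, 4) == sorted([0b1111, 0b0011, 0b0101, 0b0110])
```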

Evaluating Hypercube Network Diameter: log n Bisection width: n/2 Edges per node: log n Constant edge length? No. The length of the longest edge increases as n increases. 43

Routing on the Hypercube Network Example: Send a message from node 2 = 0010 to node 5 = 0101. The nodes differ in 3 bits, so the shortest path will be of length 3. One path is 0010 → 0110 → 0100 → 0101, obtained by flipping one of the differing bits at each step. As with the butterfly network, bit flipping helps you route on this network. [Figure: the 4-dimensional hypercube with the route highlighted.]
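The bit-flipping route can be sketched as follows. Any order of flipping the differing bits gives a shortest path, so this sketch (which flips low-order bits first) produces a different, equally short route than the slide's example:

```python
# Flip one differing bit per hop until the source address equals the destination.
def hypercube_route(src, dst, k):
    path = [src]
    for b in range(k):
        if (src ^ dst) & (1 << b):
            src ^= 1 << b
            path.append(src)
    return path

path = hypercube_route(0b0010, 0b0101, 4)
assert len(path) - 1 == bin(0b0010 ^ 0b0101).count("1")  # shortest: 3 hops
assert path[0] == 0b0010 and path[-1] == 0b0101
```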

44 A Perfect Shuffle A permutation produced as follows is called a perfect shuffle: Given a power-of-2 number of cards, numbered 0, 1, 2, ..., 2^d − 1, write each card number with d bits. Left rotating the bits with a wrap gives the position of the card after the perfect shuffle. Example: For d = 3, card 5 = 101. Left rotating and wrapping gives 011, so card 5 goes to position 3. Note that card 0 = 000 and card 7 = 111 stay in position. 45 Shuffle-exchange Network Illustrated
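The perfect shuffle above can be sketched as a left rotation of a d-bit number (this rotation is the LeftCycle operation used by the shuffle links on the next slide):

```python
# Perfect shuffle: left-rotate the d-bit card number, wrapping the top bit.
def shuffle(card, d):
    mask = (1 << d) - 1
    return ((card << 1) | (card >> (d - 1))) & mask

# Slide example, d = 3: card 5 = 101 -> 011 = position 3;
# cards 0 = 000 and 7 = 111 stay put.
assert shuffle(5, 3) == 3
assert shuffle(0, 3) == 0 and shuffle(7, 3) == 7
```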

[Figure: nodes 0 through 7 connected by shuffle and exchange links.] Direct topology Number of nodes is a power of 2 Nodes have addresses 0, 1, …, 2^d − 1 Two outgoing links from node i: Shuffle link to node LeftCycle(i) Exchange link between node i and node i+1 when i is even 46

Shuffle-exchange Addressing 16 processors [Figure: nodes addressed 0000 through 1111.] No arrows on a line segment means it is bidirectional. Otherwise, you must follow the arrows. Devising a routing algorithm for this network is interesting and will be a homework problem. 47 Evaluating the Shuffle-exchange Diameter: 2 log n − 1 Bisection width: n / log n

Edges per node: 3 Constant edge length? No 48

Two Problems with Shuffle-Exchange Shuffle-Exchange does not expand well A large shuffle-exchange network does not decompose well into smaller separate shuffle-exchange networks. In a large shuffle-exchange network, a small percentage of nodes will be hot spots They will encounter much heavier traffic The above results are in the dissertation of one of Batcher's students. 49
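The topology metrics from the preceding slides can be gathered into one table. This is a minimal sketch using the slides' claimed formulas (the exact 2-D mesh values, 2(√n − 1) and √n for a square mesh, are standard but are an assumption here):

```python
import math

# (diameter, bisection width, max edges per node) for n processors,
# per the formulas claimed on the slides.
def metrics(n):
    lg = math.log2(n)
    return {
        "2-D mesh":         (2 * (math.isqrt(n) - 1), math.isqrt(n), 4),
        "binary tree":      (2 * lg, 1, 3),
        "4-ary hypertree":  (lg, n // 2, 6),
        "butterfly":        (lg, n // 2, 4),
        "hypercube":        (lg, n // 2, lg),
        "shuffle-exchange": (2 * lg - 1, n / lg, 3),
    }

for name, (diam, bisect, edges) in metrics(16).items():
    print(f"{name:16s} diameter={diam:5} bisection={bisect:5} edges/node={edges}")
```

Printing the table for n = 16 makes the trade-offs on the next slide concrete: only the hypercube combines log-diameter and n/2 bisection, but its edges per node grow with n.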

Comparing Networks All have logarithmic diameter except the 2-D mesh Hypertree, butterfly, and hypercube have bisection width n/2 All have a constant number of edges per node except the hypercube Only the 2-D mesh, linear, and ring topologies keep edge lengths constant as network size increases Shuffle-exchange is a good compromise: a fixed number of edges per node, low diameter, and good bisection width. However, the negative results on the preceding slide also need to be considered. 50 Alternate Names for SIMDs Recall that all active processors of a SIMD computer must simultaneously access the same

memory location. The value in the i-th processor can be viewed as the i-th component of a vector. SIMD machines are sometimes called vector computers [Jordan,et.al.] or processor arrays [Quinn 94,04] based on their ability to execute vector and matrix operations efficiently. 51 SIMD Computers SIMD computers that focus on vector operations Support some vector and possibly matrix operations in hardware Usually limit or provide less support for nonvector type operations involving data in the vector components. General purpose SIMD computers

Support more traditional type operations (e.g., other than for vector/matrix data types). Usually also provide some vector and possibly matrix operations in hardware. 52 Pipelined Architectures Pipelined architectures are sometimes considered to be SIMD architectures See pg 37 of Textbook & pg 8-9 Jordan et. al. Vector components are entered successively into first processor in pipeline. The i-th processor of the pipeline receives the output from the (i-1)th processor. Normal operations in each processor are much larger (coarser) in pipelined computers than in true SIMDs

Pipelined architectures are somewhat SIMD in nature in that synchronization is not required. 53 Why Processor Arrays? Historically, high cost of control units Scientific applications have data parallelism 54 Data/instruction Storage Front end computer Also called the control unit Holds and runs program Data manipulated sequentially Processor array Data manipulated in parallel

55 Processor Array Performance Performance: work done per time unit Performance of processor array Speed of processing elements Utilization of processing elements 56

Performance Example 1 1024 processors Each adds a pair of integers in 1 μsec (1 microsecond, i.e. one millionth of a second, or 10^-6 second). What is the performance when adding two 1024-element vectors (one element pair per processor)?

Performance = 1024 operations / 10^-6 sec = 1.024 × 10^9 ops/sec 57

Performance Example 2 512 processors Each adds two integers in 1 μsec What is the performance when adding two vectors of length 600?
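Both examples reduce to the same calculation, sketched below under the slides' assumption of one 1 μsec addition step per processor (the elapsed time is set by the busiest processor):

```python
# Performance of a processor array adding two vectors, at 1 usec per step.
def performance(vector_len, num_procs):
    steps = -(-vector_len // num_procs)   # ceiling division: steps per processor
    return vector_len / steps * 1e6       # additions per second

assert performance(1024, 1024) == 1.024e9   # Example 1: 1.024 x 10^9 ops/sec
assert performance(600, 512) == 3.0e8       # Example 2: 2 steps -> 3 x 10^8 ops/sec
```

Note how Example 2's utilization drop (600 elements on 512 processors forces a second step) cuts performance well below the hardware peak.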

Since 600 > 512, 88 processors must add two pairs of integers while the other 424 processors add only a single pair. 58 Example of a 2-D Processor Interconnection Network in a Processor Array Each VLSI chip has 16 processing elements. Each PE can simultaneously send a value to a neighbor. PE = processor element 59 SIMD Execution Style

The traditional (SIMD, vector, processor array) execution style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]): The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit. The front end is a general purpose CPU that stores the program and the data that is not manipulated in parallel. The front end normally executes the sequential portions of the program. Each processing element has a local memory that cannot be directly accessed by the host or other processing elements. 60 SIMD Execution Style Collectively, the individual memories of the

processing elements (PEs) store the (vector) data that is processed in parallel. When the front end encounters an instruction whose operand is a vector, it issues a command to the PEs to perform the instruction in parallel. Although the PEs execute in parallel, some units can be allowed to skip any particular instruction. 61 Masking on Processor Arrays All the processors work in lockstep except those that are masked out (by setting a mask register). The conditional if-then-else is different for processor arrays than the sequential version Every active processor tests to see if its data meets the negation of the boolean condition.

If it does, it sets its mask bit so those processors will not participate in the operation initially. Next the unmasked processors execute the THEN part. Afterwards, mask bits (for the original set of active processors) are flipped and the unmasked processors perform the ELSE part. 62 if (COND) then A else B 63 if (COND) then A else B 64 if (COND) then A else B
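The masked if-then-else described above can be sketched in a few lines (a minimal simulation; the function and operation names are illustrative, with one data element per PE):

```python
# Simulate masked SIMD execution of: if (COND) then A else B.
# Masked-out PEs sit idle while the others execute in lockstep.
def simd_if_then_else(data, cond, then_op, else_op):
    mask = [cond(x) for x in data]                                  # every active PE tests COND
    data = [then_op(x) if m else x for x, m in zip(data, mask)]     # THEN part (others idle)
    data = [x if m else else_op(x) for x, m in zip(data, mask)]     # flip mask, ELSE part
    return data

# if (x is even) then x // 2 else 3x + 1, across 4 PEs
result = simd_if_then_else([1, 2, 3, 4], lambda x: x % 2 == 0,
                           lambda x: x // 2, lambda x: 3 * x + 1)
assert result == [4, 1, 10, 2]
```

Note that both passes take time even when only some PEs are active, which is exactly the conditional-branch cost raised in the SIMD shortcomings later in this chapter.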

65 SIMD Machines An early SIMD computer designed for vector and matrix processing was the Illiac IV computer built at the University of Illinois See Jordan et. al., pg 7 The MPP, DAP, the Connection Machines CM-1 and CM-2, and the MasPar MP-1 and MP-2 are examples of SIMD computers See Akl pg 8-12 and [Quinn, 94] The CRAY-1 and the Cyber-205 use pipelined arithmetic units to support vector operations and are sometimes called pipelined SIMDs See [Jordan, et al, p7], [Quinn 94, pg 61-2], and

[Quinn 2004, pg 37]. 66 SIMD Machines Quinn [1994, pg 63-67] discusses the CM-2 Connection Machine and the smaller, updated CM-200. Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO ASPRO is a small second-generation STARAN used by the Navy in its spy planes. Professor Batcher is best known architecturally for the MPP, which belongs to the Smithsonian Institution and is currently displayed at a D.C. airport. 67

Today's SIMDs Many SIMDs are being embedded in SISD machines. Others are being built as part of hybrid architectures. Others are being built as special purpose machines, although some of them could classify as general purpose. Much of the recent work with SIMD architectures is proprietary. 68 A Company Building Inexpensive SIMD WorldScape is producing a COTS (commodity off the shelf) SIMD Not a traditional SIMD as the hardware doesn't synchronize every step.

Hardware design supports efficient synchronization Their machine is programmed like a SIMD. The U.S. Navy has observed that their machines process radar an order of magnitude faster than others. There is quite a bit of information about their work at http://www.wscape.com 69 An Example of a Hybrid SIMD Embedded Massively Parallel Accelerators Systola 1024: PC add-on board with 1024 processors Fuzion 150: 1536 processors on a single chip Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan (This and the next three slides are due to Prabhakar R. Gudla (U of Maryland) at a CMSC 838T Presentation, 4/23/2003.)

70 Hybrid Architecture [Figure: sixteen Systola 1024 boards connected through a high-speed Myrinet switch.] Combines the SIMD and MIMD paradigms within one parallel architecture: a hybrid computer. 71

Architecture of Systola 1024 Instruction Systolic Array: 32 × 32 mesh of processing elements, wavefront instruction execution [Figure: RAM NORTH and RAM WEST banks, program memory, controller, and ISA interface on the host computer bus.] 72

SIMDs Embedded in SISDs

Intel's Pentium 4 includes what they call MMX technology to gain a significant performance boost IBM and Motorola incorporated the technology into their G4 PowerPC chip in what they call their Velocity Engine. Both MMX technology and the Velocity Engine are the chip manufacturers' names for their proprietary SIMD processors and parallel extensions to their operating code. This same approach is used by NVidia and Evans & Sutherland to dramatically accelerate graphics rendering. 73 Special Purpose SIMDs in the Bioinformatics Arena Paracel

Acquired by Celera Genomics in 2000 Products include the sequence supercomputer GeneMatcher, which has a high-throughput sequence analysis capability Supports over a million processors GeneMatcher was used by Celera in their race with the U.S. government to complete the description of the human genome sequencing TimeLogic, Inc. Has DeCypher, a reconfigurable SIMD 74 Advantages of SIMDs Reference: [Roosta, pg 10] Less hardware than MIMDs as they have only one control unit. Control units are complex.

Less memory needed than MIMD Only one copy of the instructions needs to be stored This allows more data to be stored in memory. Less startup time in communicating between PEs. 75 Advantages of SIMDs Single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, & debug. Similar to sequential programming Control flow operations and scalar operations can be executed on the control unit while PEs are executing other instructions.

MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead. 76 Advantages of SIMDs During a communication operation between PEs, PEs send data to a neighboring PE in parallel and in lock step No need to create a header with routing information as routing is determined by program steps. the entire communication operation is executed synchronously A tight (worst case) upper bound for the time for this operation can be computed. Less complex hardware in SIMD since no

message decoder is needed in PEs MIMDs need a message decoder in each PE. 77 SIMD Shortcomings (with some rebuttals) Claims are from our textbook [i.e., Quinn 2004]. Similar statements are found in [Grama, et. al]. Claim 1: Not all problems are data-parallel While true, most problems seem to have data parallel solutions. In [Fox, et.al.], the observation was made in their study of large parallel applications that most were data parallel by nature, but often had points where significant branching occurred. 78

SIMD Shortcomings (with some rebuttals) Claim 2: Speed drops for conditionally executed branches Processors in both MIMD & SIMD normally have to do a significant amount of condition testing MIMD processors can execute multiple branches concurrently. For an if-then-else statement with execution times for the then and else parts being roughly equal, about half of the SIMD processors are idle during its execution With additional branching, the average number of inactive processors can become even higher. With SIMDs, only one of these branches can be executed at a time. This reason justifies the study of multiple SIMDs (or MSIMDs). 79

SIMD Shortcomings (with some rebuttals) Claim 2 (cont): Speed drops for conditionally executed code In [Fox, et.al.], the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8. The cost of the extremely simple processors used in a SIMD is extremely low Programmers used to worry about full utilization of memory but stopped doing so after memory cost became insignificant overall. 80 SIMD Shortcomings (with some rebuttals)

Claim 3: Don't adapt to multiple users well. This is true to some degree for all parallel computers. If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by sharing it This reason also justifies the study of multiple SIMDs (or MSIMDs). SIMD architecture has not received the attention that MIMD has received and can greatly benefit from further research. 81 SIMD Shortcomings (with some rebuttals) Claim 4: Do not scale down well to starter systems that are affordable. This point is arguable and its truth is likely to vary rapidly over time

WorldScape/ClearSpeed currently sells a very economical SIMD board that plugs into a PC. 82 SIMD Shortcomings (with some rebuttals) Claim 5: Requires customized VLSI for processors, while the expense of commodity control units has dropped Reliance on COTS (commodity, off-the-shelf parts) has dropped the price of MIMDs Expense of PCs (with control units) has dropped significantly However, reliance on COTS has fueled the success of the low-level parallelism provided by clusters and restricted new innovative parallel architecture research for well over a decade. 83

SIMD Shortcomings (with some rebuttals) Claim 5 (cont.) There is strong evidence that the period of continual dramatic increases in speed of PCs and clusters is ending. Continued rapid increases in parallel performance in the future will be necessary in order to solve important problems that are beyond our current capabilities Additionally, with the appearance of the very economical COTS SIMDs, this claim no longer appears to be relevant. 84 Multiprocessors Multiprocessor: multiple-CPU computer

with a shared memory Same address on two different CPUs refers to the same memory location Avoids three cited problems for SIMDs Can be built from commodity CPUs Naturally support multiple users Maintain efficiency in conditional code 85 Centralized Multiprocessor 86 Centralized Multiprocessor

Straightforward extension of uniprocessor Add CPUs to bus All processors share same primary memory Memory access time same for all CPUs Uniform memory access (UMA) multiprocessor Also called a symmetrical multiprocessor (SMP) 87 Private and Shared Data Private data: items used only by a single processor Shared data: values used by multiple processors In a centralized multiprocessor (i.e. SMP), processors communicate via shared data

values 88 Problems Associated with Shared Data The cache coherence problem Replicating data across multiple caches reduces contention among processors for shared data values. But how can we ensure different processors have the same value for the same address? The cache coherence problem arises when an obsolete value is still stored in a processor's cache. 89 Write Invalidate Protocol Most common solution to cache coherency

1. Each CPU's cache controller monitors (snoops) the bus & identifies which cache blocks are requested by other CPUs. 2. A PE gains exclusive control of a data item before performing a write. 3. Before the write occurs, all other copies of the data item cached by other PEs are invalidated. 4. When any other CPU tries to read a memory location from an invalidated cache block, a cache miss occurs It has to retrieve the updated data from memory 90 Cache-coherence Problem

[Figure sequence, slides 91-94: memory holds X = 7; CPU A and CPU B each have a cache.] A read from memory is not a problem: CPU A and CPU B can each cache X = 7. A write to main memory is the problem: if CPU A writes X = 2, CPU B's cache still holds the obsolete value 7. 94

Write Invalidate Protocol [Figure sequence, slides 95-98.] A cache controller monitors (snoops) the bus to see which cache block is being requested by other processors. Before a write can occur, all copies of the data at that address in other caches are declared invalid. When another processor then tries to read this location from its cache, it receives a cache miss and has to refresh the value from main memory.
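The write-invalidate behavior can be sketched as a toy simulation of one shared address (the class and method names are illustrative, not a real protocol specification):

```python
# Toy write-invalidate: per-CPU caches of a single shared address X,
# with writes invalidating every other cached copy.
class Bus:
    def __init__(self, value):
        self.memory = value
        self.caches = {}                 # cpu -> cached value (valid copies only)

    def read(self, cpu):
        if cpu not in self.caches:       # cache miss: fetch from memory
            self.caches[cpu] = self.memory
        return self.caches[cpu]

    def write(self, cpu, value):
        # Gain exclusive control: invalidate every other cached copy.
        self.caches = {cpu: value}
        self.memory = value

bus = Bus(7)
assert bus.read("A") == 7 and bus.read("B") == 7   # both cache X = 7
bus.write("A", 2)                                  # B's copy is invalidated
assert bus.read("B") == 2                          # miss, refreshed from memory
```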

98 Synchronization Required for Shared Data Mutual exclusion Definition: At most one process can be engaged in an activity at any time. Example: Only one processor can write to the same address in main memory at the same time. We say that the process must mutually exclude all others while it performs this write. Barrier synchronization Definition: Guarantees that no process will proceed beyond a designated point (called the barrier) until every process reaches that point. 99

Distributed Multiprocessor Distributes primary memory among processors Increase aggregate memory bandwidth and lower average memory access time Allows greater number of processors Also called non-uniform memory access (NUMA) multiprocessor Local memory access time is fast Non-local memory access time can vary Distributed memories have one logical address space 100 Distributed Multiprocessors 101 Cache Coherence

Some NUMA multiprocessors do not support cache coherence in hardware Only instructions and private data are stored in cache This policy creates a large memory access time variance Implementation is more difficult No shared memory bus to snoop Directory-based protocol needed 102 Directory-based Protocol Distributed directory contains information about cacheable memory blocks One directory entry for each cache block Each entry has Sharing status

Which processors have copies 103 Sharing Status Uncached (denoted by U) Block not in any processor's cache Shared (denoted by S) Cached by one or more processors Read only Exclusive (denoted by E) Cached by exactly one processor Processor has written block Copy in memory is obsolete 104
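The sharing states and presence bits can be sketched as a tiny directory-entry class. The names are invented for illustration, and write-backs and network messages are omitted; only the status/bit-vector bookkeeping is shown:

```python
# Directory entry: sharing status (U/S/E) plus one presence bit per CPU.

class DirectoryEntry:
    def __init__(self, n_cpus):
        self.status = 'U'             # U = uncached, S = shared, E = exclusive
        self.bits = [0] * n_cpus      # which processors hold a cached copy

    def read(self, cpu):
        # A read leaves the block shared (the write-back from the old owner,
        # needed when the status is E, is omitted in this sketch).
        self.status = 'S'
        self.bits[cpu] = 1

    def write(self, cpu):
        # A write invalidates all other copies: exactly one presence bit set.
        self.status = 'E'
        self.bits = [1 if i == cpu else 0 for i in range(len(self.bits))]

entry = DirectoryEntry(3)
entry.read(0);  print(entry.status, entry.bits)   # S [1, 0, 0]
entry.read(2);  print(entry.status, entry.bits)   # S [1, 0, 1]
entry.write(0); print(entry.status, entry.bits)   # E [1, 0, 0]
```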

Directory-based Protocol: Example (slides 105-130) Three CPUs (0, 1, 2), each with a cache and a local memory, are connected by an interconnection network; each memory has a directory. X resides in one of the local memories, and its directory entry is shown as a sharing status plus a presence-bit vector.
Initially X = 7 in memory; the directory entry for X is U000 (uncached).
CPU 0 reads X (slides 107-109): a read miss goes to X's directory; the entry becomes S100 and X = 7 is loaded into CPU 0's cache.
CPU 2 reads X (slides 110-112): another read miss; the entry becomes S101 and CPU 2 also caches X = 7.
CPU 0 writes 6 to X (slides 113-115): a write miss goes to the directory, which invalidates CPU 2's copy; the entry becomes E100 and CPU 0's cache holds X = 6 (the copy in memory, still 7, is now obsolete).
CPU 1 reads X (slides 116-119): a read miss; the directory switches the block to shared, CPU 0 writes X = 6 back to memory, the entry becomes S110, and both CPU 0 and CPU 1 cache X = 6.
CPU 2 writes 5 to X (slides 120-122): a write miss; the directory invalidates the copies at CPU 0 and CPU 1; the entry becomes E001 and CPU 2 caches X = 5 (memory, still 6, is again obsolete).
CPU 0 writes 4 to X (slides 123-128): a write miss; the directory takes the block away from CPU 2, which writes X = 5 back to memory; the entry becomes E100, cache block storage for X is created at CPU 0, and CPU 0 writes X = 4 into its cache.
CPU 0 writes back the X block (slide 129): the data write-back updates memory from 5 to 4.
CPU 0 flushes cache block X (slide 130): the entry returns to U000 and memory holds X = 4.
130 Characteristics of Multiprocessors Interprocessor communication is done in the memory interface by read and write instructions. Memory may be physically distributed; reads and writes from different processors may take different amounts of time, and congestion of the interconnection network may occur. Memory latency (i.e., the time to complete a read or write) may be long and variable. Most messages through the bus or interconnection network are the size of single memory words. Randomization of requests may be used to

reduce the probability of collisions. 131 Multicomputers Distributed memory multiple-CPU computer Same address on different processors refers to different physical memory locations Processors interact through message passing 132 Typically, Two Flavors of Multicomputers Commercial multicomputers Custom switch network

Low latency (little delay between issuing a request and receiving the response). High communication bandwidth across processors Commodity clusters Mass-produced computers, switches, and other equipment Use low-cost components Message latency is higher Communications bandwidth is lower 133 Multicomputer Communication Processors are connected by an interconnection network Each processor has a local memory and can only access its own local memory Data is passed between processors using

messages, as dictated by the program Data movement across the network is also asynchronous A common approach is to use MPI to handle message passing 134 Multicomputer Communications (cont) Multicomputers can be scaled to larger sizes much more easily than multiprocessors. The amount of data transmitted between processors has a huge impact on performance The distribution of the data among the processors is a very important factor in performance efficiency. 135
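MPI is the usual tool here; the send/receive structure can be sketched in pure Python with queues standing in for the interconnection network (a simulation of the programming model, not of real network behavior):

```python
from queue import Queue

# One inbox per "processor"; a send delivers data into the receiver's inbox.
NPROCS = 2
inbox = [Queue() for _ in range(NPROCS)]

def send(dest, data):
    inbox[dest].put(data)       # message crosses the "network"

def recv(rank):
    return inbox[rank].get()    # blocks until a message arrives

# Processor 0 sends a partial result; processor 1 combines it with its own.
send(1, {'partial_sum': 10})
local = 32
msg = recv(1)
total = local + msg['partial_sum']
print(total)   # 42
```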

Message-Passing Advantages No problem with simultaneous access to data. Allows different PEs to operate on the same data independently. Allows PEs on a network to be easily upgraded when faster processors become available. 136 Disadvantages of Message-Passing Programmers must make explicit message-passing calls in the code. This is low-level programming and is error-prone. Data is not shared but copied, which increases the total data size.

Data Integrity Difficulty in maintaining the correctness of multiple copies of a data item. 137 Some Interconnection Network Terminology (1/2) References: Wilkinson et al. & Grama et al. Also, earlier slides on architecture & networks. A link is the connection between two nodes. A switch that enables packets to be routed through the node to other nodes without disturbing the processor is assumed. The link between two nodes can be either bidirectional or use two directional links. Either one wire to carry one bit or parallel wires (one wire for each bit in a word) can be used. The above choices do not have a major impact

on the concepts presented in this course. 138 Network Terminology (2/2) The bandwidth is the number of bits that can be transmitted in unit time (i.e., bits per second). The network latency is the time required to transfer a message through the network. The communication latency is the total time required to send a message, including software overhead and interface delay. The message latency or startup time is the time required to send a zero-length message. Includes software & hardware overhead, such as choosing a route and packing and unpacking the message. 139
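These terms combine in the standard first-order model: transfer time ≈ startup (message) latency + message length / bandwidth. A quick check of the arithmetic, with made-up network numbers:

```python
def transfer_time(startup_s, length_bits, bandwidth_bps):
    """First-order model: startup latency plus transmission time L/B."""
    return startup_s + length_bits / bandwidth_bps

# Hypothetical network: 50 microsecond startup, 1 Gbit/s bandwidth.
t_small = transfer_time(50e-6, 8 * 100, 1e9)        # 100-byte message
t_large = transfer_time(50e-6, 8 * 1_000_000, 1e9)  # 1 MB message

print(f"{t_small * 1e6:.1f} us")  # dominated by startup: 50.8 us
print(f"{t_large * 1e3:.2f} ms")  # dominated by L/B: 8.05 ms
```

Short messages are dominated by the startup time, long messages by the L/B term; this is why message latency and bandwidth are quoted separately.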

Circuit Switching This message-passing technique establishes a path and allows the entire message to transfer uninterrupted. Similar to a telephone connection that is held until the end of the call. Links used are not available to other messages until the transfer is complete. Latency (message transfer time): If the length of the control packet sent to establish the path is small wrt (with respect to) the message length, the latency is essentially the constant L/B, where L is the message length and B is the bandwidth. 140 Store-and-forward Packet Switching Message is divided into packets of

information. Each packet includes source and destination addresses. Packets cannot exceed a fixed, maximum size (e.g., 1000 bytes). A packet is stored in a node in a buffer until it can move to the next node. 141 Packet Switching (cont) At each node, the destination information is examined and used to select which node to forward the packet to. Routing algorithms (often probabilistic) are used to avoid hot spots and to minimize traffic jams. Significant latency is created by storing each packet in each node it reaches. Latency increases linearly with the length of the

route. 142 Virtual Cut-Through Packet Switching Used to reduce the latency. Allows a packet to pass through a node without being stored, if the outgoing link is available. If the complete path is available, a message can move immediately from source to destination. 143 Wormhole Routing An alternative to store-and-forward packet routing A message is divided into small units

called flits (flow control units). Flits are 1-2 bytes in size. Can be transferred in parallel on links with multiple wires. Only the head flit is initially transferred, when the next link becomes available. 144 Wormhole Routing (cont) As each flit moves forward, the next flit can move forward. The entire path must be reserved for a message, as these flits pull each other along (like cars of a train). Request/acknowledge bit messages are required to coordinate these pull-along moves. See Wilkinson et al.

The complete path must be reserved, as these flits are linked together. Latency: If the head flit is very small compared to the length of the message, then the latency is essentially the constant L/B, with L the message length and B the link bandwidth. 145 Deadlock Routing algorithms are needed to find a path between the nodes. Adaptive routing algorithms choose different paths, depending on traffic conditions. Livelock is a deadlock-type situation where a packet continues to travel around the network without ever reaching its destination. Deadlock: No packet can be forwarded because all are blocked by other stored packets waiting

to be forwarded. 146 Asymmetric Multicomputers Has a front-end that interacts with users and I/O devices. Processors in back end are used for computation. Similar to SIMDs (or processor arrays) Common with early multicomputers Examples of asymmetrical multicomputers given in textbook. 147 Asymmetrical MC Advantages Back-end processors dedicated to parallel

computations Easier to understand, model, tune performance Only a simple backend operating system needed Easy for a vendor to create 148 Asymmetrical MC Disadvantages Front-end computer is a single point of failure Single front-end computer limits scalability of system Primitive operating system in back-end processors makes debugging difficult

Every application requires development of both front-end and back-end program 149 Symmetric Multicomputers Every computer executes the same operating system and has identical functionality. Users may log into any computer to edit or compile their programs. Any or all computers may be involved in the execution of their program. During execution of programs, every PE executes the same program. When only one PE should execute an operation, an if statement is used to select the PE. 150
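The "if statement to select the PE" idiom is the standard SPMD pattern: every processor runs the same program text, and rank tests pick out the one that should act. Sketched here with a plain function parameter standing in for the runtime-supplied rank:

```python
def program(rank, nprocs):
    """Same program on every PE; behavior branches on rank."""
    data = list(range(rank, 100, nprocs))   # each PE works on its own slice
    partial = sum(data)
    if rank == 0:
        # Only PE 0 performs the single-PE operation (e.g., reporting).
        print("PE 0 would gather and report the partial sums here")
    return partial

# Simulate 4 PEs executing the same program.
partials = [program(r, 4) for r in range(4)]
print(sum(partials))   # 4950 == sum(range(100))
```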

Symmetric Multicomputers 151 Symmetrical MC Advantages Alleviate performance bottleneck caused by single front-end computer Better support for debugging Every processor executes same program 152 Symmetrical MC Disadvantages More difficult to maintain

illusion of a single parallel computer No simple way to balance program development workload among processors More difficult to achieve high performance when multiple processes run on each processor Details on next slide 153 Symmetric MC Disadvantages (cont.) More difficult to achieve high performance when multiple processes run on each processor Processes on the same processor compete for

same resources CPU cycles Cache space Memory bandwidth Increased cache misses Cache is PE-oriented instead of process-oriented 154 ParPar Cluster, A Mixed Model Incorporates both asymmetrical and symmetrical designs. 155 A Commodity Cluster vs

Network of Workstations A commodity cluster contains components of local area networks Commodity computers Switches A network of workstations is a dispersed collection of computers Distributed heterogeneous computers Located on primary users' desks Unused cycles available for parallel use 156 Best Model for Commodity Cluster Full-fledged operating system (e.g., Linux) desirable Feature of symmetric multicomputer Desirable to increase cache hits

Favors having only a single user process on each PE Favors most nodes being off-limits for program development Need a fast network Keep program-development users off the network; access the front-end by another path. Overall, a mixed model may be best for commodity clusters 157 Ideal Commodity Cluster Features

Co-located computers Dedicated to running parallel jobs No keyboards or displays Identical operating system Identical local disk images Administered as an entity 158 Network of Workstations Dispersed computers Typically located on users' desks

First priority: person at keyboard Parallel jobs run in background Different operating systems Different local images Checkpointing and restarting important Typically connected by Ethernet Too slow for commodity-cluster usage 159 Summary Commercial parallel computers appeared in the 1980s Multiple-CPU computers now dominate

Small-scale: Centralized multiprocessors Large-scale: Distributed memory architectures Multiprocessors Multicomputers 160
