CS 61C: Great Ideas in Computer Architecture (Machine Structures)

Single-Cycle DataPath Lecture 15 CDA 3103 07-09-2014 Review of Virtual Memory Next level in the memory hierarchy Provides illusion of very large main memory Working set of pages residing in main memory (subset of all pages residing on disk) Main goal: Avoid reaching all the way back to disk as much as possible Additional goals: Let OS share memory among many programs and protect them from each other Each process thinks it has all the memory to itself 2

Review: Paging Terminology Programs use virtual addresses (VAs) Space of all virtual addresses called virtual memory (VM) Divided into pages indexed by virtual page number (VPN) Main memory indexed by physical addresses (PAs) Space of all physical addresses called physical memory (PM) Divided into pages indexed by physical page number (PPN) 3 Review: Translation Look-Aside Buffers (TLBs) TLBs usually small, typically 128 - 256 entries Like any other cache, the TLB can be direct mapped, set associative, or fully associative VA

Processor hit PA TLB Lookup miss Translation Cachemiss hit Main Memory data

On TLB miss, get page table entry from main memory 4 Review: Memory Hierarchy Regs Instr. Operands Cache Blocks Upper Level Faster L2 Cache Blocks Last Week: Virtual Memory

{ Memory Pages Disk Files Tape Larger Lower Level 5 Review Example 1 A set-associative cache consists of 64 lines, or slots, divided into four-line sets. Main memory contains 4K blocks of 128 words each. Show the format of main memory addresses. Solution

The cache is divided into 16 sets of 4 lines each. Therefore, 4 bits are needed to identify the set number. Main memory consists of 4K = 212 blocks. Therefore, the set plus tag lengths must be 12 bits and therefore the tag length is 8 bits. Each block contains 128 words. Therefore, 7 bits are needed to specify the word. Review Example 2 A two-way set-associative cache has lines of 16 bytes and a total size of 8 kbytes. The 64-Mbyte main memory is byte addressable. Show the format of main memory addresses. Solution

There are a total of 8 kbytes/16 bytes = 512 lines in the cache. Thus the cache consists of 256 sets of 2 lines each. Therefore 8 bits are needed to identify the set number. For the 64-Mbyte main memory, a 26-bit address is needed. Main memory consists of 64-Mbyte/16 bytes = 222 blocks. Therefore, the set plus tag lengths must be 22 bits, so the tag length is 14 bits and the word field length is 4 bits. Agenda Stages of the Datapath Datapath Instruction Walkthroughs Datapath Design Dr Dan Garcia Five Components of a Computer Computer Keyboard, Mouse Devices

Memory Disk (passive) Processo Input (where r Contr programs (where ol , programs, Output data live data live when not Datapath when running) running)

Display, Printer Dr Dan Garcia The CPU Processor (CPU): the active part of the computer that does all the work (data manipulation and decision-making) Datapath: portion of the processor that contains hardware necessary to perform operations required by the processor (the brawn) Control: portion of the processor (also in hardware) that tells the datapath what needs to be done (the brain) Dr Dan Garcia Stages of the Datapath : Overview Problem: a single, atomic block that executes

an instruction (performs all necessary operations beginning with fetching the instruction) would be too bulky and inefficient Solution: break up the process of executing an instruction into stages, and then connect the stages to create the whole datapath smaller stages are easier to design easy to optimize (change) one stage without touching the others Dr Dan Garcia Five Stages of the Datapath Stage 1: Instruction Fetch Stage 2: Instruction Decode

Stage 3: ALU (Arithmetic-Logic Unit) Stage 4: Memory Access Stage 5: Register Write Dr Dan Garcia Stages of the Datapath (1/5) There is a wide variety of MIPS instructions: so what general steps do they have in common? Stage 1: Instruction Fetch no matter what the instruction, the 32-bit instruction word must first be fetched from memory (the cache-memory hierarchy) also, this is where we Increment PC (that is, PC = PC + 4, to point to the next instruction: byte addressing so + 4) Dr Dan Garcia Stages of the Datapath (2/5) Stage 2: Instruction Decode

upon fetching the instruction, we next gather data from the fields (decode all necessary instruction data) first, read the opcode to determine instruction type and field lengths second, read in data from all necessary registers for add, read two registers for addi, read one register for jal, no reads necessary Dr Dan Garcia Stages of the Datapath (3/5) Stage 3: ALU (Arithmetic-Logic Unit) the real work of most instructions is done here: arithmetic (+, -, *, /), shifting, logic (&, |), comparisons (slt) what about loads and stores? lw

$t0, 40($t1) the address we are accessing in memory = the value in $t1 PLUS the value 40 so we do this addition in this stage Dr Dan Garcia Stages of the Datapath (4/5) Stage 4: Memory Access actually only the load and store instructions do anything during this stage; the others remain idle during this stage or skip it all together since these instructions have a unique step, we need this extra stage to account for them as a result of the cache system, this stage is expected to be fast Dr Dan Garcia Stages of the Datapath (5/5) Stage 5: Register Write

most instructions write the result of some computation into a register examples: arithmetic, logical, shifts, loads, slt what about stores, branches, jumps? dont write anything into a register at the end these remain idle during this fifth stage or skip it all together Dr Dan Garcia Information encoded in binary Combinational element

Low voltage = 0, High voltage = 1 One wire per bit Multi-bit data encoded on multi-wire buses 4.2 Logic Design Conventions Logic Design Basics Operate on data Output is a function of input State (sequential) elements Store information

Chapter 4 The Processor 20 Combinational Elements AND-gate Y=A&B A B Multiplexer

A + Y=A+B Y B Y Adder Y = S ? I1 : I0 I0 I1

M u x S Arithmetic/Logic Unit Y = F(A, B) A ALU Y Y B F Chapter 4 The Processor 21

ALU used for Load/Store: F = add Branch: F = subtract R-type: F depends on funct field ALU control Function 0000 AND 0001

OR 0010 add 0110 subtract 0111 set-on-less-than 1100 NOR 4.4 A Simple Implementation Scheme

ALU Control Chapter 4 The Processor 22 Sequential Elements Register: stores data in a circuit Uses a clock signal to determine when to update the stored value Edge-triggered: update when Clk changes from 0 to 1 Clk D

Clk Q D Q Chapter 4 The Processor 23 Sequential Elements Register with write control Only updates on clock edge when write control input is 1 Used when stored value is required later

Clk D Write Clk Q Write D Q Chapter 4 The Processor 24 Clocking Methodology Combinational logic transforms data during clock cycles

Between clock edges Input from state elements, output to state element Longest delay determines clock period Chapter 4 The Processor 25 Datapath Elements that process data and addresses in the CPU

4.3 Building a Datapath Building a Datapath Registers, ALUs, muxs, memories, We will build a MIPS datapath incrementally Refining the overview design Chapter 4 The Processor 26 Instruction Fetch 32-bit register

Increment by 4 for next instruction Chapter 4 The Processor 27 R-Format Instructions Read two register operands Perform arithmetic/logical operation Write register result Chapter 4 The Processor 28 Load/Store Instructions

Read register operands Calculate address using 16-bit offset Use ALU, but sign-extend offset Load: Read memory and update register Store: Write register value to memory Chapter 4 The Processor 29 Branch Instructions

Read register operands Compare operands Use ALU, subtract and check Zero output Calculate target address Sign-extend displacement Shift left 2 places (word displacement) Add to PC + 4 Already calculated by instruction fetch

Chapter 4 The Processor 30 Branch Instructions Just re-routes wires Sign-bit wire replicated Chapter 4 The Processor 31 Composing the Elements First-cut data path does an instruction in one clock cycle

Each datapath element can only do one function at a time Hence, we need separate instruction and data memories Use multiplexers where alternate data sources are used for different instructions Chapter 4 The Processor 32 R-Type/Load/Store Datapath Chapter 4 The Processor 33 Full Datapath Chapter 4 The Processor 34

+4 1. Instruction Fetch rd rs rt ALU Data memory registers PC instruction memory

Generic Steps of Datapath imm 2. Decode/ Register Read Dr Dan Garcia 3. Execute 4. Memory 5. Register Write Datapath Walkthroughs (1/3) add $r3,$r1,$r2 # r3 = r1+r2 Dr Dan Garcia

Datapath Walkthroughs (1/3) add $r3,$r1,$r2 # r3 = r1+r2 Stage 1: fetch this instruction, increment PC Stage 2: decode to determine it is an add, then read registers $r1 and $r2 Stage 3: add the two values retrieved in Stage 2 Stage 4: idle (nothing to write to memory) Stage 5: write result of Stage 3 into register $r3 02/24/2020 Dr Dan Garcia 37 +4 reg[2] imm

Dr Dan Garcia reg[1]+ reg[2] ALU Data memory registers 3 1 2 reg[1] add r3, r1, r2

PC instruction memory Example: add Instruction Datapath Walkthroughs (2/3) slti $r3,$r1,17 # if (r1 <17 )r3 = 1 else r3 = 0 Dr Dan Garcia Datapath Walkthroughs (2/3) slti $r3,$r1,17 # if (r1 <17 )r3 = 1 else r3 = 0 Stage 1: fetch this instruction, increment PC Stage 2: decode to determine it is an slti, then read register $r1

Stage 3: compare value retrieved in Stage 2 with the integer 17 Stage 4: idle Stage 5: write the result of Stage 3 (1 if reg source was less than signed immediate, 0 otherwise) into register $r3 02/24/2020 Dr Dan Garcia 40 imm 17 Dr Dan Garcia reg[1] <17?

ALU Data memory registers +4 3 x 1 reg[1] slti r3, r1, 17 PC

instruction memory Example: slti Instruction Datapath Walkthroughs (3/3) sw $r3,17($r1) # Mem[r1+17]=r3 Dr Dan Garcia Datapath Walkthroughs (3/3) sw $r3,17($r1) # Mem[r1+17]=r3 Stage 1: fetch this instruction, increment PC Stage 2: decode to determine it is a sw, then read registers $r1 and $r3 Stage 3: add 17 to value in register $r1 (retrieved in Stage 2) to compute address Stage 4: write the value contained in register $r3 (retrieved in Stage 2) into memory address computed in Stage 3

Stage 5: idle (nothing to write into a register) Dr Dan Garcia 43 17 ALU MEM[r1+17]<=r3 imm reg[3] reg[1] +17 Data memory

registers +4 3 x 1 reg[1] SW r3, 17(r1) PC instruction memory Example: sw Instruction

Dr Dan Garcia Why Five Stages? (1/2) Could we have a different number of stages? Yes, and other architectures do So why does MIPS have five if instructions tend to idle for at least one stage? Five stages are the union of all the operations needed by all the instructions. One instruction uses all five stages: the load Dr Dan Garcia Why Five Stages? (2/2) lw $r3,17($r1) # r3=Mem[r1+17] Stage 1: fetch this instruction, increment PC Stage 2: decode to determine it is a lw, then read register $r1 Stage 3: add 17 to value in register $r1 (retrieved

in Stage 2) Stage 4: read value from memory address computed in Stage 3 Stage 5: write value read in Stage 4 into register $r3 Dr Dan Garcia 17 Dr Dan Garcia ALU MEM[r1+17] imm reg[1] +17 Data

memory registers +4 3 x 1 reg[1] LW r3, 17(r1) PC instruction memory

Example: lw Instruction Peer Instruction How many places in this diagram will need a multiplexor to select one from multiple inputs? a) 0 b) 1 c) 2 d) 3 e) 4 or more Dr Dan Garcia Peer Instruction Cant just join wires together

Use multiplexers Dr Dan Garcia Datapath and Control +4 imm opcode, funct Controller Dr Dan Garcia ALU Data memory rd

rs rt registers PC instruction memory Datapath based on data transfers required to perform instructions Controller causes the right transfers to happen What Hardware Is Needed? (1/2) PC: a register that keeps track of address of the next instruction to be fetched General Purpose Registers Used in Stages 2 (Read) and 5 (Write) MIPS has 32 of these

Memory Used in Stages 1 (Fetch) and 4 (R/W) Caches makes these stages as fast as the others (on average, otherwise multicycle stall) Dr Dan Garcia What Hardware Is Needed? (2/2) ALU Used in Stage 3 Performs all necessary functions: arithmetic, logicals, etc. Miscellaneous Registers One stage per clock cycle: Registers inserted between stages to hold intermediate data and control signals as they travel from stage to stage Note: Register is a general purpose term meaning

something that stores bits. Realize that not all registers are in the register file Dr Dan Garcia CPU Clocking (1/2) For each instruction, how do we control the flow of information though the datapath? Single Cycle CPU: All stages of an instruction completed within one long clock cycle Clock cycle sufficiently long to allow each instruction to complete all stages without interruption within one cycle 1. Instruction Fetch 2. Decode/ Register Read 3. Execute 4. Memory

Dr Dan Garcia 5. Reg. Write CPU Clocking (2/2) Alternative multiple-cycle CPU: only one stage of instruction per clock cycle Clock is made as long as the slowest stage 1. Instruction 2. Decode/ Fetch Register Read 3. Execute 4. Memory 5. Register

Write Several significant advantages over single cycle execution: Unused stages in a particular instruction can be skipped OR instructions can be pipelined (overlapped) Dr Dan Garcia Processor Design Analyze instruction set architecture (ISA) to determine datapath requirements Meaning of each instruction is given by register transfers Datapath must include storage element for ISA registers Datapath must support each register transfer Select set of datapath components and establish clocking methodology Assemble datapath components to meet requirements Analyze each instruction to determine sequence of control point settings to implement the register transfer

Assemble the control logic to perform this sequencing Dr Dan Garcia Instruction Level Parallelism P1 P2 Instr 1 IF ID Instr 2 IF Instr 3 Instr 4 Instr 5 Instr 6

P3 P4 P5 P6 P7 ALU MEM WR IF ID P8 P9

P 10 P 11 P 12 ALU MEM WR ID IF ID IF ALU MEM WR ALU MEM WR ID IF ALU MEM WR ID ALU MEM WR

IF Instr 7 ID IF Instr 8 ALU MEM WR ID IF Dr Dan Garcia ALU MEM WR ID ALU MEM WR

Summary CPU design involves Datapath, Control 5 Stages for MIPS Instructions 1. 2. 3. 4. 5. Instruction Fetch Instruction Decode & Register Read ALU (Execute) Memory Register Write Datapath timing: single long clock cycle or one short clock cycle per stage Dr Dan Garcia

Recently Viewed Presentations

  • Haciendo comparaciones Review: Comparing things that are not

    Haciendo comparaciones Review: Comparing things that are not

    El chico no es tan guapo como Ryan Reynolds. El equipo de Lamar no es tan bueno como el de MHS. La comida de Olive Garden no es tan sabrosa como la de Macaroni Grill. ... No tengo tanto dinero...
  • I am a Dalek - Our Happy School

    I am a Dalek - Our Happy School

    I AM A DALEK By: Nathan Buttigieg Class: Year 5 Roses Hello, my name is Nathan and today I'm going to tell you about a book called I am a Dalek. PLOT Rose and the Doctor want to go on...
  • Catholic Health Care: Maintaining Identity &amp; Integrity in ...

    Catholic Health Care: Maintaining Identity & Integrity in ...

    Help you gain awareness of how you tend to think about ethical issues and why you tend to think that way.. Improve your capacity to understand the perspectives of others and articulate strong and relevant support for your views.. Deepen...
  • From Measurement to Description

    From Measurement to Description

    The Method of Concomitant Variation is a quantitative approach to hypothesis formulation, in contrast with the qualitative Methods of Agreement and Difference. Contemporary epidemiology owes most of its development to the efforts of epidemiologists and biostatisticians in the middle of...
  • Policies to Stem the Brain Drain: Without Americanizing Canada

    Policies to Stem the Brain Drain: Without Americanizing Canada

    Tiebout Hypothesis: municipalities offer a variety of government services at a variety of prices (taxes). Individuals will relocate until their utility is maximized. Extend Tiebout Hypothesis (ETH); workers search for best mix of wages, jobs, taxes, public services and social...
  • AAE450 Spring 09 - Purdue Engineering

    AAE450 Spring 09 - Purdue Engineering

    Brad Appel Propulsion Group. Backup Slides: Various Empirical Curves. 5. MELCO 250 mN HT Isp (s) Power 1050 1170 1280 1375 1450 1540 1625 1600 1750 1900 2100 2300 2500 2650 Tank Mass Curvefit (Low Mass Systems) Xenon Mass [kg]...
  • PowerPoint Presentation

    PowerPoint Presentation

    Growth in Animals Animals grow throughout the whole organism many regions & tissues at different rates Growth in Plants Specific regions of growth: meristems stem cells: perpetually embryonic tissue regenerate new cells apical shoot meristem growth in length primary growth...
  • Présentation PowerPoint - EDIMARK

    Présentation PowerPoint - EDIMARK

    Ceci est un état des connaissances scientifiques issu de la revue de la littérature et des congrès et dont l'objectif est de fournir des informations sur l'état actuel de la recherche ; ainsi, certaines données présentées sont susceptibles de ne...