• Instruction decode/register fetch (ID). Decode the instruction and read the source registers from the register file.
A ← Regs[IR6..10]; B ← Regs[IR11..15]
Sign-extend the offset (displacement) field of the instruction.
Imm ← sign-extend(IR16..31)
Check for a possible branch (by comparing the values read from the source registers).
Cond ← (A rel B)
Compute the branch-target address by adding the sign-extended offset to the incremented PC.
ALU_Output ← NPC + Imm
If the branch is taken, store the branch-target address into the PC.
If (Cond) PC ← ALU_Output, else PC ← NPC
What feature of the ISA makes it possible to read the registers in this stage?
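These register transfers can be sketched in a few lines of Python. The function and helper names are mine, not from the notes; the notes number bits MSB-first, so field IR6..10 maps to bits 25..21 of a 32-bit word in conventional LSB-0 numbering.

```python
def sign_extend(value, bits=16):
    """Sign-extend a `bits`-wide field to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def decode_stage(ir, regs, npc, rel):
    """Model of ID: read sources, sign-extend Imm, resolve the branch."""
    rs1 = (ir >> 21) & 0x1F           # IR6..10
    rs2 = (ir >> 16) & 0x1F           # IR11..15
    a, b = regs[rs1], regs[rs2]       # A <- Regs[IR6..10]; B <- Regs[IR11..15]
    imm = sign_extend(ir & 0xFFFF)    # Imm <- sign-extend(IR16..31)
    cond = rel(a, b)                  # Cond <- (A rel B)
    alu_output = npc + imm            # branch target = NPC + Imm
    pc = alu_output if cond else npc  # If (Cond) PC <- ALU_Output, else NPC
    return a, b, imm, cond, pc
```

For example, a taken "branch if equal" with both sources holding 5, NPC = 100, and offset 8 yields PC = 108.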
• Execute/compute effective address (EX). The ALU operates on the operands, performing one of three types of functions, depending on the opcode.
Ø Memory reference: the ALU adds the base register and the sign-extended offset to form the effective address. ALU_Output ← A + Imm
Ø Register-register instruction: the ALU performs the operation on the values read from the register file. ALU_Output ← A op B
Ø Register-immediate instruction: the ALU performs the operation on the first register value and the sign-extended immediate.
Lecture 14
Advanced Microprocessor Design
ALU_Output ← A op Imm
In a load-store architecture, execution can be done at the same time as effective-address computation because no instruction needs to both compute an effective address and perform an ALU operation on memory data.
• Memory access (MEM).
Load_Mem_Data ← Mem[ALU_Output] /* Load */
Mem[ALU_Output] ← B /* Store */
• Write-back (WB). If the instruction is register-register, register-immediate, or a load, the result is written into the register file at the address specified by the destination operand.
Reg-Reg ALU operation: Regs[IR16..20] ← ALU_Output
Reg-Immediate ALU operation: Regs[IR11..15] ← ALU_Output
Load instruction: Regs[IR11..15] ← Load_Mem_Data
In this implementation, some instructions require 2 cycles, some require 4, and some require 5.
• 2 cycles: branches (they complete in ID).
• 4 cycles: stores (no write-back).
• 5 cycles: all other instructions (loads and ALU operations).
Assuming the instruction frequencies from the integer benchmarks mentioned in the last lecture, what’s the CPI of this architecture?
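The notes don’t reproduce last lecture’s frequencies here, so as an illustration, assume a mix of 20% branches, 10% stores, and 70% other instructions (the same mix used in the speedup example below). CPI is just the frequency-weighted average of the cycle counts:

```python
# CPI of the multicycle machine under an assumed instruction mix
# (the actual frequencies come from the previous lecture).
mix = {  # name: (fraction of instructions, cycles each takes)
    "branch": (0.20, 2),
    "store":  (0.10, 4),
    "other":  (0.70, 5),  # loads and ALU operations
}
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(cpi)  # 0.2*2 + 0.1*4 + 0.7*5 = 4.3
```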
Pipelining our RISC It’s easy to pipeline this architecture: just make each clock cycle into a pipe stage.
Clock #        1     2     3     4     5     6     7     8     9
Instr. i       IF    ID    EX    MEM   WB
Instr. i+1           IF    ID    EX    MEM   WB
Instr. i+2                 IF    ID    EX    MEM   WB
Instr. i+3                       IF    ID    EX    MEM   WB
Instr. i+4                             IF    ID    EX    MEM   WB
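With five stages and no stalls, n overlapped instructions finish in n + 4 cycles: the first instruction takes 5 cycles, then one completes per cycle. A quick sketch to check the arithmetic (the function name is mine, not from the notes):

```python
# Cycle count for n instructions on a k-stage stall-free pipeline:
# k cycles to fill the pipe, then one instruction retires per cycle.
def pipeline_cycles(n, k=5):
    return k + n - 1

print(pipeline_cycles(5))  # 5 instructions, 5 stages -> 9 cycles
```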
Here is a diagram of our instruction pipeline.
[Figure: pipelined datapath, stages left to right — Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), Writeback (WB). Components include the PC, NPC adder (+4), instruction cache, IR (instruction register), register file with A and B outputs, sign-extend unit producing Imm, MUXes feeding the ALU, the ALU, a condition (cond) test, the data cache, and the LMD (load memory data) register.]
In this pipeline, the major functional units are used in different cycles, so overlapping the execution of instructions introduces few conflicts. • Separating the instruction and data caches eliminates a conflict that would arise in the IF and MEM stages. Of course, we have to access these caches faster than we would in an unpipelined processor. • The register file is used in two stages:
reads (in ID) and writes (in WB). Thus, we need to perform two reads and one write each clock cycle.
To handle reads and writes to the same register, we write in the first half of the clock cycle and read in the second half. • Something is incomplete about our diagram of the IF stage. What?
We’ve omitted one thing from the diagram above: We need a place to save values between pipeline stages. Otherwise, the different instructions in the pipeline would interfere with each other.
What is our pipeline speedup, then? Of course, we have to allow for latch-delay time. We also need to allow for clock skew, the maximum delay between when the clock arrives at any two registers. Let’s define To'head = Tlatch + Tskew.

Speedup = (avg. unpipelined execution time) / (avg. pipelined execution time)
        = Tunpipe / (Tunpipe/n + To'head)
        = n (ideal case where To'head = 0)

Example: Consider the unpipelined processor in the previous example. Assume—
• Clock cycle is 1 ns.
• Branch instructions, 20% of the total, take 2 cycles.
• Store instructions, 10% of the total, take 4 cycles.
• All other instructions take 5 cycles.
• Clock skew and latch delay add 0.2 ns to the cycle time.
What is the speedup from pipelining?
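One way to work this example, using the numbers just given (and assuming the pipelined machine achieves CPI = 1):

```python
cycle = 1.0     # ns, unpipelined clock
overhead = 0.2  # ns, Tlatch + Tskew

# Average unpipelined time per instruction = weighted CPI * cycle time.
avg_cpi = 0.20 * 2 + 0.10 * 4 + 0.70 * 5  # = 4.3
t_unpipelined = avg_cpi * cycle           # 4.3 ns per instruction

# Pipelined time per instruction = one (stretched) cycle, assuming CPI = 1.
t_pipelined = cycle + overhead            # 1.2 ns per instruction

speedup = t_unpipelined / t_pipelined
print(round(speedup, 2))  # -> 3.58
```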
How can pipelining help?
How can pipelining improve performance?
• If we keep CT constant, by improving CPI …
[Figure: a machine whose IF/ID and MEM/WB halves are unpipelined (50 ns each) vs. the five-stage pipeline, which overlaps the IF ID EX MEM WB sequences of successive instructions to lower CPI.]
• If we keep CPI constant, by improving CT …
[Figure: the 50 ns unpipelined cycle split into shorter pipelined sub-stages, e.g. a 25 ns IF1, halving the cycle time.]
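As a sanity check on the second point, throughput is 1 / (CPI × CT), so halving CT at constant CPI doubles throughput. A minimal sketch (the 50 ns and 25 ns figures follow the diagram above):

```python
# Throughput in instructions per ns = 1 / (CPI * cycle time).
def throughput(cpi, ct_ns):
    return 1.0 / (cpi * ct_ns)

unpipelined = throughput(1, 50)  # 50 ns cycle
pipelined = throughput(1, 25)    # stages split into 25 ns each
print(pipelined / unpipelined)   # -> 2.0
```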
Structural hazards
Consider a pipeline with a unified instruction-data cache.

Clock #        1     2     3     4     5     6     7     8     9     10
Load instr.    IF    ID    EX    MEM   WB
Instr. i+1           IF    ID    EX    MEM   WB
Instr. i+2                 IF    ID    EX    MEM   WB
Instr. i+3                       stall IF    ID    EX    MEM   WB
Instr. i+4                                   IF    ID    EX    MEM   WB
Instr. i+5                                         IF    ID    EX    MEM
Instr. i+6                                               IF    ID    EX
Instruction i+3 has to stall, because the load instruction “steals” an instruction-fetch cycle. In this pipeline, what kind of instructions (what “opcodes”) cause structural hazards?
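This schedule can be modeled with a toy scheduler (my own sketch, not part of the notes): with a unified cache, an instruction fetch must stall in any cycle where an earlier memory instruction occupies the cache in its MEM stage.

```python
# Toy model of the unified-cache structural hazard above.
def schedule(is_mem_access):
    """is_mem_access[i] is True if instruction i accesses data memory.
    Returns the IF cycle of each instruction (cycles numbered from 1)."""
    mem_busy = set()  # cycles in which the cache serves a data access
    if_cycles = []
    next_if = 1
    for uses_mem in is_mem_access:
        while next_if in mem_busy:   # structural hazard: stall the fetch
            next_if += 1
        if_cycles.append(next_if)
        if uses_mem:
            mem_busy.add(next_if + 3)  # MEM comes 3 cycles after IF
        next_if += 1
    return if_cycles

# A load followed by six non-memory instructions:
# instruction i+3 fetches in cycle 5 instead of 4, as in the table.
print(schedule([True] + [False] * 6))  # [1, 2, 3, 5, 6, 7, 8]
```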