• Instruction decode/register fetch (ID). Decode the instruction and read the source registers from the register file.
A ← Regs[IR6..10]; B ← Regs[IR11..15]
Sign-extend the offset (displacement) field of the instruction.
Imm ← sign-extend(IR16..31)
Check for a possible branch (by comparing the values read from the source registers).
Cond ← (A rel B)
Compute the branch-target address by adding the sign-extended offset to the incremented PC.
ALU_Output ← NPC + Imm
If the branch is taken, store the branch-target address into the PC.
If (Cond) PC ← ALU_Output, else PC ← NPC
What feature of the ISA makes it possible to read the registers in this stage?
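These register transfers can be sketched in a few lines of Python. The function and helper names are mine, not from the notes; the notes number bits MSB-first, so field IR6..10 maps to bits 25..21 of a 32-bit word in conventional LSB-0 numbering.

```python
def sign_extend(value, bits=16):
    """Sign-extend a `bits`-wide field to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def decode_stage(ir, regs, npc, rel):
    """Model of ID: read sources, sign-extend Imm, resolve the branch."""
    rs1 = (ir >> 21) & 0x1F           # IR6..10
    rs2 = (ir >> 16) & 0x1F           # IR11..15
    a, b = regs[rs1], regs[rs2]       # A <- Regs[IR6..10]; B <- Regs[IR11..15]
    imm = sign_extend(ir & 0xFFFF)    # Imm <- sign-extend(IR16..31)
    cond = rel(a, b)                  # Cond <- (A rel B)
    alu_output = npc + imm            # branch target = NPC + Imm
    pc = alu_output if cond else npc  # If (Cond) PC <- ALU_Output, else NPC
    return a, b, imm, cond, pc
```

For example, a taken "branch if equal" with both sources holding 5, NPC = 100, and offset 8 yields PC = 108.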
• Execute/compute effective address (EX). The ALU operates on the operands, performing one of three types of functions, depending on the opcode.
Ø Memory reference: the ALU adds the base register and the sign-extended offset to form the effective address. ALU_Output ← A + Imm
Ø Register-register instruction: the ALU performs the operation on the values read from the register file. ALU_Output ← A op B
Ø Register-immediate instruction: the ALU performs the operation on the first register value and the sign-extended immediate.
Lecture 14
Advanced Microprocessor Design
ALU_Output ← A op Imm
In a load-store architecture, execution can be done at the same time as effective-address computation because no instruction needs to both compute an effective address and perform an ALU operation on memory data.
• Memory access (MEM).
Load_Mem_Data ← Mem[ALU_Output] /* Load */
Mem[ALU_Output] ← B /* Store */
• Write-back (WB). If the instruction is register-register, register-immediate, or a load, the result is written into the register file at the address specified by the destination operand.
Reg-Reg ALU operation: Regs[IR16..20] ← ALU_Output
Reg-Immediate ALU operation: Regs[IR11..15] ← ALU_Output
Load instruction: Regs[IR11..15] ← Load_Mem_Data
In this implementation, some instructions require 2 cycles, some require 4, and some require 5.
• 2 cycles: branches (they complete in ID).
• 4 cycles: stores (no write-back).
• 5 cycles: all other instructions (loads and ALU operations).
Assuming the instruction frequencies from the integer benchmarks mentioned in the last lecture, what’s the CPI of this architecture?
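The notes don’t reproduce last lecture’s frequencies here, so as an illustration, assume a mix of 20% branches, 10% stores, and 70% other instructions (the same mix used in the speedup example below). CPI is just the frequency-weighted average of the cycle counts:

```python
# CPI of the multicycle machine under an assumed instruction mix
# (the actual frequencies come from the previous lecture).
mix = {  # name: (fraction of instructions, cycles each takes)
    "branch": (0.20, 2),
    "store":  (0.10, 4),
    "other":  (0.70, 5),  # loads and ALU operations
}
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(cpi)  # 0.2*2 + 0.1*4 + 0.7*5 = 4.3
```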
Pipelining our RISC It’s easy to pipeline this architecture: just make each clock cycle into a pipe stage.
Clock #        1     2     3     4     5     6     7     8     9
Instr. i       IF    ID    EX    MEM   WB
Instr. i+1           IF    ID    EX    MEM   WB
Instr. i+2                 IF    ID    EX    MEM   WB
Instr. i+3                       IF    ID    EX    MEM   WB
Instr. i+4                             IF    ID    EX    MEM   WB
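With five stages and no stalls, n overlapped instructions finish in n + 4 cycles: the first instruction takes 5 cycles, then one completes per cycle. A quick sketch to check the arithmetic (the function name is mine, not from the notes):

```python
# Cycle count for n instructions on a k-stage stall-free pipeline:
# k cycles to fill the pipe, then one instruction retires per cycle.
def pipeline_cycles(n, k=5):
    return k + n - 1

print(pipeline_cycles(5))  # 5 instructions, 5 stages -> 9 cycles
```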
Here is a diagram of our instruction pipeline.
[Figure: pipelined datapath, stages left to right — Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), Writeback (WB). Components include the PC, NPC adder (+4), instruction cache, IR (instruction register), register file with A and B outputs, sign-extend unit producing Imm, MUXes feeding the ALU, the ALU, a condition (cond) test, the data cache, and the LMD (load memory data) register.]
In this pipeline, the major functional units are used in different cycles, so overlapping the execution of instructions introduces few conflicts. • Separating the instruction and data caches eliminates a conflict that would arise in the IF and MEM stages. Of course, we have to access these caches faster than we would in an unpipelined processor. • The register file is used in two stages:
reads (in ID) and writes (in WB). Thus, we need to perform two reads and one write each clock cycle.
To handle reads and writes to the same register, we write in the first half of the clock cycle and read in the second half. • Something is incomplete about our diagram of the IF stage. What?
We’ve omitted one thing from the diagram above: We need a place to save values between pipeline stages. Otherwise, the different instructions in the pipeline would interfere with each other.
What is our pipeline speedup, then? Of course, we have to allow for latch-delay time. We also need to allow for clock skew, the maximum delay between when the clock arrives at any two registers. Let’s define To'head = Tlatch + Tskew.

Speedup = (avg. unpipelined execution time) / (avg. pipelined execution time)
        = Tunpipe / (Tunpipe/n + To'head)
        = n (ideal case where To'head = 0)

Example: Consider the unpipelined processor in the previous example. Assume—
• Clock cycle is 1 ns.
• Branch instructions, 20% of the total, take 2 cycles.
• Store instructions, 10% of the total, take 4 cycles.
• All other instructions take 5 cycles.
• Clock skew and latch delay add 0.2 ns to the cycle time.
What is the speedup from pipelining?
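One way to work this example, using the numbers just given (and assuming the pipelined machine achieves CPI = 1):

```python
cycle = 1.0     # ns, unpipelined clock
overhead = 0.2  # ns, Tlatch + Tskew

# Average unpipelined time per instruction = weighted CPI * cycle time.
avg_cpi = 0.20 * 2 + 0.10 * 4 + 0.70 * 5  # = 4.3
t_unpipelined = avg_cpi * cycle           # 4.3 ns per instruction

# Pipelined time per instruction = one (stretched) cycle, assuming CPI = 1.
t_pipelined = cycle + overhead            # 1.2 ns per instruction

speedup = t_unpipelined / t_pipelined
print(round(speedup, 2))  # -> 3.58
```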
How can pipelining help?
How can pipelining improve performance?
• If we keep CT constant, by improving CPI …
[Figure: a machine whose IF/ID and MEM/WB halves are unpipelined (50 ns each) vs. the five-stage pipeline, which overlaps the IF ID EX MEM WB sequences of successive instructions to lower CPI.]
• If we keep CPI constant, by improving CT …
[Figure: the 50 ns unpipelined cycle split into shorter pipelined sub-stages, e.g. a 25 ns IF1, halving the cycle time.]
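As a sanity check on the second point, throughput is 1 / (CPI × CT), so halving CT at constant CPI doubles throughput. A minimal sketch (the 50 ns and 25 ns figures follow the diagram above):

```python
# Throughput in instructions per ns = 1 / (CPI * cycle time).
def throughput(cpi, ct_ns):
    return 1.0 / (cpi * ct_ns)

unpipelined = throughput(1, 50)  # 50 ns cycle
pipelined = throughput(1, 25)    # stages split into 25 ns each
print(pipelined / unpipelined)   # -> 2.0
```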
Structural hazards
Consider a pipeline with a unified instruction-data cache.

Clock #        1     2     3     4     5     6     7     8     9     10
Load instr.    IF    ID    EX    MEM   WB
Instr. i+1           IF    ID    EX    MEM   WB
Instr. i+2                 IF    ID    EX    MEM   WB
Instr. i+3                       stall IF    ID    EX    MEM   WB
Instr. i+4                                   IF    ID    EX    MEM   WB
Instr. i+5                                         IF    ID    EX    MEM
Instr. i+6                                               IF    ID    EX
Instruction i+3 has to stall, because the load instruction “steals” an instruction-fetch cycle. In this pipeline, what kind of instructions (what “opcodes”) cause structural hazards?
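This schedule can be modeled with a toy scheduler (my own sketch, not part of the notes): with a unified cache, an instruction fetch must stall in any cycle where an earlier memory instruction occupies the cache in its MEM stage.

```python
# Toy model of the unified-cache structural hazard above.
def schedule(is_mem_access):
    """is_mem_access[i] is True if instruction i accesses data memory.
    Returns the IF cycle of each instruction (cycles numbered from 1)."""
    mem_busy = set()  # cycles in which the cache serves a data access
    if_cycles = []
    next_if = 1
    for uses_mem in is_mem_access:
        while next_if in mem_busy:   # structural hazard: stall the fetch
            next_if += 1
        if_cycles.append(next_if)
        if uses_mem:
            mem_busy.add(next_if + 3)  # MEM comes 3 cycles after IF
        next_if += 1
    return if_cycles

# A load followed by six non-memory instructions:
# instruction i+3 fetches in cycle 5 instead of 4, as in the table.
print(schedule([True] + [False] * 6))  # [1, 2, 3, 5, 6, 7, 8]
```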