



CPU Design and Pipelining – Pag





# Data Flow

- The better we can break up the execution of an instruction into its sub-cycles, the better we will be able to optimize the processor's performance
- This partitioning of the instruction's operation depends on the CPU design
- In general, there is a sequence of events that can be described that make up the execution of an instruction
  - Fetch cycle
  - Data fetch cycle
  - Indirect cycle
  - Execute cycle
  - Interrupt cycle

CSCI 4717 – Computer Architecture

CPU Design and Pipelining – Page 5























Improved Performance of Prefetch (continued)
Examining operation of prefetch appears to take half as many cycles as the number of instructions increases
Performance, however, is not doubled:

Fetch usually shorter than execution
Prefetch more than one instruction?

 Any jump or branch means that prefetched instructions are not the required instructions Add more stages to improve performance

CSCI 4717 – Computer Architecture CPU Design and Pipelining – Page 16

















Pipeline Performance Equations (continued)

- In general, *d* is equivalent to a clock pulse and  $\tau_m >> d$ .
- For *n* instructions with no branches, the total time required to execute all *n* instructions through a *k*-stage pipeline, *T<sub>k</sub>*, is:

$$T_k = [k + (n-1)]\tau$$

• It takes *k* cycles to fill the pipeline, then once cycle each for the remaining *n*-1 instructions.

CPU Design and Pipelining – Page 2

CSCI 4717 – Computer Architecture





## **Multiple Streams**

- Branch penalty is a result of having two possible paths of execution
- · Solution: Have two pipelines
- Prefetch each branch into a separate pipeline
- Once outcome of conditional branch is determined, use appropriate pipeline
- Competing for resources this method leads to bus & register contention
- More streams than pipes multiple branches lead to further pipelines being needed

CSCI 4717 – Computer Architecture

CPU Design and Pipelining – Page 29



5

## Loop Buffer

- Add a small, very fast memory
- · Maintained by fetch stage of pipeline
- Use it to contain the n most recently fetched instructions in sequence.
- Before taking a branch, see if branch target is in buffer

CPU Design and Pipelining – Page 31

- Similar in concept to a cache dedicated to instructions while maintaining an order of execution
- Used by CRAY-1

CSCI 4717 - Computer Architecture

# Loop Buffer Benefits Particularly effective with loops if the buffer is large enough to contain all of the instructions in a loop. Instructions only need to be fetched once. If executing from within the buffer, buffer acts like a prefetch by having all of the instructions already loaded into high-speed memory without having to access main memory or cache.

CPU Design and Pipelining – Page 3

CSCI 4717 – Computer Architecture









# Taken/Not taken switch

- · Storing one bit for history:
  - 0: last branch not taken
  - 1: last branch taken
  - Shortcoming is with loops where first branch is always predicted wrong since last time through loop, CPU didn't branch

CPU Design and Pipelining – Page 37

CPU Design and Pipelining – Page 39

- · Storing two bits for history:
  - 00: branch not taken, followed by branch taken
  - 01: branch taken, followed by branch not taken
  - 10: two branch taken in a row
  - 11: two branch not taken in a row
  - Can be optimized for loops

CSCI 4717 – Computer Architecture

CSCI 4717 – Computer Architecture



# Branch History Table There are three things that should be kept in the branchy history table Address of the branch instruction Bits indicating branch history Branch target information, i.e., where are we going

