Data Forwarding Hardware, Superscalar, VLIW Architecture


Advanced Computer Architecture-CS501

Advanced Computer Architecture

Lecture 21

Reading Material

Vincent P. Heuring & Harry F. Jordan, Computer Systems Design and Architecture

Chapter 5 (5.2)

Summary

• Data Forwarding Hardware
• Instruction Level Parallelism
• Difference between Pipelining and Instruction-Level Parallelism
• Superscalar Architecture
• Superscalar Design
• VLIW Architecture

Maximum Distance between two instructions

Example

Read page 219 of Computer Systems Design and Architecture (Vincent P. Heuring, Harry F. Jordan).

Data Forwarding Hardware

The concept of data forwarding was introduced in the previous lecture.

Last Modified: 01-Nov-06

RTL for data forwarding in case of ALU instructions

Dependence | RTL
Stage 3-5 | alu5&alu3:((ra5=rb3): X ← Z5, (ra5=rc3)&!imm3: Y ← Z5);
Stage 3-4 | alu4&alu3:((ra4=rb3): X ← Z4, (ra4=rc3)&!imm3: Y ← Z4);
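The forwarding RTL above can be read as a small selection function for the ALU inputs of the instruction in stage 3. The following Python sketch is only an illustration under stated assumptions: the Latch class, its field names and the forward function are hypothetical, and it assumes that when both rules match, the newer stage 4 result takes precedence over the stage 5 result.

```python
from dataclasses import dataclass


@dataclass
class Latch:
    """Hypothetical snapshot of the instruction held in one pipeline stage."""
    is_alu: bool = False  # alu3 / alu4 / alu5 in the RTL
    ra: int = 0           # destination register number
    rb: int = 0           # first source register number
    rc: int = 0           # second source register number
    imm: bool = False     # True when the second operand is an immediate (imm3)
    z: int = 0            # ALU result held in this stage (Z4 or Z5)


def forward(s3, s4, s5, x, y):
    """Return the (X, Y) ALU inputs for stage 3 after applying forwarding."""
    # Apply the stage 3-5 rule first so the stage 3-4 rule can override it
    # (assumption: the most recently computed value is the correct one).
    for src in (s5, s4):
        if src.is_alu and s3.is_alu:
            if src.ra == s3.rb:
                x = src.z  # (ra = rb3): X <- Z
            if src.ra == s3.rc and not s3.imm:
                y = src.z  # (ra = rc3) & !imm3: Y <- Z
    return x, y
```

For example, if the stage 3 instruction reads registers 1 and 2, stage 4 is about to write register 1 and stage 5 is about to write register 2, then X is taken from Z4 and Y from Z5 instead of the stale register-file values.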

Instruction-Level Parallelism

Increasing a processor's throughput

There are two ways to increase the number of instructions executed by a processor in a given time:
• By increasing the clock speed
• By increasing the number of instructions that can execute in parallel

Increasing the clock speed

• Increasing the clock speed is an IC design issue and depends on advancements in chip technology.
• The computer architect or logic designer therefore cannot manipulate clock speeds to increase the throughput of the processor.

Increasing parallel execution of instructions

The computer architect cannot increase the clock speed of a microprocessor; however, he/she can increase the number of instructions processed per unit time. In pipelining, we discussed that a number of instructions are executed in a staggered fashion, i.e. various instructions are simultaneously executing in different segments of the pipeline. Taking this concept a step further, we can have multiple data paths, so that multiple pipelines execute simultaneously. There are two main categories of such parallel instruction processors: VLIW (very long instruction word) and superscalar.

The two approaches to achieve instruction-level parallelism are:

Superscalar Architecture

A scalar processor that can issue multiple instructions simultaneously is said to be superscalar.

VLIW Architecture

A VLIW processor is based on a very long instruction word. VLIW relies on instruction scheduling by the compiler. The compiler forms instruction packets whose instructions have no dependencies on each other and can therefore run in parallel.


Difference between Pipelining and Instruction-Level Parallelism

Pipelining | Instruction-Level Parallelism
Single functional unit | Multiple functional units
Instructions are issued sequentially | Instructions are issued in parallel
Throughput increased by overlapping the instruction execution | Instructions are not overlapped but executed in parallel in multiple functional units
Very little extra hardware required to implement pipelining | Multiple functional units within the CPU are required
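As a rough numerical illustration of the comparison above, the following Python sketch counts ideal cycle totals for three cases: no pipelining, a single pipeline, and several identical pipelines issuing in parallel. It is a simplified model rather than a measurement: the function names are hypothetical, it assumes at least one instruction and one stage per clock cycle, and it ignores hazards, stalls and branches.

```python
import math


def unpipelined_cycles(n, k):
    """Each of the n instructions occupies all k stages before the next starts."""
    return n * k


def pipelined_cycles(n, k):
    """The first instruction takes k cycles; each later one completes 1 cycle after."""
    return k + (n - 1)


def multi_issue_cycles(n, k, w):
    """w instructions enter w parallel k-stage pipelines together every cycle."""
    return k + math.ceil(n / w) - 1
```

With 100 instructions and 5 stages, this model gives 500 cycles without pipelining, 104 cycles with one pipeline, and 54 cycles when two instructions are issued per cycle.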

Superscalar Architecture

A superscalar machine has the following typical features:
• It has one or more IUs (integer units), FPUs (floating point units), and BPUs (branch prediction units)
• It divides instructions into three classes
  o Integer
  o Floating point
  o Branch

The general operation of a superscalar processor is as follows:
• Fetch multiple instructions
• Partially decode them to determine their class
• Dispatch them to the corresponding functional unit

As stated earlier, the superscalar design uses multiple pipelines to implement instruction-level parallelism.
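The fetch, partial decode and dispatch steps listed above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the mnemonics, the OPCODE_CLASS table, the fetch width of three and the dispatch function are all hypothetical, and a real decoder would inspect opcode bits rather than text.

```python
from collections import deque

# Hypothetical opcode-to-class table used for the partial decode step.
OPCODE_CLASS = {
    "add": "integer", "sub": "integer",
    "fadd": "float", "fmul": "float",
    "br": "branch",
}


def dispatch(instructions, fetch_width=3):
    """Fetch up to fetch_width instructions per cycle and queue each at its unit."""
    queues = {"integer": deque(), "float": deque(), "branch": deque()}
    pending = deque(instructions)
    cycles = 0
    while pending:
        # Fetch multiple instructions in one cycle.
        count = min(fetch_width, len(pending))
        group = [pending.popleft() for _ in range(count)]
        for instr in group:
            opcode = instr.split()[0]       # partial decode: look at the opcode only
            unit = OPCODE_CLASS[opcode]     # determine the instruction class
            queues[unit].append(instr)      # dispatch to the matching functional unit
        cycles += 1
    return queues, cycles
```

Each queue stands in for the input of one functional unit, so instructions of different classes can proceed in their own pipelines.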

Operation of Branch Prediction Unit

• The BPU calculates the branch target address ahead of time to save CPU cycles
• Branch instructions are routed from the queue to the BPU, where the target address is calculated and supplied when required without any stalls
• The BPU also starts executing branch instructions speculatively and discards the results if the prediction turns out to be wrong

Superscalar Design

The philosophy behind a superscalar design is
• to prefetch and decode as many instructions as possible before execution
• to start several branch instruction streams speculatively on the basis of this decoding
• and finally, to discard all but the correct stream of execution

The superscalar architecture issues multiple instructions at a time and uses techniques such as branch prediction and speculative instruction execution, i.e. it speculates on whether a particular branch will be taken and then continues to execute the branch and the instructions that follow it. The results are not written back to the registers until the branch decision is confirmed. Most superscalar architectures contain a reorder buffer, which acts as an intermediary between the processor and the register file. All results are written to the reorder buffer, and when the speculated course of action is confirmed, the contents of the reorder buffer are committed to the register file.
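The commit-or-discard behaviour of the reorder buffer described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the ReorderBuffer class and its method names are hypothetical, the register file is modelled as a dictionary, and only a single outstanding speculated branch is considered.

```python
class ReorderBuffer:
    """Toy buffer that holds speculative results until the branch is resolved."""

    def __init__(self, register_file):
        self.register_file = register_file  # dict: register name -> value
        self.pending = []                   # speculative (register, value) pairs, in program order

    def write(self, reg, value):
        """Record a speculative result without touching the register file."""
        self.pending.append((reg, value))

    def resolve(self, prediction_correct):
        """Commit the buffered results if the speculation was right, else drop them."""
        if prediction_correct:
            for reg, value in self.pending:
                self.register_file[reg] = value
        self.pending.clear()
```

Until resolve is called, the register file keeps its old contents, which is what allows a wrong speculation to be undone simply by discarding the buffer.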

Superscalar Processors

Examples of superscalar processors

o PowerPC 601

o Intel P6

o DEC Alpha 21164

VLIW Architecture

VLIW stands for "Very Long Instruction Word"; the instruction word is typically 64 or 128 bits wide. The longer instruction word carries information to route data to register files and execution units. Execution-order decisions are made at compile time, unlike in the superscalar design, where they are made at run time. Branch instructions are not handled very efficiently in this architecture. A VLIW compiler makes use of techniques such as loop unrolling and code reordering to minimize dependencies and the occurrence of branch instructions.
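To make the compile-time scheduling idea concrete, the following Python sketch groups a straight-line instruction sequence into packets whose members do not depend on each other. It is a deliberately simple, in-order greedy scheme and not the algorithm of any particular VLIW compiler: the function names, the (destination, sources) instruction encoding and the packet width of four are assumptions, and the dependence check is conservative.

```python
def conflicts(earlier, later):
    """True if 'later' cannot share a packet with 'earlier'."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    return (e_dest in l_srcs       # read after write
            or e_dest == l_dest    # write after write
            or l_dest in e_srcs)   # write after read (conservative)


def form_packets(instrs, width=4):
    """Greedily pack instructions, in program order, into dependency-free packets."""
    packets = []
    for instr in instrs:
        current = packets[-1] if packets else None
        if (current is not None and len(current) < width
                and not any(conflicts(prev, instr) for prev in current)):
            current.append(instr)      # independent of everything in the packet
        else:
            packets.append([instr])    # dependency or full packet: start a new one
    return packets
```

Each instruction is written as a (destination, sources) pair, for example ("r7", ("r1", "r4")). An instruction that reads a register written earlier in the current packet starts a new packet, so every packet can be issued as one long instruction word.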
