Modern Intel Microprocessors' Architecture and a Sneak Peek at the NVIDIA Tegra GPU


Published on March 28, 2011

Author: abhijeetnawal

Source: slideshare.net

Description

Briefly introduces the features of Intel's NetBurst, Core, and Nehalem architectures, along with the heterogeneous NVIDIA Tegra GPGPU.

An Architecture Perspective On Modern Microprocessors And GPU - AbhijeetNawal, 3/25/2011

Agenda:

INTRODUCTION
INTEL’S NETBURST ARCHITECTURE
INTEL’S CORE ARCHITECTURE
INTEL’S NEHALEM ARCHITECTURE
SNEAK PEEK AT NVIDIA TEGRA GPU
REFERENCES

Introduction

Superscalar homogeneous processors from Intel.

Performance = Frequency x IPC

Power = Dynamic Capacitance x Volts x Volts x Frequency

Dynamic capacitance is the ratio of the electrostatic charge on a conductor to the potential difference required to maintain that charge.

The more pipeline stages, the more instructions in flight in the pipeline.

More pipeline stages reduce IPC, since n instructions take k + (n - 1) cycles in a k-stage pipeline, giving IPC = n / (k + (n - 1)).

The lower IPC is offset by increasing the clock rate and reducing per-stage time.

Each instruction is CISC-based, so it is decoded into micro-operations.
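The formulas above can be sketched in a few lines of Python; all the numbers in the example are hypothetical, chosen only to illustrate the trade-off the slide describes.

```python
# Illustrative sketch of the slide's formulas; all numbers are hypothetical.

def performance(frequency_hz, ipc):
    """Performance = Frequency x IPC (retired instructions per second)."""
    return frequency_hz * ipc

def dynamic_power(capacitance, volts, frequency_hz):
    """Power = Dynamic Capacitance x Volts x Volts x Frequency."""
    return capacitance * volts * volts * frequency_hz

def pipeline_ipc(n_instructions, k_stages):
    """n instructions need k + (n - 1) cycles in a k-stage pipeline."""
    return n_instructions / (k_stages + n_instructions - 1)

# A deeper pipeline lowers IPC slightly but permits a higher clock:
deep = performance(3.0e9, pipeline_ipc(10_000, 20))    # 20 stages at 3 GHz
shallow = performance(1.0e9, pipeline_ipc(10_000, 5))  # 5 stages at 1 GHz
```

This is exactly the NetBurst bet: accept a small IPC loss from the deep pipeline in exchange for a much higher clock rate.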

Introduction…

Streaming SIMD Extensions (SSE):

SSE instructions perform 128-bit integer arithmetic and 128-bit SIMD double-precision floating-point operations.

They reduce the overall number of instructions required to execute a particular program task.

They accelerate a broad range of applications, including video, speech and image, photo processing, encryption, financial, engineering, and scientific applications.

Predecode phase:

Occurs before the instruction pipeline's fetch and decode phases.

Bundles instructions that can be executed in parallel.

Instructions are appended with extra bits after being fetched from memory, as they enter the instruction cache.

This unit is thus also responsible for analyzing structural, control, and data hazards.
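Why SIMD reduces instruction count can be shown with a simple counting model (this is not real SSE code, just arithmetic on instruction counts; the 4-lane figure assumes 32-bit elements in a 128-bit register):

```python
# Counting model only (no real SSE): a 128-bit register holds four
# 32-bit lanes, so one SIMD add replaces four scalar adds.

def scalar_add_count(n_elements):
    # one add instruction per element
    return n_elements

def simd_add_count(n_elements, lanes=4):
    # ceiling division: a final partial vector still costs one instruction
    return -(-n_elements // lanes)

# Adding 1024 floats: 1024 scalar adds versus 256 SIMD adds.
```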

Intel Architectures: NetBurst

NetBurst Architecture

NetBurst Microarchitecture

Features of Netburst Architecture

Hyper-Threading:

A single physical processor appears as two logical processors.

Each logical processor has its own set of registers and its own APIC (Advanced Programmable Interrupt Controller).

Increases resource utilization and improves performance.

Introduced SSE3 (Streaming SIMD Extensions 3).

Added some DSP-oriented instructions and some process (thread) management instructions.

Features of Netburst…

Hyper-Pipelined Technology:

A 20-stage pipeline.

Branch mispredictions can lead to very costly pipeline flushes.

Techniques to hide stall penalties include parallel execution, buffering, and speculation.

Three Major Components:

In-Order Issue Front End

Out-Of-Order Superscalar Execution Core

In-Order Retirement Unit
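The cost of those pipeline flushes can be folded into a simple average-CPI model (a standard back-of-the-envelope formula; the rates below are made-up example values, not NetBurst measurements):

```python
def avg_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Fold branch-misprediction stalls into an average CPI figure."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

# e.g. 20% branches, 5% of them mispredicted, ~20-cycle flush on a
# 20-stage pipeline: 1.0 + 0.20 * 0.05 * 20 = 1.2 cycles per instruction
example = avg_cpi(1.0, 0.20, 0.05, 20)
```

The deeper the pipeline, the larger the flush penalty, which is why NetBurst invests so heavily in branch prediction and speculation.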

Features of Netburst…

In-Order Issue Front End:

Two major parts:

Fetch/Decode Unit

Execution Trace Cache

Fetch/Decode Unit:

Prefetches IA-32 instructions that are likely to be executed (details under Prefetching).

Fetches instructions that have not already been prefetched.

Decodes instructions into µops and builds traces.

Features of Netburst…

Execution Trace Cache:

Acts as the middleman between the first decode stage and the execution stage.

Caches the decoded micro-operations of repeating instruction sequences, avoiding re-decoding.

Caches branch targets and delivers µops to execution.

Rapid Execution Engine:

The Arithmetic Logic Units (ALUs) run at twice the processor frequency, offsetting the low IPC factor.

Basic integer operations execute in half a processor clock tick.

Provides higher throughput and reduced execution latency.
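The "half a clock tick" claim follows directly from the double-pumped ALU clock; a minimal sketch (frequencies here are illustrative, not specific Pentium 4 SKUs):

```python
def basic_int_op_latency_ns(core_freq_ghz):
    """ALU clocked at 2x the core frequency: a simple integer op
    completes in half a core clock tick."""
    core_period_ns = 1.0 / core_freq_ghz
    return core_period_ns / 2.0

# At a 2 GHz core clock (0.5 ns period), a basic add takes 0.25 ns.
```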

Features of Netburst…

Out-of-Order Core:

Contains multiple execution hardware resources to execute multiple µops in parallel.

µops contending for a resource are buffered; meanwhile, other µops are executed.

Dependencies among µops are handled by appropriate buffering and by the in-order retirement logic of the retirement unit.

Register renaming logic aids in resolving conflicts.

Up to three µops may be retired per cycle.
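The retire-width limit gives a simple lower bound on retirement time, regardless of how fast the out-of-order core executes (a counting sketch, not a pipeline simulator):

```python
def min_retire_cycles(n_uops, retire_width=3):
    """Lower bound on cycles needed to retire n µops when at most
    `retire_width` retire per cycle, in program order."""
    return -(-n_uops // retire_width)  # ceiling division

# Retiring 10 µops at 3 per cycle needs at least 4 cycles.
```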

Features of Netburst…

The Branch Predictor:

Dynamically predicts the target of a branch instruction, based on its linear address, using the branch target buffer.

If no valid dynamic prediction is available, it predicts statically based on the offset of the target.

A backward branch is predicted to be taken, a forward branch is predicted to be not taken.

Return addresses are predicted using the 16-entry return address stack.

It does not predict far transfers, for example, far calls, interrupt returns and software interrupts.
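The static fallback rule above (backward taken, forward not taken) reduces to a single address comparison; a minimal sketch, with addresses as plain integers:

```python
def static_predict_taken(branch_addr, target_addr):
    """Fallback rule when the BTB has no valid entry: a backward
    branch (target below the branch address) is predicted taken,
    a forward branch is predicted not taken."""
    return target_addr < branch_addr

# A loop's back-edge jumps backward, so it is predicted taken.
```

The rule exploits the fact that backward branches are overwhelmingly loop back-edges, which are taken on every iteration but the last.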

Features of Netburst…

Prefetching is done by three techniques:

The hardware instruction fetcher

Prefetch instructions

Hardware that fetches data and instructions directly into the second-level cache

Caching:

Supports up to three levels of cache, all of them exclusive.

First level: separate data cache, instruction cache, and trace cache.

Heading to Core

Core Microarchitecture

Features of Core Architecture

Wide Dynamic Execution:

Each core is wider and can fetch, decode, and execute 4 instructions at a time, whereas NetBurst could execute only 3.

So a quad-core processor executes 16 instructions at once.

It adds more simple decoders than NetBurst had.

Decoders for x86 instructions:

Simple: translates to one micro-operation.

Complex: translates to more than one micro-op.

Wide Dynamic Execution…

Macrofusion:

In previous generation processors, each incoming instruction was individually decoded and executed.

Macrofusion enables common instruction pairs (such as a compare followed by a conditional jump) to be combined into a single internal instruction (micro-op) during decoding.

Increases the overall IPC and energy efficiency.

The architecture uses an enhanced Arithmetic Logic Unit (ALU) to support Macrofusion.
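The compare-plus-jump pairing can be sketched as a pass over an instruction stream; the mnemonic strings and the "any j… after a cmp" pairing rule are simplifications for illustration, not the hardware's actual fusion conditions:

```python
def macrofuse(instrs):
    """Fuse each CMP that is immediately followed by a conditional
    jump (simplified here to any mnemonic starting with 'j') into
    one combined entry, modeling one micro-op instead of two."""
    out, i = [], 0
    while i < len(instrs):
        if (instrs[i].startswith("cmp")
                and i + 1 < len(instrs)
                and instrs[i + 1].startswith("j")):
            out.append(instrs[i] + " + " + instrs[i + 1])  # fused pair
            i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out
```

Fewer internal entries per loop iteration means more useful work per cycle, which is where the IPC and energy gains come from.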


Advanced Digital Media Boost

NetBurst executed a 128-bit SSE instruction in two cycles, processing 64 bits per cycle.

Core executes one 128-bit SSE instruction in a single clock cycle.

Smart Memory Access

Memory disambiguation:

Intelligent algorithms identify which loads are independent of stores, or are safe to execute ahead of stores, ensuring that no data-location dependencies are violated.

If a load turns out to be invalid, the processor detects the conflict, reloads the correct data, and re-executes the instruction.
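The core check can be sketched as an alias test against the pending stores; real hardware predicts speculatively and matches partial addresses, but this simplified exact-address version captures the idea:

```python
def load_may_hoist(load_addr, pending_store_addrs):
    """A load may execute ahead of older stores only if it does not
    alias any pending store (simplified to exact address equality).
    On a wrong guess, the hardware replays the load and its dependents."""
    return load_addr not in pending_store_addrs

# A load from 0x100 can hoist past stores to 0x200 and 0x300,
# but a load from 0x200 must wait for the store to 0x200.
```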

Advanced Smart Cache

All execution cores share a single L2 cache instead of each having a separate one.

Data only has to be stored in one place that every core can access, optimizing cache resources.

When one core has minimal cache requirements, other cores can increase their share of the L2 cache.

Load-based sharing reduces cache misses and increases performance.

The advantages are a higher cache hit rate, reduced bus traffic, and lower latency to data.
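Demand-based sharing can be pictured with a toy allocation function; the real partitioning is managed dynamically by the hardware, so the proportional policy and the 4 MB size below are purely illustrative:

```python
def l2_share_kb(demand_a, demand_b, total_kb=4096):
    """Toy model: split a shared L2 between two cores in proportion
    to their demand (the real hardware policy is not proportional;
    this only illustrates that an idle core frees capacity)."""
    total = demand_a + demand_b
    if total == 0:
        return total_kb // 2, total_kb // 2
    a = round(total_kb * demand_a / total)
    return a, total_kb - a

# A core with 3x the demand ends up with 3x the cache.
```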

Intelligent Power Capability

Manages the runtime power consumption of all the processor’s execution cores.

Includes an advanced power-gating capability in which ultra fine-grained logic control turns on individual processor logic subsystems only if and when they are needed.

Many buses and arrays are split so that data required only in some modes of operation can be put in a low-power state when not needed.

Implementing power gating greatly reduced the power footprint compared to previous processors.

Heading to Nehalem

Nehalem Architecture: Enhancements Over the Core Microarchitecture

QuickPath Technology

Turbo Boost Technology

Hyper Threading

Smarter Cache

IPC Improvements

Enhanced Branch Prediction

Application Targeted Accelerators and SSE4.2

Intelligent Power Technology

Enhanced Virtualization Technology support

