Optimizing DirectX on Multi-core Architectures

Published on May 19, 2008

Author: psteinb

Source: slideshare.net

Description

This slide set covers best practices in designing threaded rendering in PC games. Examples of current PC titles will be used throughout the talk to highlight the various points.

Game Developers Conference 2008

Optimizing DirectX on Multi-core Architectures

Leigh Davies, Senior Application Engineer, Intel, February 2008

Contributions from:

David Potages, GRIN*

Jeff Andrews, Intel®

Rita Turkowski, Intel®

Kev Gee, Microsoft*

*Other names and brands may be claimed as the property of others.

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

Intel may make changes to specifications and product descriptions at any time, without notice.

All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2008 Intel Corporation.

Agenda

Graphics and the CPU

Profiling Graphics and Drivers

Threading the render thread

Case Study GRIN*

Summary

Graphics is CPU Intensive

D3D Runtime and Driver account for 25-40% of CPU cycles per frame

[Charts: CPU time per frame split into Application, D3D Runtime, Driver and Other, for World in Conflict*, Bionic Commando*, Crysis* CPU Benchmark and Crysis* GPU Benchmark]

Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2 GB memory.

Designing the Rendering Pipeline

Analyze the whole program

Your Application

Direct3D API usage and overheads

Video card driver

Have Defined Performance Goals

Use key game play targeted scenarios for perf analysis

Build benchmarks / test levels

[Diagram: Application → Direct3D* Runtime → Command Buffer → Software Driver → Video Card]

[Figure labels: Render Functions, World in Conflict*]

DX9 API Call** | Cycle count
SetVertexShader | 3000-12100
SetPixelShaderConstant | 1500-9000
SetTexture | 2500-3100
DrawPrimitive | 1050-1150
ZFUNC | 510-700

**Timings taken from msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx
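To see why these per-call costs matter, the cycle ranges quoted from MSDN on this slide can be turned into a rough per-frame CPU budget. The sketch below is illustrative only: the call counts per frame are hypothetical, the clock rate is whatever machine you profile on, and the names ApiCost, CallCount and frame_cost_ms are made up for this example.

```cpp
#include <cassert>
#include <vector>

// Cycle range for one DX9 API call, as quoted from the MSDN timings.
struct ApiCost {
    const char* name;
    double min_cycles;
    double max_cycles;
};

// Hypothetical workload entry: how often a call is issued per frame.
struct CallCount {
    ApiCost cost;
    int calls_per_frame;
};

struct CostRangeMs {
    double min_ms;
    double max_ms;
};

// Convert a cycle count to milliseconds at the given clock rate.
inline double cycles_to_ms(double cycles, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return cycles / (clock_ghz * 1e6);  // 1 GHz = 1e6 cycles per millisecond
}

// Sum the best-case and worst-case CPU time spent in the listed API calls.
inline CostRangeMs frame_cost_ms(const std::vector<CallCount>& workload,
                                 double clock_ghz) {
    CostRangeMs total{0.0, 0.0};
    for (const CallCount& entry : workload) {
        total.min_ms += cycles_to_ms(entry.cost.min_cycles * entry.calls_per_frame, clock_ghz);
        total.max_ms += cycles_to_ms(entry.cost.max_cycles * entry.calls_per_frame, clock_ghz);
    }
    return total;
}
```

For instance, 1,000 DrawPrimitive calls at 1050-1150 cycles each come to roughly 0.39-0.43 ms at 2.67 GHz, before any state changes are counted.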

Balancing Future Workloads

[Diagram: Intel® roadmap on a tick-tock cadence, 2 years per process node. 65nm: Intel Core™ Duo and Pentium-D (compaction/derivative), then Intel Core™ Microarchitecture (Intel Core™2 Duo, DC Intel Xeon® 5100). 45nm: PENRYN (compaction/derivative), then NEHALEM (new microarchitecture).]

Scalable & Configurable Cache, Interconnects & Memory Controllers

Scalable Performance: 1 to 8 Threads & 1 to 4 Cores

Time is Money

Optimise the graphics thread. Offload as much as possible.

Be realistic, Rendering Costs CPU Time

Rendering thread potential bottleneck for N-Core scaling

Rendering costs likely to increase as you add more physics, effects or even AI objects

Runtime and driver costs are significantly higher on the PC than the consoles

Use Performance Analysis results to focus development efforts

Analyze regularly and catch regressions early

Overview of Graphics Driver Models

Windows* XP Display Model XPDM - DX* - DX9

The kernel-mode driver controls threading

Windows Vista* Display Driver Model WDDM - DX9

The D3D9 runtime manages creation of threads

One is created specifically for the User Mode Driver (UMD)

Windows Vista Display Driver Model WDDM - DX10

The driver is responsible for creating threads

Currently released drivers don’t thread

Could change in the near future

The graphics driver can have a major impact on performance and multi-core scaling.

Profiling Tools

Need to use a variety of tools

Use a repeatable workload

CPU tools:

VTune™ Performance Analyzer

Intel® Thread Profiler

PIX for Windows*

AMD CodeAnalyst™

GPU tools:

PIX for Windows with vendor plug-ins

NVIDIA* PerfHUD

ATI* PerfStudio

Profiling Graphics with VTune™ Analyzer

Select Counter Monitor for a quick overview;

Not necessary to launch the app

Disable display of counter data unless running windowed

Profile across a selection of configurations

Identify different bottlenecks based on h/w limitations

“Works great on my machine” isn’t good enough

VTune™ Performance Analyzer - Sampling

Calibration isn’t needed for games

Delay sampling allows alt-tab or bypass loading

Tracking core usage needs to be added

Privileged time shows time inside Kernel

VTune™ Analyzer Views

Processor Usage

Memory Usage

Context Switching

CPU Frequency

VTune™ Analyzer allows you to add your own counters.

Sampling - Display Model XPDM

[Diagram: XPDM stack across user mode, kernel mode and session space: Application, D3D Runtime, Win32k & Dxg, Display Driver, Miniport Driver, Videoport]

Sampling - Display Model WDDM

[Diagram: WDDM stack across user mode, kernel mode and session space: application process (Application, D3D Runtime, User Mode Driver), DWM process (DWM), Win32k, CDD, Dxgkrnl, Kernel Driver]

Associating Symbols in VTune™ Analyzer

Configure->Options->Directories->Symbol Repository

View Symbol Repository->Delete unassociated modules

In Tuning Browser select "Results" -> "Module Associations..."

Edit symbol associations

Symbol Information for DX10Core.dll

Symbols taken while profiling the SoftParticle sample from the SDK

PIX for Windows

Gathering GPU events requires Windows Vista

Cross over between PIX and VTune ™ Counters

Easy to see CPU/GPU headroom

Intel® PIX Plug-in: Beta Available Now

Provides access to Intel® Counters in PIX

Rollout now to support IIG Profiling

# | Metric Name | Description
1 | Frame Time | Instantaneous frame time in milliseconds.
2 | Frames per Second | Instantaneous frame rate normalized to seconds (inverted frame time).
3 | Driver Time | The amount of time spent in the display driver, normalized to milliseconds.
4 | Driver Time Stalled | The amount of time spent in the display driver either busy stalled or in a sleep state, normalized to milliseconds.
5 | Graphics Memory Used - MB | The amount of graphics memory currently utilized, normalized to MB.
6 | Graphics Memory Used - bytes | The amount of graphics memory currently utilized, normalized to bytes.
7 | Texture Memory Used | The amount of texture memory currently utilized, normalized to MB.
8 | GPU Busy | The percent utilization of the front end of the GPU. This metric shall describe the incoming command stream and does NOT describe the utilization of the array of execution units (cores).
9 | Cores Busy | The percentage of time that any core in the array is either actively executing instructions or stalled.
10 | Cores Active | The percentage of time that the core array is actively executing instructions.
11 | Vertex Count | The number of vertices that entered the pipeline.
12 | Triangle Count | The number of triangles that flowed through the pipeline prior to any clipping or culling.
13 | Texel Count | The number of texels that were fetched by the pipeline.
14 | Pixels Drawn | The number of pixels that were actually written to the render target.
15 | Mathbox Utilization | The aggregated percentage of time that the mathbox was actively executing instructions.
16 | Texture Unit(s) Utilization | The aggregated percentage of time that the texture units were actively processing texels.

Starting Points

Common Issues:

Naive Ports to Windows from console models

Excessive context switching/synchronization overhead

Work starvation due to thread sync dependencies

General Rules

Use only 1 heavy weight thread per Core on Windows

Manage Job distribution

The OS scheduler knows best

Consider memory bandwidth

Multi-core and D3D Usage

Avoid Use of the D3DCREATE_MULTITHREADED flag

You CAN manage synch costs better

Design around a single threaded D3D Device Access model

Lock resources from main thread, manually protect access
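A single-threaded device access model can be sketched as a queue that any thread may fill but only one owner thread drains. This is a minimal illustration under assumptions, not the talk's implementation: FakeDevice stands in for the D3D device, and RenderCommandQueue, submit and execute_all are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Stand-in for the D3D device; only the owning thread may touch it.
struct FakeDevice {
    int draw_calls = 0;
};

class RenderCommandQueue {
public:
    using Command = std::function<void(FakeDevice&)>;

    // The constructing thread becomes the only thread allowed to execute.
    RenderCommandQueue() : owner_(std::this_thread::get_id()) {}

    // Any thread may record work; the lock is held only for the push.
    void submit(Command cmd) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(cmd));
    }

    // Called once per frame by the owner thread; returns commands executed.
    std::size_t execute_all(FakeDevice& device) {
        assert(std::this_thread::get_id() == owner_);
        std::vector<Command> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);
        }
        for (Command& cmd : batch) cmd(device);  // device used without a lock
        return batch.size();
    }

private:
    std::thread::id owner_;
    std::mutex mutex_;
    std::vector<Command> pending_;
};
```

The application pays for one short lock per submitted command and one per frame on the owner side. The device itself is never locked, which is the cost the D3DCREATE_MULTITHREADED flag would otherwise add to every API call.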

Making the Drivers Work for You!

[Diagram: App, D3D Runtime and D3D Driver stages]

Pack your DrawPrimitive2 calls together

Frequently creating & destroying shaders, VB, IB, and surfaces will impact performance

Avoid allocating too many system memory resources

DrawPrimitiveUP or DrawIndexedPrimitiveUP

Potential 20%+ speed gain.

Can be disabled by application behaviour.

Producer & Consumer threads dispatch commands to GPU

Making the Drivers Work for You!

Avoid any calls that return GPU state information; they require a CPU thread synchronization

Driver Queries are OK (calls are asynchronous)

Do not lock threads to a specific CPU!

Group all resource updates (texture and vertex) together once per frame. Beginning or end is fine, just don’t scatter them among drawing calls

Minimize use of any locks/unlocks

System Memory Vertex Buffers

D3DUSAGE_DYNAMIC, use with D3DUSAGE_WRITEONLY

Lock with D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE
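One common way to combine the two lock flags is to append into a dynamic buffer with no-overwrite locks and switch to a discard lock only when the buffer wraps. The sketch below models just that decision in portable C++ so it can run anywhere. LockMode is a hypothetical stand-in for the real D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE flags, and DynamicBufferCursor is an illustrative name, not a D3D type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE.
enum class LockMode { Discard, NoOverwrite };

struct LockRequest {
    std::size_t offset;  // byte offset to lock at
    LockMode mode;       // flag to pass with the lock
};

// Tracks the append position inside one dynamic, write-only vertex buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity) : capacity_(capacity) {}

    // Reserve `bytes` for this batch and pick the matching lock mode.
    LockRequest reserve(std::size_t bytes) {
        assert(bytes <= capacity_);
        LockMode mode = LockMode::NoOverwrite;  // append: earlier data untouched
        if (offset_ + bytes > capacity_) {
            offset_ = 0;               // buffer is full: start over
            mode = LockMode::Discard;  // ask for fresh memory instead of stalling
        }
        LockRequest request{offset_, mode};
        offset_ += bytes;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t offset_ = 0;
};
```

With a 100-byte buffer, three 40-byte reservations yield offsets 0 and 40 with no-overwrite, then offset 0 with discard.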

Threading Issues

Race Conditions between threads.

Object Updates

Creation/deletion of objects

False sharing of data between threads.

Accessing hardware resources.

[Diagram: timeline with the main thread on frame n and the render thread on frame n-1. Labels: Move Object X, Render Object X, Delete Object Y, Render Object Y]
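False sharing is the least visible of these issues, so a small layout sketch may help. It assumes a 64-byte cache line, which is typical for x86 but not guaranteed, and the struct and member names are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical for x86 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Packed layout: both counters are likely to share one cache line, so a write
// by the main thread invalidates the line the render thread is using.
struct PackedCounters {
    std::atomic<std::uint32_t> main_thread_updates{0};
    std::atomic<std::uint32_t> render_thread_draws{0};
};

// Padded layout: each counter is aligned to its own cache line.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<std::uint32_t> main_thread_updates{0};
    alignas(kCacheLine) std::atomic<std::uint32_t> render_thread_draws{0};
};

static_assert(sizeof(PackedCounters) <= kCacheLine, "packed counters fit in one line");
static_assert(alignof(PaddedCounters) == kCacheLine, "padded struct is line-aligned");
static_assert(sizeof(PaddedCounters) == 2 * kCacheLine, "one line per counter");
```

The padded version spends 128 bytes instead of 8. That trade is usually only worth making for data that two threads write every frame.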

Threading Options

[Diagram: two options. Pipeline: front-end logic and back-end render run in parallel and meet at end of frame (EOF). Consumer thread: front-end logic feeds a command queue that the back-end render drains.]

Avoiding the Issues

Use an update queue, lightweight (lock-free?)

Make duplicate objects/ double-buffered

Reference count objects
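Reference counting addresses the "delete object Y while it is still being rendered" race. In the sketch below, the frame handed to the renderer holds its own references, so a removal on the main thread cannot destroy an object that a pending frame still draws. Mesh, FramePacket and Scene are hypothetical names, and std::shared_ptr stands in for whatever reference-counting scheme an engine actually uses.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

struct Mesh {
    int id;
};

// Snapshot of everything the render thread needs for one frame.
// Holding shared_ptrs keeps each mesh alive until the frame is drawn.
struct FramePacket {
    std::vector<std::shared_ptr<const Mesh>> draw_list;
};

class Scene {
public:
    void add(int id) { meshes_[id] = std::make_shared<Mesh>(Mesh{id}); }

    // The main thread may remove an object at any time; this only drops
    // the scene's own reference.
    void remove(int id) { meshes_.erase(id); }

    // Built by the main thread at end of frame and handed to the renderer.
    FramePacket build_packet() const {
        FramePacket packet;
        for (const auto& entry : meshes_) packet.draw_list.push_back(entry.second);
        return packet;
    }

private:
    std::map<int, std::shared_ptr<Mesh>> meshes_;
};
```

The mesh is destroyed only when the last packet referencing it is released, which happens on whichever thread drops that packet.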

Buffering Dynamic Data

Partially buffered locks consume more video memory.

Fully buffered locks consume more system memory and have an associated CPU cost for memory copying.

[Diagram, panels "Fully buffered locks" and "Partially buffered locks": the main thread (frame n) modifies vertex buffer 0 while the render thread (frame n-1) renders the object from vertex buffer 1; one frame later the main thread (frame n+1) modifies vertex buffer 1 while the render thread (frame n) renders from vertex buffer 0.]

Sub Threading Options

[Diagram: front-end logic and back-end render, meeting at end of frame (EOF), each handing jobs to a job queue]

Job Queue offloads

Software Visibility Culling

Particle generation

Character Skinning

Procedural updates

Reduces path size through both front and back ends
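A job queue of this kind can be quite small. The sketch below is one possible shape, assuming jobs are independent of each other. run_jobs and skin_slice are illustrative names, and the "skinning" is just a scale so the example stays self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs every job in `jobs` exactly once, spread over `worker_count` threads.
// Workers claim the next job index with an atomic counter, so the queue
// itself needs no lock. Jobs must not depend on each other.
inline void run_jobs(const std::vector<std::function<void()>>& jobs,
                     unsigned worker_count) {
    assert(worker_count > 0);
    std::atomic<std::size_t> next{0};
    auto worker = [&jobs, &next] {
        for (;;) {
            const std::size_t index = next.fetch_add(1);
            if (index >= jobs.size()) return;  // queue drained
            jobs[index]();
        }
    };
    std::vector<std::thread> threads;
    for (unsigned i = 1; i < worker_count; ++i) threads.emplace_back(worker);
    worker();  // the calling thread helps instead of idling
    for (std::thread& t : threads) t.join();
}

// Example job: skin one contiguous slice of vertices (here: just scale them).
inline void skin_slice(std::vector<float>& vertices, std::size_t begin,
                       std::size_t end, float scale) {
    for (std::size_t i = begin; i < end; ++i) vertices[i] *= scale;
}
```

A production engine would keep the worker threads alive across frames rather than creating them per call. They are created here only to keep the sketch short.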

Threading the DX API

Similar to DX9 threading in the runtime

Potentially repeating the same work

Potential to move simple API code out of main thread, i.e. state management

DX10 has lower runtime costs

[Diagram: DX9 Render System → D3D9Wrapper, D3DDevice9 Wrapper, D3DVertexBuffer9 Wrapper → D3D9, D3DDevice9, D3DVertexBuffer9 → Graphics Driver → Graphics Device]

Thread | DX9 | DX9 with DX API thread | DX10 | DX10 with DX API thread
Main Thread | 46.46 (15.82% in DX9) | 39.08 | 63.84 (28.39% in DX10+Driver) | 45.72
DX API Thread | - | 7.38 | - | 18.12
NVIDIA driver | 23.02 | 23.02 | - | -
Physics | 10.91 | 10.91 | 13.95 | 13.95
Other threads | 19.35 | 19.35 | 21.88 | 21.88

Theoretical increase*: DX9 16%, DX10 39%

*Theoretical increase based on amount of API work offloaded, does not include threading overhead.

Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2 GB memory.

Case study: GRIN’s engine

David Potages, Senior Engine Architect, GRIN, February 2008

Performance figures discussed in this case study refer to a pre-release version of the game. They are subject to change before release and are for illustration only.

Quick Engine Overview

3rd generation of threaded engine

2nd generation of threaded renderer

Used in several games

Quick Engine Overview

Not game specific: game code in Lua scripts

Allows hot-reload, no link time, custom debugger

But single threaded, a lot of memory allocations

Deferred rendering

DX9, with DX10 being implemented

Libraries:

PhysX™

OpenAL

Bink*

All the technology choices have great impact on the possible parallelization!

Why multi-threading?

[Chart: frame time split into Application, D3D Runtime, Driver and Other]

Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2 GB memory, Windows Vista™ Ultimate.

Poor CPU usage

Can go down to 30%

A lot of time spent in D3D/driver

35-45%*

But a lot of the application time is dedicated to rendering

Up to 37%*

Grand total of 53%* of frame with D3D/driver

Why multi-threading the renderer?

Simplified pipeline (ST version)

[Diagram: single-threaded pipeline with the stages Culling, Particles batch optimizations, Rendering, World update, Script update, Sound and Network; libraries shown: Lua*, PhysX™, OpenAL*]

Rendering is an easy target for multithreading: low system dependencies, 53% of frame time

But easier said than done!

Some systems or the drivers they use can take advantage of multi-cores

Rendering has low dependencies with other systems, but big data dependencies

Implementation Details

Main thread

Entity/World updates, Animations, Input, Network, Lua, SoundSystem, Physics (main)

Renderer thread

Culling (including software occlusion queries)

Particle effects batch optimizations

RenderDevice (D3D)

Win32 messaging

Other

File streaming

PhysX™ threads

Driver threads

Implementation Details

[Diagram: front-end logic and back-end render timelines, with Flush, Sync and Idle periods marked]

Messages sent to the renderer

Non blocking:

render_scene

render_frame

update_window

Etc

Blocking:

flush_pipe

flush_pipe forces the renderer to execute all the queued jobs => synchronization point

Used between frames on main thread

Can be used to ensure that data (eg Textures) is ready
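The split between non-blocking messages and a blocking flush_pipe can be sketched with a worker thread and two condition variables. This is an assumption-laden illustration, not GRIN's code: Renderer and post are hypothetical names, only flush_pipe comes from the slide, and messages are plain callables rather than typed render_scene or render_frame requests.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

class Renderer {
public:
    Renderer() : thread_([this] { run(); }) {}

    ~Renderer() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        wake_.notify_all();
        thread_.join();  // remaining messages are executed before exit
    }

    // Non-blocking: queue the message and return immediately.
    void post(std::function<void()> message) {
        assert(message != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(message));
        }
        wake_.notify_one();
    }

    // Blocking: returns once every queued message has been executed.
    void flush_pipe() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return queue_.empty() && !busy_; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return quit_ || !queue_.empty(); });
            if (queue_.empty()) return;  // quit requested and nothing left
            std::function<void()> message = std::move(queue_.front());
            queue_.pop_front();
            busy_ = true;
            lock.unlock();
            message();  // executed without holding the lock
            lock.lock();
            busy_ = false;
            if (queue_.empty()) idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::deque<std::function<void()>> queue_;
    bool busy_ = false;
    bool quit_ = false;
    std::thread thread_;  // declared last so it starts after the other members exist
};
```

Every flush_pipe call stalls the main thread until the renderer catches up, which is why the slide limits it to frame boundaries and to cases where data must be ready.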

Implementation Details

States need to be mirrored

State changes are queued, and updated in the freeze

The proper state is returned depending on the calling thread

This avoids contention when data is accessed in the renderer, but mirror only what is required
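State mirroring can be sketched as one value with two copies. The version below is a simplification under assumptions: a dirty flag stands in for the queue of changes, freeze is assumed to run only at the synchronization point while the renderer is idle, and MirroredState and render_view are hypothetical names.

```cpp
#include <cassert>
#include <thread>
#include <utility>

// One piece of state that both threads read; names are illustrative.
template <typename T>
class MirroredState {
public:
    MirroredState(T initial, std::thread::id render_thread)
        : main_copy_(initial), render_copy_(initial), render_thread_(render_thread) {}

    // Main thread: record the change; the renderer does not see it yet.
    void set(T value) {
        assert(std::this_thread::get_id() != render_thread_);
        main_copy_ = std::move(value);
        dirty_ = true;
    }

    // Each thread gets its own copy, so reads need no lock.
    const T& get() const {
        return std::this_thread::get_id() == render_thread_ ? render_copy_
                                                            : main_copy_;
    }

    // What the renderer currently sees (useful for inspection and tests).
    const T& render_view() const { return render_copy_; }

    // Call only at the synchronization point, while the renderer is idle.
    void freeze() {
        if (dirty_) {
            render_copy_ = main_copy_;
            dirty_ = false;
        }
    }

private:
    T main_copy_;
    T render_copy_;
    std::thread::id render_thread_;
    bool dirty_ = false;
};
```

Each mirrored value costs a second copy plus a copy operation at the freeze, which is one reason to mirror only what the renderer actually reads.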

Results

Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2 GB memory.

Better CPU usage 40-60%*

Better threads workload

Results: Rendering Performance

Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2 GB memory, Windows Vista™ Ultimate.

Better FPS

4C MT is 1.88x faster than 1C *

4C MT is 1.20x faster than 4C ST *

Analysis

Remember that the drivers are partially threaded: we save up to 17% plus the percentage of D3D/driver time that is not threaded

Close to 1.20x

If D3D/driver were completely threaded, the new frame time would be 1 - 0.17 = 83% of the old one, and the scale-up would be: fps new / fps old = time old / time new = time old / (time old * 0.83) = 1.20

Maximum scale-up vs. 1C is 2.12x

Context switches, cache misses and contention slow us down.

Render-thread bound

Effect on a low physics/gameplay workload
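The 1.20x estimate follows from treating the saved share of the frame as removed from the critical path. Writing f for that share (an assumed symbol; f = 0.17 in the analysis above) and assuming the rest of the frame is unchanged:

```latex
\[
\text{scale-up}
  = \frac{\text{fps}_{\text{new}}}{\text{fps}_{\text{old}}}
  = \frac{t_{\text{old}}}{t_{\text{new}}}
  = \frac{t_{\text{old}}}{(1-f)\,t_{\text{old}}}
  = \frac{1}{1-f},
\qquad
f = 0.17 \;\Rightarrow\; \frac{1}{0.83} \approx 1.20
\]
```

As a side observation that the slides do not state: putting the 53% D3D/driver share quoted earlier into the same formula gives 1/0.47 ≈ 2.13, which is close to the 2.12x maximum quoted here.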

Improvements

Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2 GB memory, Windows Vista™ Ultimate.

Threading some parts of the render thread

E.g.: culling (~9-25%* of the render thread)

Reducing contentions

Mainly memory

Batch more

E.g.: Effects

Triple buffering?

Scalability

We can push for instance more physics/effects, while we are render-thread bound, or more AI

But hard to find the right balance between CPU and GPU workload!

Example: falling cars aka pushing more physics

Scalability

Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2 GB memory, Windows Vista™ Ultimate.

~256 cars falling and bouncing

4C MT is 1.42x* faster than 4C ST, and 3.23x* faster than 1C

PhysX™ helped us a lot to propagate the workload, but it occupies the other cores quite heavily, thus preventing D3D/drivers from taking advantage of them.

Rendering overhead was not that big with the additional units since they batch well.

Issues

A proper benchmark system is required

A fly-through benchmark is not enough!

The CPU & GPU workloads vary a lot on different maps

Easy to forget a piece of data that needs to be mirrored

Lock-free algorithms are nice, but must be used with care

Memory contention + cache misses + false sharing

Behaviour of drivers varies quite a lot

Summary/Conclusion

The graphics pipeline is still very CPU intensive

Future CPUs will have an increasing number of logical processors

It is worth threading your renderer as much as possible if you want to be able to push more things in your game

It is hard to balance the workloads though; you need to profile the whole system

Making the most of the graphics driver is essential

References:

Accurately Profiling Direct3D API Calls. msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx

Debugging Tools and Symbols: Getting Started. www.microsoft.com/whdc/devtools/debugging/debugstart.mspx

Threading the OGRE3D Render System. www.intel.com/cd/ids/developer/asmo-na/eng/dc/games/331359.htm

 
