Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars

Published on May 19, 2008

Author: psteinb

Source: slideshare.net

Description

This talk briefly discusses performance threading of the QUAKE 4 and Enemy Territory: QUAKE Wars engines. It covers the issues involved in parallelizing serial code, working with different back ends, load balancing, and design considerations. It also offers some insight into extracting parallelism in game engines on next-generation hardware.

Get a first-time look at Havok Behavior 5.5, demonstrating how the Havok Behavior Tool combines the fidelity of traditional animation assets with powerful physical and procedural animation techniques in a single creative environment. View Havok’s extensible end-to-end character content creation pipeline spanning physics, animation, and real-time behavior asset composition and conditioning.

Threading Game Engines - QUAKE 4 & Enemy Territory QUAKE Wars Anu Kalra - Intel Corporation Jan Paul van Waveren - id Software Feb 21, 2006

Agenda

Concurrency In Games Today

Analysis of QUAKE 4

Renderer Threading QUAKE 4 and ETQW

AI & Mega texture Threading in ETQW

Common Performance Issues & Workarounds

Building Scalability into threading design

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

Intel may make changes to specifications and product descriptions at any time, without notice.

All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2008 Intel Corporation.

Concurrency In Games

Compute power in the consumer space has increased dramatically in the last few years with multi-core processors.

The game industry has started to adopt concurrent programming.

Most multithreaded games today still follow the first generation of parallelism, i.e. threading based on functional decomposition.

The game is broken up into subsystems, each of which runs on its own thread: typically rendering and AI, sometimes physics too.

QUAKE 4 Engine

The engine is split up into 3 main components:

The QUAKE 4 Engine (exe)

idlib: a common library for all id stuff (math, timing, algorithms, memory management, parsers, …), linked statically and very well optimized with SSE, SSE2, and SSE3.

The Game DLL: implements classes specific to the game, like weapons, vehicles, characters, script engine, AI, game physics, …, and calls into the QUAKE 4 Engine for all of the lower-level work.

QUAKE 4 Analysis

As per the VTune analysis, QUAKE 4 was:

CPU bound

Predominantly single threaded

Spending roughly equal time in the driver and the engine (41% and 49% respectively)

Each of the major hotspots consumed 2-4% of CPU time.

A peek into the source revealed:

QUAKE 4 had a good separation between the renderer front end and back end.

Most of the time spent in the OpenGL driver came from the renderer back end.

Constraints

Threading an existing engine

Time frame: 4-6 months

Target platform: P4 dual core (3.2 GHz)

Single-core performance difference had to be less than 5%

Threading

(Diagram: Frontend on Thread 1 and Backend on Thread 2, connected by a Cmd Queue.)

To get the most performance in the limited time available, we decided to functionally decompose the 2 largest blocks.

Split the renderer into a front end and a back end.

The back end was made to run on its own thread.

The front end and back end communicated through command queues and synchronization events.
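The command-queue arrangement described above can be sketched roughly as follows. This is a minimal illustration in portable C++, not the engine's actual code: `CommandQueue`, `RunFrames`, and the use of `std::function` as the command type are assumptions made for the example, and a condition variable stands in for the synchronization events mentioned on the slide.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical command queue: the front end pushes render commands,
// the back end pops and executes them on its own thread.
class CommandQueue {
public:
    void Push(std::function<void()> cmd) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            commands_.push(std::move(cmd));
        }
        ready_.notify_one();
    }

    // Signals that no more commands will arrive.
    void Close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until a command is available; returns false once the queue
    // has been closed and fully drained.
    bool Pop(std::function<void()>& cmd) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !commands_.empty(); });
        if (commands_.empty()) return false;
        cmd = std::move(commands_.front());
        commands_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> commands_;
    bool closed_ = false;
};

// Front end runs on the calling thread, back end on a second thread.
// Returns the frame numbers in the order the back end executed them.
std::vector<int> RunFrames(int frameCount) {
    assert(frameCount >= 0);
    CommandQueue queue;
    std::vector<int> rendered;  // touched only by the back end until join()

    std::thread backend([&queue] {
        std::function<void()> cmd;
        while (queue.Pop(cmd)) cmd();
    });

    for (int frame = 0; frame < frameCount; ++frame) {
        queue.Push([&rendered, frame] { rendered.push_back(frame); });
    }
    queue.Close();
    backend.join();
    return rendered;
}
```

Calling `RunFrames(3)` executes three stand-in commands on the back-end thread in submission order. A single queue with one consumer keeps commands ordered, which matters for a renderer whose back end owns the graphics context.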

Threading

The frame was prepared by the front end and handed over to the back end, while the front end prepared the next frame.

Data specific to a frame was double buffered.

Data had to be allocated and freed safely.

The front end managed allocation and deallocation of shared data. Data to be freed was kept until the back end was done and cleared at the front end just before reuse.

Subsystems that were not thread safe had to be made thread safe: model classes, animation, shadows, texture subsystems, deforms, loaders, writers, vertex caches, …
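A rough sketch of the double-buffering and deferred-free idea follows. The names `FrameData` and `FrameBuffers` are hypothetical, and the sketch leaves out the synchronization a real engine needs to guarantee the back end has finished with a slot before the front end reuses it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-frame data: memory allocated while preparing one frame.
struct FrameData {
    int frameNumber = -1;
    std::vector<std::vector<float>> allocations;  // frame-lifetime memory
};

class FrameBuffers {
public:
    // Front end: start preparing a frame. The slot being reused held frame
    // (frameNumber - 2), which the back end must already have finished, so
    // its data is released here, just before reuse.
    FrameData& BeginFrame(int frameNumber) {
        assert(frameNumber >= 0);
        FrameData& frame = frames_[frameNumber % 2];
        frame.allocations.clear();  // deferred free
        frame.frameNumber = frameNumber;
        return frame;
    }

    // Back end: read-only view of a frame the front end has handed over.
    const FrameData& FrameToRender(int frameNumber) const {
        assert(frameNumber >= 0);
        return frames_[frameNumber % 2];
    }

private:
    FrameData frames_[2];  // one slot being written, one being rendered
};
```

With two slots, the data of frame n stays valid while the front end prepares frame n+1 and is only cleared when frame n+2 begins, which mirrors the "kept until the back end was done" rule above.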

Synchronization

(Diagram: the Front End works on Frame n, Frame n+1, Frame n+2 while the Backend works on Frame n-1, Frame n, Frame n+1, one frame behind.)

Issues with Threading

Debugging the threaded code was the hardest part.

Issues could be broadly categorized into 3 major types:

Data race conditions

Object lifetime issues

OpenGL context issues

Moved all the time-critical OpenGL calls to the back end and used a synchronization mechanism for the others.

Added a real-time toggle to turn threading on and off, along with a lock-step mode in which the front end and back end run on separate threads but in lock step.

Used synchronization points to slowly and painfully eliminate data races.

Added lots of initialization and destruction code to deal with lifetime issues.

Needed to batch certain commands to improve performance.

Performance Improvements

Beta timeframe

Multi-Threaded Drivers

(Diagram labels: Main Thread, Game Engine Loop, Sound Thread, OpenGL/D3D, Graphics Driver Front End, Driver FIFO, Driver Thread, 3D HW.)

Current Performance

Renderer Threading with ETQW

The whole renderer runs in a separate thread

More work being done on the renderer thread

Culling and shadow volume construction

Reduces amount of memory being buffered and shared between threads

Triangle meshes are not double buffered

Better splitting of work on 2 cores

Works better with multi-threaded drivers

ETQW

Quake III Arena

Renderer back-end runs in a separate thread

Very similar to QUAKE 4

DOOM III

Initially had the same threading as Quake III Arena.

Very much memory bound.

We actually removed the threading.

Instead, we SIMD-optimized the rendering pipeline.

The pipeline is optimized for cache usage.

http://softwarecommunity.intel.com/articles/eng/2773.htm

ETQW Threading overview

(Diagram labels: Game Logic, Bot AI, Sound Engine, Renderer, Sound Driver, Graphics Driver, MegaTexture Transcoding, MegaTexture Streaming.)

Mega Texture Streaming

The Mega Texture streaming thread dynamically sorts tile read requests.

This thread is not doing any significant amount of work and mostly waits in place while data is being read from disk.

The streaming is optimized using a texture database with an optimized layout to minimize seek times.

The streaming thread reads 128 kB non-cached sector aligned blocks of data for optimal streaming from a DVD without polluting file system caches.
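The two mechanical pieces of this slide, ordering pending reads and rounding them to whole 128 kB blocks, might look like the sketch below. The `TileRequest` and `BlockRange` types are hypothetical, and sorting by file offset is an assumption: the talk says requests are sorted dynamically but does not state the sort key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kBlockSize = 128 * 1024;  // 128 kB read granularity

// Hypothetical tile read request: a byte range in the texture database file.
struct TileRequest {
    std::uint64_t offset;
    std::uint64_t size;
};

struct BlockRange {
    std::uint64_t firstByte;  // multiple of kBlockSize
    std::uint64_t byteCount;  // multiple of kBlockSize
};

// Expands a request to the whole 128 kB blocks that cover it.
BlockRange AlignToBlocks(const TileRequest& request) {
    assert(request.size > 0);
    const std::uint64_t first = request.offset / kBlockSize * kBlockSize;
    const std::uint64_t end =
        (request.offset + request.size + kBlockSize - 1) / kBlockSize * kBlockSize;
    return {first, end - first};
}

// Orders pending requests by file offset (assumed sort key) so that reads
// sweep through the file in one direction, which helps limit seeking.
void SortByOffset(std::vector<TileRequest>& requests) {
    std::sort(requests.begin(), requests.end(),
              [](const TileRequest& a, const TileRequest& b) {
                  return a.offset < b.offset;
              });
}
```

Block-aligned offsets and sizes are also what unbuffered file I/O typically requires; on Windows, for example, files opened with `FILE_FLAG_NO_BUFFERING` must be read at sector-aligned offsets in sector-multiple sizes.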

Mega Texture Transcoding

Real-Time conversion from JPEG-like format to DXT.

The transcoding uses highly optimized SIMD code and as such this thread does not consume a whole lot of CPU time.

On systems based on the Core 2 microarchitecture the mega texture transcoding thread typically consumes less than 15% CPU time.

Real-Time Texture Streaming & Decompression http://softwarecommunity.intel.com/articles/eng/1221.htm

Real-Time DXT Compression http://www.intel.com/cd/ids/developer/asmo-na/eng/324337.htm

Sound Engine

The sound system performs spatialization.

Decompresses OGG sounds in real-time.

The sound thread does not consume a whole lot of CPU (typically < 5% on a Core 2).

Game Logic

The game logic runs at a fixed 30 Hz.

The game code consumes quite a bit of CPU.

A lot of this is collision detection and physics.

The game logic itself typically involves lots of branchy code and can be expensive as well.

Bot AI

The development of ETQW AI/bots did not start at the beginning of the project.

On one hand this was a good thing because the AI implements thousands of game dependent rules that would have to change as the game is changed and tweaked during development.

On the other hand the ETQW AI was developed in about a year which really is a short period of time to develop AI for a game with the complexity of ETQW.

Bot AI

The AI threading in ETQW was designed and planned from the start.

As a result the threading had little impact on the development time.

The threading actually forced us to implement AI with clear data separation from the game code because the data has to be buffered.

This is a good thing!

Bot AI

The path and route finding systems only run in the AI thread and as such do not need to be "thread safe".

The collision detection system had to be made thread safe.

At any point in time the AI can query the current collision state of the world.

Unfortunately this introduces a source of non-determinism because the AI can query the collision state while the physics, which runs in the game thread, is moving objects around at the same time.

Bot AI

#include <windows.h>  // HANDLE, SetEvent, SignalObjectAndWait

static const int MIN_FRAME_DELAY = 0;
static const int MAX_FRAME_DELAY = 4;

HANDLE gameSignal;
HANDLE aiSignal;
int gameFrameNum;
int lastAIGameFrameNum;

void GameThread() {
    for ( ; ; ) {
        SetCurrentGameOutputState();
        AdvanceWorld();
        SetCurrentGameWorldState();
        gameFrameNum++;
        // let the AI thread know there's another game frame
        ::SetEvent( gameSignal );
        // wait if the AI thread is falling too far behind
        while( lastAIGameFrameNum < gameFrameNum - MAX_FRAME_DELAY ) {
            ::SignalObjectAndWait( gameSignal, aiSignal, INFINITE, FALSE );
        }
    }
}

Bot AI

void AIThread() {
    for ( ; ; ) {
        // let the game thread know another AI frame has started
        ::SetEvent( aiSignal );
        // never run more AI frames than game frames
        while( lastAIGameFrameNum >= gameFrameNum - MIN_FRAME_DELAY ) {
            ::SignalObjectAndWait( aiSignal, gameSignal, INFINITE, FALSE );
        }
        lastAIGameFrameNum = gameFrameNum;
        SetCurrentAIWorldState();
        AdvanceAI();
        SetCurrentAIOutputState();
    }
}

Bot AI

The last optimization we did in ETQW cut AI CPU usage in half and it took less than a minute to implement. We simply changed MIN_FRAME_DELAY from zero to one.

With the game logic running at 30 Hz, the AI now thinks on every second game frame, which reduces its think frequency to 15 Hz.

In Quake III Arena the bots were only thinking at 10 Hz.
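The halving can be checked with a small single-threaded model of the wait condition in `AIThread`. This is only a sketch: `CountAIFrames` is a hypothetical helper, and it assumes the AI always finishes its frame before the next game frame arrives, so the only thing holding it back is `MIN_FRAME_DELAY`.

```cpp
#include <cassert>

// Counts how many AI frames run during `gameFrames` game frames for a given
// minimum frame delay, assuming the AI never falls behind on its own.
int CountAIFrames(int gameFrames, int minFrameDelay) {
    assert(gameFrames >= 0 && minFrameDelay >= 0);
    int lastAIGameFrameNum = 0;
    int aiFrames = 0;
    for (int gameFrameNum = 1; gameFrameNum <= gameFrames; ++gameFrameNum) {
        // Inverse of the AI thread's wait condition: the AI proceeds only
        // when the game is more than minFrameDelay frames ahead of it.
        if (lastAIGameFrameNum < gameFrameNum - minFrameDelay) {
            lastAIGameFrameNum = gameFrameNum;
            ++aiFrames;
        }
    }
    return aiFrames;
}
```

Over one second of game logic (30 frames), a delay of zero gives 30 AI frames and a delay of one gives 15, matching the 30 Hz and 15 Hz figures.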

Threading On/Off

Always implement an option to switch between threaded mode and non-threaded in real-time.

This is very useful to see the true performance difference.

Also makes it much easier when debugging the threaded code.
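One way such a switch could be structured is sketched below, under the assumption that the same front-end and back-end functions are called in both modes and only the scheduling differs. `g_useThreading`, `RenderBackend`, `PrepareFrame`, and `RunFrame` are hypothetical stand-ins, not engine functions.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical runtime switch, e.g. bound to a console variable.
std::atomic<bool> g_useThreading{true};

// Stand-in for back-end work: sums the frame's draw data.
int RenderBackend(const std::vector<int>& drawData) {
    return std::accumulate(drawData.begin(), drawData.end(), 0);
}

// Stand-in for front-end work that prepares a frame.
std::vector<int> PrepareFrame(int frameNumber) {
    assert(frameNumber >= 0);
    return std::vector<int>(4, frameNumber);
}

// Renders `current` and prepares the next frame, either overlapped on two
// threads or serially on one, depending on the toggle.
int RunFrame(const std::vector<int>& current, int nextFrameNumber,
             std::vector<int>& next) {
    if (g_useThreading.load()) {
        std::future<int> result = std::async(
            std::launch::async, [&current] { return RenderBackend(current); });
        next = PrepareFrame(nextFrameNumber);  // overlaps with the back end
        return result.get();
    }
    const int result = RenderBackend(current);
    next = PrepareFrame(nextFrameNumber);
    return result;
}
```

Because both paths call identical work functions, flipping `g_useThreading` between frames compares like with like for timing, and a bug that only appears in threaded mode points at synchronization rather than at the work itself.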

Common Issues

Load imbalance

Underutilization of processors

Gustafson’s law: increase the amount of parallel work

Adding new features in games like fracture, smoke, cloth, procedural texture

Amdahl’s law: reduce serial time to improve scaling

Parallelize code as far as possible

Vectorize serial code

Reduce time spent in a serial memory allocator

Oversubscription

Different threaded subsystems

Threading at various levels of the application stack

Threaded middleware
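The two laws named above can be written as small functions to make the trade-off concrete. This uses the standard textbook forms; `parallelFraction` is the share of the work that can run in parallel, and the function names are just illustrative.

```cpp
#include <cassert>

// Amdahl's law: speedup of a fixed-size workload on `cores` cores when
// `parallelFraction` of it can be parallelized.
double AmdahlSpeedup(double parallelFraction, int cores) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0 && cores >= 1);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / cores);
}

// Gustafson's law: scaled speedup when the parallel part of the workload
// grows with the number of cores and the serial part stays fixed.
double GustafsonSpeedup(double parallelFraction, int cores) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0 && cores >= 1);
    return (1.0 - parallelFraction) + parallelFraction * cores;
}
```

For a frame that is half serial, Amdahl's law gives about 1.33x on two cores and 1.6x on four, which is why shrinking the serial part matters. Gustafson's law gives 1.5x on two cores for the same fraction, which is the argument for spending extra cores on more parallel work such as additional effects.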

Scalability

PCs have a broad range of capabilities, from CPU to graphics.

Even with a fixed target platform, it’s hard to load balance for real game play.

Scene complexity, interactivity, physics vary from scene to scene

Need to think how to make best use of resources

Granularity Vs Load Balancing

Common threading infrastructure with priorities/QoS.

Alternate Threading Paradigms

(Diagram labels: AI, AI, AI, AI, Physics, Physics, Physics, Physics, Renderer.)

Data Decomposition
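As a rough illustration of data decomposition, the sketch below splits one subsystem's data into contiguous chunks and updates each chunk on its own thread. The `Agent` type and `UpdateAgents` function are hypothetical; spawning threads per call keeps the example short, whereas a real engine would more likely reuse a pool of workers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-agent state.
struct Agent {
    float position;
    float velocity;
};

// Advances every agent by dt. The array is split into contiguous chunks,
// one per worker; chunks do not overlap, so no locking is needed.
void UpdateAgents(std::vector<Agent>& agents, float dt, unsigned workerCount) {
    assert(workerCount >= 1);
    const std::size_t chunk = (agents.size() + workerCount - 1) / workerCount;
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(agents.size(), w * chunk);
        const std::size_t end = std::min(agents.size(), begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([&agents, dt, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                agents[i].position += agents[i].velocity * dt;
            }
        });
    }
    for (std::thread& t : workers) t.join();
}
```

The same work function scales from one worker to many by changing `workerCount`, which is the property that functional decomposition lacks.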

Alternate Threading Paradigms

(Diagram labels: Frame; AI, Physics, Render FE, Render BE; AI, Physics, Render FE; Task Stealing; T0, T1.)

Task/Work decomposition / Pipeline
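A very small sketch of task decomposition follows. It uses a single shared task list with an atomic cursor, which is a simpler scheme than the task stealing shown on the slide: idle workers pull the next task instead of stealing from another worker's queue. `RunTasks` is a hypothetical name, and tasks are assumed to be independent of one another.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs every task exactly once on `workerCount` threads. Each worker claims
// the next unclaimed index, so faster workers naturally take more tasks.
void RunTasks(const std::vector<std::function<void()>>& tasks,
              unsigned workerCount) {
    assert(workerCount >= 1);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        workers.emplace_back([&tasks, &next] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);
                if (i >= tasks.size()) return;  // all tasks claimed
                tasks[i]();
            }
        });
    }
    for (std::thread& t : workers) t.join();
}
```

Because work is claimed one task at a time, uneven task costs balance out across workers without any up-front partitioning; the cost is contention on the shared cursor when tasks are very small.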

Design with threading in mind

It is a lot easier to thread code that’s designed well.

Reduce the coupling (data dependence) between subsystems.

Make them as asynchronous as possible.

Factor a given subsystem into data and operations performed on the data (iterators).

Make sure that data classes don’t store any iterator data and are reentrant.

Have a mechanism to ensure validity of shared, mutable data.

Intel’s Threading Building Blocks (TBB) has some good resources, such as thread-safe containers, an efficient memory allocator, and generic parallel algorithms (parallel for, …), and it’s open source.

Summary

Threading game engines is not a trivial task: game engines are very complex pieces of code with a relatively short shelf life.

Game engines naturally lend themselves to functional decomposition but interdependence between the various subsystems can cause excessive synchronization and performance overheads.

Functional decomposition leads to load imbalance, and performance is often limited by the main thread. We need to investigate alternate paradigms like task queues to improve load balance.

We need to design and implement debugging aids into the threading infrastructure.

Interaction with the GPU makes debugging harder

Contact Info

For more info, see our Graphics, Game Development and Threading resources at: http://softwarecommunity.intel.com/
