IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros, BU)


Published on January 26, 2009

Author: npinto

Source: slideshare.net

Description

More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009

Note that some slides were borrowed from NVIDIA.

CUDA Tricks and Computational Physics. Kipton Barros, Boston University. In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, and J. Ellowitz.

High energy physics has huge computational needs. Large Hadron Collider, CERN (27 km).

A request: please question/comment freely during the talk. A disclaimer: I’m not a high energy physicist.

View of the CMS detector at the end of 2007 (Maximilien Brice, © CERN).

15 Petabytes to be processed annually. View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)

The “Standard Model” of Particle Physics

I’ll discuss Quantum ChromoDynamics. Although it’s “standard”, these equations are hard to solve. Big questions: why do quarks appear in groups? What was the physics during the big bang?

Quantum ChromoDynamics: the theory of nuclear interactions (bound by “gluons”). Extremely difficult: one must work at the level of fields, not particles, and the calculation is quantum mechanical.

Lattice QCD: Solving Quantum Chromodynamics by Computer Discretize space and time (place the quarks and gluons on a 4D lattice)

Spacetime = 3+1 dimensions, giving a lattice of 32^4 ~ 10^6 sites. Quarks live on sites (24 floats each); gluons live on links (18 floats each). Total system size: 4 bytes/float x 32^4 sites x (24 + 4 x 18) floats ~ 384 MB.

Lattice QCD: the inner loop requires repeatedly solving a linear equation (schematically DW x = b, where x is a quark field and DW depends on the gluon field). DW is a sparse matrix with only nearest-neighbor couplings. DW needs to be fast!

DW operation for 1 output quark site (24 floats) requires 2 x 4 input quark sites (24 x 8 floats) and 2 x 4 input gluon links (18 x 8 floats). 1.4 kB of local storage required per quark update?
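
Adding up those counts (my arithmetic, following the slide's numbers):

    1 output quark site:          24 floats
    2 x 4 input quark sites:   8 x 24 = 192 floats
    2 x 4 input gluon links:   8 x 18 = 144 floats
    total:                     360 floats x 4 bytes = 1440 bytes ~ 1.4 kB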

CUDA parallelization: must process many quark updates simultaneously. Odd/even sites are processed separately.

Programming Model (slide borrowed from NVIDIA): A kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by sharing data through shared memory and by synchronizing their execution. Threads from different blocks cannot cooperate. (Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid consists of blocks, e.g. Block (1, 1), and each block consists of threads. © NVIDIA Corporation 2006)
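
As a minimal illustration of this model (my sketch, not from the talk), a kernel launched over a grid of blocks in which the threads of each block share data through shared memory and synchronize:

    // Minimal sketch: threads of one block cooperate via shared memory
    // and __syncthreads(); different blocks cannot cooperate.
    __global__ void reverseWithinBlock(float *out, const float *in)
    {
        __shared__ float tile[256];                  // visible to this block only
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];                   // each thread loads one element
        __syncthreads();                             // wait for the whole block
        out[i] = tile[blockDim.x - 1 - threadIdx.x]; // read another thread's element
    }
    // launched as a grid of blocks, e.g.:
    //   reverseWithinBlock<<<numBlocks, 256>>>(d_out, d_in);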

DW parallelization: each thread processes 1 site. No communication required between threads! All threads in a warp execute the same code.

Step 1: read a neighbor site. Step 2: read the neighbor link. Step 3: accumulate into the output site. Step 4: read the next neighbor site. Step 5: read its link. Step 6: accumulate into the output site. And so on around the remaining neighbors.
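
A hedged sketch of this per-thread loop (assumed layout, placeholder index helpers, and a placeholder multiply-add in place of the real spin/color algebra; not the actual LQCD kernel):

    // Sketch only: each thread owns one output site and loops over the 8
    // neighbor directions, reading a neighbor quark site and the connecting
    // gluon link and accumulating into registers.
    __device__ int neighbor_index(int site, int dir) { return site; }        // placeholder
    __device__ int link_index(int site, int dir) { return site * 8 + dir; }  // placeholder

    __global__ void dw_apply_sketch(float *out, const float *quark,
                                    const float *gluon, int nSites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= nSites) return;
        float acc[24] = {0.0f};                      // output quark site
        for (int dir = 0; dir < 8; ++dir) {          // +/- x, y, z, t neighbors
            int nbr = neighbor_index(site, dir);     // Step 1: neighbor quark site
            int lnk = link_index(site, dir);         // Step 2: gluon link
            for (int c = 0; c < 24; ++c)             // Step 3: accumulate
                acc[c] += gluon[lnk * 18 + c % 18] * quark[nbr * 24 + c];
        }
        for (int c = 0; c < 24; ++c)
            out[site * 24 + c] = acc[c];
    }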

Occupancy (slide borrowed from NVIDIA): Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently. Limited by resource usage: registers, shared memory.

Optimizing threads per block (slide borrowed from NVIDIA): Choose threads per block as a multiple of warp size; avoid wasting computation on under-populated warps. More threads per block == better memory latency hiding. But more threads per block == fewer registers per thread, and kernel invocations can fail if too many registers are used. Heuristics: minimum of 64 threads per block (only if multiple concurrent blocks); 192 or 256 threads is a better choice; usually still enough registers to compile and invoke successfully. This all depends on your computation, so experiment!

Reminder -- each multiprocessor has: 16 kB shared memory, 16 k registers, 1024 active threads (max). High occupancy (roughly 25% or so) is needed for maximum performance.

DW: does it fit onto the GPU? Each thread naively requires 1.4 kB of fast local memory, but this can be cut to 0.2 kB by keeping only about 24 + 12 + 18 floats live at a time. An MP has 16 kB of shared memory, so Threads/MP = 16 / 0.2 = 80, which rounds down to 64 (multiple of 64 only). MP occupancy = 64 / 1024 = 6%.

6% occupancy sounds pretty bad! (Image: Andreas Kuehn / Getty)

How can we get better occupancy? Reminder -- each multiprocessor has: 16 kB shared memory, 16 k registers (= 64 kB of memory), and 1024 active threads (max). We want occupancy > 25%, and each thread requires 0.2 kB of fast local memory.
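
A quick check (my arithmetic from the numbers above) of why the register file, not shared memory, is the place to keep that per-thread data:

    25% occupancy     = 0.25 x 1024 = 256 active threads per MP
    storage needed    = 256 threads x 0.2 kB ~ 51 kB
    shared memory     = 16 kB                      (too small)
    register file     = 16 k registers x 4 bytes   = 64 kB (large enough)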

Registers as data (possible because no inter-thread communication is needed): instead of shared memory, registers are allocated as each thread's local storage.

Registers as data: registers can’t be indexed, so all loops must be EXPLICITLY expanded.

Code sample (approx. 1000 LOC automatically generated)
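
The generated code itself isn't reproduced here; a toy sketch of the "registers as data" style (mine, not the authors' generated kernel) looks like this, with every element in a named register and the component loop written out by hand:

    // Toy sketch: keep components in named register variables and expand
    // the loop explicitly, since registers cannot be indexed.
    __device__ void accumulate_site(const float *g, const float *q,
                                    float &a0, float &a1, float &a2,
                                    float &a3, float &a4, float &a5)
    {
        // load neighbor components into registers
        float q0 = q[0], q1 = q[1], q2 = q[2], q3 = q[3], q4 = q[4], q5 = q[5];
        // explicitly expanded "loop": one statement per component
        a0 += g[0] * q0;  a1 += g[1] * q1;  a2 += g[2] * q2;
        a3 += g[3] * q3;  a4 += g[4] * q4;  a5 += g[5] * q5;
    }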

Performance results: 44 Gigabytes/sec (Tesla C870); 82 Gigabytes/sec (GTX 280), i.e. 90 Gflops/s, completely bandwidth limited. For comparison: twice as fast as the Cell implementation (arXiv:0804.3654) and 20 times faster than CPU implementations.

GB/s vs Occupancy. (Bar charts: Tesla C870, 0 to 45 GB/s at occupancies 0%, 8%, 17%, >= 25%; GTX 280, 0 to 85 GB/s at occupancies 0%, 6%, 13%, >= 19%.) Surprise! Performance is very robust to low occupancy.

Device memory is the bottleneck, so coalesced memory accesses are crucial. Data reordering: instead of storing Quark 1 = (q1,1, q1,2, ... q1,24), Quark 2 = (q2,1, q2,2, ... q2,24), ... contiguously, store the components interleaved as q1,1, q2,1, q3,1, ..., q1,2, q2,2, q3,2, ... so that thread 0, thread 1, thread 2, ... read consecutive words.
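
A sketch of the reordered layout (assuming a plain structure-of-arrays indexing, which is my illustration rather than the exact LQCD layout):

    // Sketch only: component c of quark site s sits at q[c * nSites + s],
    // so at each step threads 0, 1, 2, ... of a warp read consecutive
    // words -- a coalesced access.
    __global__ void read_quarks_soa(float *out, const float *q, int nSites)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nSites) return;
        float sum = 0.0f;
        for (int c = 0; c < 24; ++c)       // 24 floats per quark site
            sum += q[c * nSites + s];      // stride 1 across neighboring threads
        out[s] = sum;
    }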

Memory coalescing: store even/odd lattices separately

When memory access isn’t perfectly coalesced, float4 arrays can sometimes hide latency. A float4 global memory read corresponds to a single CUDA instruction, and in case of a coalesce miss at least 4x the data is transferred.

When memory access isn’t perfectly coalesced, binding to textures can help: the fetch still corresponds to a single CUDA instruction, makes use of the texture cache, and can reduce the penalty for nearly coalesced accesses.
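
A sketch using the texture reference API of that era (my example, not the talk's code): bind a linear float4 array to a 1D texture and fetch through the texture cache:

    // Sketch: read a float4 array through the texture cache.
    texture<float4, 1, cudaReadModeElementType> quarkTex;

    __global__ void read_via_texture(float4 *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(quarkTex, i);   // cached 16-byte read
    }

    // Host side (error checking omitted):
    //   cudaBindTexture(0, quarkTex, d_quarks, n * sizeof(float4));
    //   read_via_texture<<<blocks, threadsPerBlock>>>(d_out, n);
    //   cudaUnbindTexture(quarkTex);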

Regarding textures, there are two kinds of memory: a linear array (can be modified in a kernel; can only be bound to a 1D texture) and a “CUDA array” (can’t be modified in a kernel; gets reordered for 2D/3D locality; allows various hardware features).

When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve. This gives 2D locality. (Wikipedia image)

Warnings: the effectiveness of float4 and textures depends on the CUDA hardware and driver (!). Certain “magic” access patterns are many times faster than others. Testing appears to be necessary.

Memory bandwidth test: a simple kernel with completely coalesced memory access, which should be optimal. Measured bandwidth: 54 Gigabytes/sec (GTX 280, versus 140 GB/s theoretical!).
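
A minimal sketch of such a bandwidth test (assumed, not the author's kernel): a fully coalesced copy kernel, timed on the host, with bandwidth computed as 2 x N x sizeof(float) / elapsed time:

    // Sketch only: one coalesced read plus one coalesced write per element.
    __global__ void copy_kernel(float *dst, const float *src, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }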

So why are the NVIDIA samples so fast? With the access pattern NVIDIA actually uses, the same test reaches 102 Gigabytes/sec instead of 54 (GTX 280, 140 GB/s theoretical).

Naive access pattern. (Figure: the memory regions read by Block 1 and Block 2 at Step 1 and Step 2.)

Modified access pattern (much more efficient). (Figure: the same blocks and steps with the reordered accesses.)

CUDA compiler (LOTS of optimization here): CUDA C code -> PTX code -> CUDA machine code. Use the unofficial CUDA disassembler to view the CUDA machine code (CUDA disassembly).

CUDA Disassembler (decuda): take foo.cu, compile and save the cubin file, then disassemble it (e.g., nvcc -cubin foo.cu followed by decuda on the resulting .cubin).

Look how CUDA implements integer division!

CUDA provides fast (but imperfect) trigonometry in hardware!
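
In CUDA C this hardware shows up as the fast-math intrinsics, e.g. __sinf() and __cosf(); a tiny example (mine, not from the slides):

    // __sinf()/__cosf() use the special-function hardware: fast, but less
    // accurate than the software sinf()/cosf().
    __global__ void fast_trig(float *out, const float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = __sinf(x[i]) * __cosf(x[i]);
    }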

The compiler is very aggressive in optimization: it will group memory loads together to minimize latency (snippet from LQCD). Notice: each thread reads 20 floats!
