Tech Talk NVIDIA CUDA


Published on July 23, 2009

Author: ruehmkorf

Source: slideshare.net

NVIDIA CUDA: The Compute Unified Device Architecture
Jens Rühmkorf, Tech Talk, DLR Köln-Porz, July 22nd 2009

References

University of Illinois at Urbana-Champaign, Wen-Mei Hwu & David Kirk, course ECE 498 AL, Spring 2009: http://courses.ece.illinois.edu/ece498/al/
Website about General-Purpose Computation on Graphics Hardware: http://gpgpu.org/developer/cuda
ACM Queue, Vol. 6 No. 2, March/April 2008 (issue on GPGPU): http://mags.acm.org/queue/20080304/
Dr. Dobb's: CUDA, Supercomputing for the Masses, Parts 1-13: http://www.ddj.com/architect/207200659
NVIDIA CUDA Best Practices Guide: http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide_2.3.pdf
Hubert Nguyen (ed.), GPU Gems 3, Addison-Wesley, 2007, online: http://developer.nvidia.com/object/gpu-gems-3.html

Multi- and Manycore Architectures: A Difficult Road Lies Ahead

Don Knuth on multicore architectures: "[…] my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they're trying to pass the blame for the future demise of Moore's Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won't be surprised at all if the whole multithreading idea turns out to be a flop."
In: InformIT, April 25th 2008, http://www.informit.com/articles/article.aspx?p=1193856

Overview

A high-level view on CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example

Multicore and Manycore (1): Structural Differences

Multicore: a yoke of oxen. Each core is optimized for executing a single thread.
Manycore: a flock of chickens. Cores are optimized for aggregate throughput, deemphasizing individual performance.

Multicore and Manycore (2): Technical Characteristics

Specifications, Core i7 960 vs. GTX 285:
Processing elements: 4 cores, 4-way SIMD @ 3.2 GHz vs. 30 cores, 8-way SIMD @ 1.5 GHz
Resident threads (max): 4 cores x 2 threads x 4-wide SIMD = 32 strands vs. 30 cores x 32 SIMD vectors x 32-wide SIMD = 30720 strands
SP GFLOP/s: 102 vs. 1080
Memory bandwidth: 25.6 GB/s vs. 159 GB/s
Register file: - vs. 1.875 MB
Local store: - vs. 480 kB

Multicore and Manycore (3): Performance Comparison CPU vs. GPU
[Figure: single-precision floating point operations per second, CPU vs. GPU.]

An Example of the Physical Reality Behind CUDA
[Figure: a CPU (the host) and a GPU with local DRAM (the device).]

CUDA Processing Flow

CUDA in a Nutshell: Key Characteristics

CUDA is designed for wide SIMD/SPMD parallelism and scalability.
CUDA provides three key abstractions, forming a hierarchy: thread groups, shared memories, and barrier synchronization.
CUDA programs are written in C plus extensions.
OpenCL is inspired by CUDA, but is hardware- and software-vendor neutral; the programming model is essentially identical.

Hello World (hello-world.cu)

    // Compute vector sum C = A + B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    }
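
The slide omits the host-side setup behind d_A, d_B, and d_C. The following is a minimal sketch (not from the original slides) of what a complete main() might look like; the problem size N and the host arrays h_A, h_B, h_C are assumptions made purely for illustration.

    #include <cuda_runtime.h>

    #define N 1024                                 // assumed problem size (multiple of 256)

    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        size_t bytes = N * sizeof(float);
        float h_A[N], h_B[N], h_C[N];              // host arrays (assumed, left uninitialized here)
        float *d_A, *d_B, *d_C;                    // device pointers

        // allocate device memory and copy the inputs over
        cudaMalloc((void**)&d_A, bytes);
        cudaMalloc((void**)&d_B, bytes);
        cudaMalloc((void**)&d_C, bytes);
        cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

        // run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

        // copy the result back and release device memory
        cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        return 0;
    }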


CUDA Programming Model: Structure of a CUDA Application

An integrated host + device application C program:
Serial or modestly parallel parts run in host C code.
Highly parallel parts run in device SPMD kernel C code.

Serial code (host)
Parallel kernel (device): KernelA<<<nBlk, nTid>>>(args);
Serial code (host)
Parallel kernel (device): KernelB<<<nBlk, nTid>>>(args);

CUDA Programming Model: CUDA Devices and Threads

A compute device:
Is a coprocessor to the CPU (host),
Has its own DRAM (device memory),
Runs many threads in parallel,
Is typically a GPU but can also be another type of parallel processing device.
Express data-parallel portions of an application as device kernels, which run on many threads.

CUDA Programming Model: Arrays of Parallel Threads

Execute a kernel by specifying arrays of threads.
All threads run the same code (SPMD).
Each thread uses its thread ID (0, 1, 2, ... in the figure) to compute memory addresses and make control decisions:

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

CUDA Programming Model: Use Thread Blocks for (Scalable) Cooperation

Divide the monolithic thread array into multiple blocks.
Threads within a block can cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below).
Threads in different blocks cannot cooperate.
Each thread block (0, 1, ..., N-1) runs the same per-thread code, e.g.:

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
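
To make within-block cooperation concrete, here is a minimal sketch (not from the original slides) of a kernel in which the threads of one block stage data in shared memory, synchronize at a barrier, and then read values written by their neighbours; the kernel name and the fixed block size of 256 are assumptions.

    // Each block reverses its own 256-element tile of the input.
    // Threads cooperate through shared memory and a barrier.
    __global__ void reverseTile(float* in, float* out)
    {
        __shared__ float tile[256];                    // one tile per block

        int global = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[global];                // stage into shared memory

        __syncthreads();                               // barrier: whole tile is loaded

        // read an element written by a different thread of the same block
        out[global] = tile[blockDim.x - 1 - threadIdx.x];
    }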

CUDA Programming Model: Organisation of Thread Blocks

Thread blocks can be one-, two-, or three-dimensional arrays.
The host issues a sequence of kernel invocations (kernel 1, kernel 2) to the device.
Each kernel is executed as a batch of threads.
This batch is organized as a grid of thread blocks (the figure shows two-dimensional thread blocks).

CUDA Programming Model: Block IDs and Thread IDs

Each thread uses its IDs to decide what data to work on:
Block ID: 1D or 2D,
Thread ID: 1D, 2D, or 3D.
This simplifies memory addressing when processing multidimensional data.
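
As an illustration (not from the original slides), a kernel processing a 2D image might combine block and thread indices as follows; the kernel name and the parameters width, height, and factor are assumptions.

    // Map a 2D grid of 2D blocks onto the pixels of an image.
    __global__ void scalePixels(float* img, int width, int height, float factor)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row

        if (x < width && y < height)                     // guard the ragged edge
            img[y * width + x] *= factor;
    }

    // Host side: 16x16 threads per block, enough blocks to cover the image.
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   scalePixels<<<grid, block>>>(d_img, width, height, 2.0f);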


CUDA Memory Model: Overview

Global memory is the main means of communicating read/write data between host and device.
Its contents are visible to all threads.
Access has long latency.
We will focus on global memory for now; constant and texture memory are not covered here.
[Figure: a grid of thread blocks; each block has shared memory and per-thread registers; the host accesses global, constant, and texture memory.]

CUDA Memory Model: CUDA Device Memory Allocation

cudaMalloc():
Allocates an object in device global memory.
Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object.

cudaFree():
Frees an object from device global memory.
Takes a pointer to the freed object.

CUDA Memory Model: CUDA Host-Device Data Transfer

cudaMemcpy() performs memory data transfer.
Requires four parameters: pointer to destination, pointer to source, number of bytes to copy, and type of transfer.
The type of transfer is one of: host to host, host to device, device to host, device to device.
Asynchronous transfer is also possible.
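
A minimal sketch of a full round trip through device global memory (illustrative only; the buffer size of 256 floats is an assumption):

    float h_data[256];                       // host buffer
    float* d_data;                           // device pointer
    size_t bytes = 256 * sizeof(float);

    cudaMalloc((void**)&d_data, bytes);                          // allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that read/write d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_data);                                            // release device memory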


CUDA API: Extended C

[Figure: compilation path of an integrated source file (foo.cu): the cudacc / EDG C/C++ frontend splits it into CPU host code (foo.cpp, compiled with gcc / cl) and device code that goes through the Open64 global optimizer and OCG to GPU assembly (foo.s) and G80 SASS (foo.sass).]
See Mark Murphy, "NVIDIA's Experience with Open64", www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

CUDA API: C for CUDA

Function type specifiers: __global__, __device__, __host__
Variable type specifiers: __device__, __shared__, __constant__
Keywords: threadIdx, blockIdx
Intrinsics / builtin functions: __syncthreads()
Runtime API: memory, symbol, and execution management
Function launch

Example (image-convolution.cu):

    __device__ float filter[N];

    __global__ void convolve(float *image)
    {
        __shared__ float region[M];
        ...
        region[threadIdx] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes)

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);

CUDA API: CUDA Function Type Qualifiers (1)

__device__ float deviceFunc()    executed on: device    only callable from: device
__global__ void kernelFunc()     executed on: device    only callable from: host
__host__ float hostFunc()        executed on: host      only callable from: host

__global__ defines a kernel function and must return void.
__device__ and __host__ can be used together.
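
As a small illustration (not from the slides) of combining the qualifiers, the same helper can be compiled for both host and device when marked with both specifiers; the function names here are made up for the example.

    // Compiled once for the CPU and once for the GPU.
    __host__ __device__ float squared(float x)
    {
        return x * x;
    }

    __global__ void squareAll(float* data)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        data[i] = squared(data[i]);   // device-side call
    }

    // On the host, squared(3.0f) can be called like any ordinary C function.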

CUDA API: CUDA Function Type Qualifiers (2)

__device__ functions cannot have their address taken.
For functions executed on the device:
No recursion,
No static variable declarations inside the function,
No variable number of arguments.

CUDA API: CUDA Variable Type Qualifiers

__device__:
Resides in global memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through the runtime library.

__shared__ (optionally used together with __device__):
Resides in the shared memory space of a thread block,
Has the lifetime of the block,
Is only accessible from the threads within the block.

Not covered here: __constant__ (optionally used together with __device__):
Resides in constant memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through the runtime library.
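
A short sketch (illustrative, not from the slides) of how the three qualifiers appear in source code; the array sizes and names are assumptions, and the kernel assumes a block of at most 128 threads.

    __device__ float globalTable[64];        // global memory, lives as long as the application
    __constant__ float coefficients[16];     // constant memory, filled from the host via cudaMemcpyToSymbol

    __global__ void useQualifiers(float* out)
    {
        __shared__ float scratch[128];       // shared memory, one copy per thread block

        scratch[threadIdx.x] = globalTable[threadIdx.x % 64] * coefficients[threadIdx.x % 16];
        __syncthreads();
        out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[threadIdx.x];
    }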

CUDA API: Calling a Kernel Function, Execution Configuration

A kernel function (i.e. a __global__ function) must be called with an execution configuration:

    __global__ void kernelFunc(...) {...};

    dim3 dimGrid(100, 50);       // 5000 thread blocks
    dim3 dimBlock(4, 8, 8);      // 256 threads per block
    size_t sharedMemBytes = 64;  // 64 bytes of shared memory
    kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
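
Because the launch returns immediately, a host thread that needs the result must block explicitly. A minimal sketch, using the CUDA 2.x-era runtime call cudaThreadSynchronize() and the vecAdd kernel from the Hello World slide:

    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);   // returns immediately

    // Block the host until all previously launched device work has finished,
    // then pick up any launch or execution error.
    cudaError_t err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("vecAdd failed: %s\n", cudaGetErrorString(err));

    // Note: cudaMemcpy also synchronizes implicitly, waiting for prior kernels
    // before it copies d_C back to the host.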


A Simple CUDA Example: Matrix Multiplication

A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
Leave shared memory usage until later,
Local, register usage,
Thread ID usage,
Memory data transfer API between host and device.
Assume square matrices for simplicity.

Simple CUDA Example: Square Matrix Multiplication

P = M * N, each of size WIDTH x WIDTH.
Here: without tiling!
One thread calculates one element of P.
M and N are loaded WIDTH times from global memory.

Memory Layout of a Matrix in C
[Figure: a 4 x 4 matrix M is stored linearly, row by row: M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3.]

Step 1: Matrix Multiplication, A Simple Host Version in C

    // Matrix multiplication on the (CPU) host in double precision
    void matrixMulOnHost(float* M, float* N, float* P, int width)
    {
        for (int i = 0; i < width; ++i)
            for (int j = 0; j < width; ++j) {
                double sum = 0;
                for (int k = 0; k < width; ++k) {
                    double a = M[i * width + k];
                    double b = N[k * width + j];
                    sum += a * b;
                }
                P[i * width + j] = sum;
            }
    }

Step 2: Input Matrix Data Transfer (Host-sided Code)

    void matrixMulOnDevice(float* M, float* N, float* P, int width)
    {
        int size = width * width * sizeof(float);
        float *Md, *Nd, *Pd;
        …

        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc(&Pd, size);

Step 3: Output Matrix Data Transfer (Host-sided Code)

        // 2. Kernel invocation code (shown later)
        …

        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(Md);
        cudaFree(Nd);
        cudaFree(Pd);
    }

Step 4: Kernel Function (1)

    // Matrix multiplication kernel: per-thread code
    __global__ void matrixMulKernel(float* Md, float* Nd, float* Pd, int width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;
        // continued on the next slide

Step 4: Kernel Function (2)

        for (int k = 0; k < width; ++k) {
            float Melement = Md[k + threadIdx.y * width];
            float Nelement = Nd[threadIdx.x + k * width];
            Pvalue += Melement * Nelement;
        }

        int i = threadIdx.x + threadIdx.y * width;
        Pd[i] = Pvalue;
    }

Step 5: Kernel Invocation (Host-sided Code)

        // Insert into step 2. from before

        // Setup the execution configuration
        dim3 dimGrid(1, 1);
        dim3 dimBlock(width, width);

        // Launch the device computation threads!
        matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

Example Far from Ideal: Only One Thread Block Used

One block of threads computes the matrix Pd:
Each thread computes one element of Pd.
Each thread loads a row of matrix Md, loads a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements.
The compute to off-chip memory access ratio is close to 1:1 (not very high).
The size of the matrix is limited by the number of threads allowed in a thread block.
(A sketch of a multi-block launch follows below.)
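
One way to lift the single-block limitation is to cover Pd with a 2D grid of fixed-size blocks. This is a hedged sketch of the launch configuration only (the tile width of 16 is an assumption, and the kernel would also have to use blockIdx in its index arithmetic, which the kernel above does not yet do):

    #define TILE_WIDTH 16   // assumed block edge length

    // Each block covers a TILE_WIDTH x TILE_WIDTH patch of Pd.
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid((width + TILE_WIDTH - 1) / TILE_WIDTH,
                 (width + TILE_WIDTH - 1) / TILE_WIDTH);
    matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

    // Inside the kernel, the column/row would then be computed as:
    //   int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    //   int row = blockIdx.y * TILE_WIDTH + threadIdx.y;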

CUDA: A Bright Future?

"Let your word be 'Yes, yes' or 'No, no'; anything more than this comes from evil." (Matthew 5:37)

NVIDIA CUDA: Appendix
Best Practices & Things to Watch Out For

Appendix: Best Practices & Things to Watch Out For

Obtain relevant hardware data
Compiling a CUDA program
Linking
Debugging
C for CUDA vs. CUDA Driver API
Watch out: floating point computations
Unsupported C language elements
Branching of code
Coalesced access to device global memory
Access patterns to avoid bank conflicts

Obtain Relevant Hardware Data

Make sure to obtain relevant additional hardware data: call cudaGetDeviceProperties().
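
A minimal sketch of such a query (illustrative; the fields printed here are only a small selection of what cudaDeviceProp offers):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main()
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("device %d: %s\n", dev, prop.name);
            printf("  compute capability: %d.%d\n", prop.major, prop.minor);
            printf("  multiprocessors:    %d\n", prop.multiProcessorCount);
            printf("  global memory:      %lu bytes\n", (unsigned long)prop.totalGlobalMem);
            printf("  shared mem / block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
            printf("  max threads/block:  %d\n", prop.maxThreadsPerBlock);
        }
        return 0;
    }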

Compiling a CUDA Program (1)

Parallel Thread eXecution (PTX): a virtual machine and ISA (instruction set architecture); it defines the programming model and the execution resources and state.
[Figure: NVCC compiles a C/C++ CUDA application (e.g. float4 me = gx[gtid]; me.x += me.y * me.z;) into CPU code plus virtual PTX code; a PTX-to-target compiler then translates the PTX (e.g. ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; mad.f32 $f1, $f5, $f3, $f1;) into physical target code for G80 and other GPUs.]

Compiling a CUDA Program (2): NVCC as a Compiler Driver

Any source file containing CUDA language extensions must be compiled with NVCC.
NVCC is a compiler driver: it works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
NVCC outputs:
C code (host CPU code), which must then be compiled with the rest of the application using another tool,
PTX object code directly, or PTX source, interpreted at runtime.

Linking

Any executable with CUDA code requires two dynamic libraries:
The CUDA runtime library (cudart),
The CUDA core library (cuda).

Debugging Using the Device Emulation Mode

An executable compiled in device emulation mode (enabled via nvcc -deviceemu) runs completely on the host using the CUDA runtime:
No need for any device or CUDA driver,
Each device thread is emulated with a host thread.
Running in device emulation mode, one can:
Use host native debug support (breakpoints, inspection, etc.),
Access any device-specific data from host code and vice versa,
Call any host function from device code (e.g. printf) and vice versa,
Detect deadlock situations caused by improper usage of __syncthreads().

Device Emulation Mode Pitfalls

Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results.
Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.

CUDA Driver API vs. C for CUDA (1)

[Figure: the same compilation flow as on the earlier "Extended C" slide: foo.cu is split by the cudacc / EDG frontend into host code (foo.cpp, gcc / cl) and device code compiled via the Open64 global optimizer and OCG into foo.s / foo.sass. Source: Mark Murphy, "NVIDIA's Experience with Open64", www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc]

CUDA Driver API vs. C for CUDA (2)
Mutually Exclusive: Choose One or the Other

C runtime for CUDA: The C runtime for CUDA handles kernel loading and setting up kernels before they are launched. Implicit code initialization, CUDA context management, CUDA module management (cubin and function mapping), kernel configuration, and parameter passing are all performed by the C runtime for CUDA. It comprises two principal parts:
The low-level functions (cuda_runtime_api.h) have a C-style interface that does not require compilation with nvcc.
The high-level functions (cuda_runtime.h) have a C++-style interface built on top of the low-level functions.
Of these, the high-level functions are the most commonly used. They wrap some of the low-level functions, using overloading, references, and default arguments. These wrappers can be used from C++ code and can be compiled with any C++ compiler.

Driver API: The driver API is a lower-level API than the runtime API. When compared with the runtime API, the driver API has these advantages:
No dependency on the runtime library,
More control over devices (for example, only the driver API enables one CPU thread to control multiple GPUs),
No C extensions in the host code, so compilers other than the default CPU compiler can be used.
Its primary disadvantages:
Verbose code,
Greater difficulty in debugging,
No device emulation.
A key point is that for every runtime API function, there is an equivalent driver API function. The driver API, however, includes other functions missing in the runtime API, such as those for migrating a context from one host thread to another.

CUDA Driver API vs. C for CUDA (3)
Example: Vector Addition Using C for CUDA

    const unsigned int cnBlockSize = 512;
    const unsigned int cnBlocks = 3;
    const unsigned int cnDimension = cnBlocks * cnBlockSize;

    // create CUDA device & context
    cudaSetDevice(0); // pick first device

    // allocate host vectors
    float *pA = new float[cnDimension];
    float *pB = new float[cnDimension];
    float *pC = new float[cnDimension];

    // initialize host memory
    randomInit(pA, cnDimension);
    randomInit(pB, cnDimension);

    // allocate device memory
    float *pDeviceMemA, *pDeviceMemB, *pDeviceMemC;
    cudaMalloc((void **)&pDeviceMemA, cnDimension * sizeof(float));
    cudaMalloc((void **)&pDeviceMemB, cnDimension * sizeof(float));
    cudaMalloc((void **)&pDeviceMemC, cnDimension * sizeof(float));

    // copy host vectors to device
    cudaMemcpy(pDeviceMemA, pA, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(pDeviceMemB, pB, cnDimension * sizeof(float), cudaMemcpyHostToDevice);

    vectorAdd<<<cnBlocks, cnBlockSize>>>(pDeviceMemA, pDeviceMemB, pDeviceMemC);

    // copy result from device to host
    cudaMemcpy((void *)pC, pDeviceMemC, cnDimension * sizeof(float), cudaMemcpyDeviceToHost);

    delete[] pA;
    delete[] pB;
    delete[] pC;

    cudaFree(pDeviceMemA);
    cudaFree(pDeviceMemB);
    cudaFree(pDeviceMemC);

CUDA Driver API vs. C for CUDA (4)
Example: Vector Addition Using the CUDA Driver API

    const unsigned int cnBlockSize = 512;
    const unsigned int cnBlocks = 3;
    const unsigned int cnDimension = cnBlocks * cnBlockSize;

    #define ALIGN_UP(offset, alignment) \
        (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

    CUdevice hDevice;
    CUcontext hContext;
    CUmodule hModule;
    CUfunction hFunction;

    // create CUDA device & context
    cuInit(0);
    cuDeviceGet(&hDevice, 0); // pick first device
    cuCtxCreate(&hContext, 0, hDevice);
    cuModuleLoad(&hModule, "vectorAdd.cubin");
    cuModuleGetFunction(&hFunction, hModule, "vectorAdd");

    // allocate host vectors
    float *pA = new float[cnDimension];
    float *pB = new float[cnDimension];
    float *pC = new float[cnDimension];

    // initialize host memory
    randomInit(pA, cnDimension);
    randomInit(pB, cnDimension);

    // allocate memory on the device
    CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
    cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
    cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
    cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));

    // copy host vectors to device
    cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
    cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));

    // set up parameter values
    cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);

    int offset = 0;
    void* ptr;

    ptr = (void*)(size_t)pDeviceMemA;
    ALIGN_UP(offset, __alignof(ptr));
    cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
    offset += sizeof(ptr);

    ptr = (void*)(size_t)pDeviceMemB;
    ALIGN_UP(offset, __alignof(ptr));
    cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
    offset += sizeof(ptr);

    ptr = (void*)(size_t)pDeviceMemC;
    ALIGN_UP(offset, __alignof(ptr));
    cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
    offset += sizeof(ptr);

    cuParamSetSize(hFunction, offset);

    // execute kernel
    cuLaunchGrid(hFunction, cnBlocks, 1);

    // copy the result from device back to host
    cuMemcpyDtoH((void *)pC, pDeviceMemC, cnDimension * sizeof(float));

    delete[] pA;
    delete[] pB;
    delete[] pC;

    cuMemFree(pDeviceMemA);
    cuMemFree(pDeviceMemB);
    cuMemFree(pDeviceMemC);

Watch Out: Floating Point Computations
Differing Results of FP Computations

Results of floating-point computations will slightly differ because of:
Different compiler outputs and instruction sets,
Use of extended precision for intermediate results.
There are various options to force strict single precision on the host.

Watch Out: Floating Point Computations
Single and Double Precision Operations

Double precision:
No deviations from the IEEE 754 standard.
Single precision:
Denormals and signalling NaNs are not supported;
Only two IEEE rounding modes are supported (chop and round-to-nearest even); and
The precision of division/square root is slightly lower than single precision.

Limitations (1): Only a Subset of C Available

C for CUDA offers only a subset of the C language:
Recursion-free,
Function-pointer-free.
Functions reside in the global device memory, therefore we cannot obtain their addresses.

Limitations (2): Branching in Program Code

For best performance:
Threads should be running in groups of 32 threads (32 threads = 1 warp),
All threads of a warp should take the same execution path,
Otherwise, branching will probably hurt performance. (A short illustration follows below.)
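
A hedged illustration (not from the slides) of the difference between a divergent and a warp-uniform branch; the kernel name is made up, and a warp size of 32 is assumed.

    __global__ void branchExample(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Divergent: even and odd lanes of the same warp take different paths,
        // so the warp executes both branches one after the other.
        if (threadIdx.x % 2 == 0)
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;

        // Warp-uniform: all 32 threads of a warp evaluate the condition the
        // same way, so no serialization occurs.
        if ((threadIdx.x / 32) % 2 == 0)
            data[i] -= 0.5f;
    }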

Coalesced Access to Device Global Memory

High priority: ensure that global memory accesses are coalesced whenever possible.
Global memory loads and stores by threads of a half warp (16 threads) are coalesced by the device into as few as one transaction (or two transactions in the case of 128-bit words).
But: certain access requirements have to be met. (A sketch of a coalesced vs. an uncoalesced pattern follows after the figures below.)

Coalesced Access (Reading Floats)
[Figure: examples of access patterns that coalesce when reading floats.]

Uncoalesced Access (Reading Floats)
[Figure: examples of access patterns that do not coalesce when reading floats.]
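
A hedged sketch (not from the slides) contrasting an access pattern that coalesces with one that does not on G80-class hardware; the kernel name and the stride of 16 floats are assumptions for illustration.

    __global__ void accessPatterns(float* in, float* out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Coalesced: consecutive threads of a half warp read consecutive
        // 32-bit words, so the hardware can merge them into few transactions.
        float a = in[tid];

        // Uncoalesced: a stride of 16 floats scatters the half warp's
        // accesses across memory segments, producing separate transactions.
        float b = in[(tid * 16) % n];

        out[tid] = a + b;
    }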

Shared Memory: Bank Conflicts

Shared memory: 16 KB, organized in 16 banks of 1 KB each.
As fast as a register ... if no bank conflicts occur!
Bank conflict: more than one thread in the same half-warp accesses the same bank.
The access then needs to be serialized; cost = max (# of concurrent accesses to one bank).
(A sketch follows after the figures below.)

Shared Memory: No Bank Conflicts
[Figure: conflict-free access patterns: linear addressing with step size 1, a random permutation, linear addressing with step size 3, and broadcast of a single word to several threads.]

Shared Memory: Bank Conflicts
[Figure: conflicting access patterns: linear addressing with step size 2 (2-way bank conflict) and linear addressing with step size 8 (8-way bank conflict).]
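
A hedged sketch (not from the slides) of a stride-1 versus a stride-2 access to shared memory; with 16 banks, the stride-2 pattern makes pairs of threads in the same half warp hit the same bank. The kernel name and the block size of 256 threads are assumptions.

    // Assumes a block of 256 threads and the 16-bank shared memory of G80-class GPUs.
    __global__ void bankConflictDemo(float* out)
    {
        __shared__ float buf[512];
        buf[threadIdx.x]       = (float)threadIdx.x;   // initialize the buffer
        buf[threadIdx.x + 256] = (float)threadIdx.x;
        __syncthreads();

        // Stride 1: thread t reads word t; the 16 threads of a half warp hit
        // 16 different banks, so there is no conflict.
        float noConflict = buf[threadIdx.x];

        // Stride 2: thread t reads word 2*t; threads t and t+8 of the same
        // half warp map to the same bank, giving a 2-way conflict that the
        // hardware serializes.
        float conflict = buf[2 * threadIdx.x];

        out[threadIdx.x] = noConflict + conflict;
    }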
