Lecture II


Published on June 19, 2007

Author: Belly

Source: authorstream.com

GP2: General-Purpose Computation using Graphics Processors
http://gamma.cs.unc.edu/GPGP
Spring 2007, Department of Computer Science, UNC Chapel Hill
Dinesh Manocha & Avneesh Sud
Lecture 2: January 17, 2007

Class Schedule
- Current time slot: 2:00-3:15pm, Mon/Wed, SN011
- Office hours: TBD
- Class mailing list: gpgp@cs.unc.edu (should be up and running)

GPGP
- The GPU on commodity video cards has evolved into an extremely flexible and powerful processor: programmability, precision, power
- This course will address how to harness that power for general-purpose (non-rasterization) computation: algorithmic issues, programming and systems, applications

Capabilities of Current GPUs
- Modern GPUs are deeply programmable: programmable pixel, vertex, and video engines; solidifying high-level language support
- Modern GPUs support 32-bit floating-point precision: great development in the last few years; 64-bit arithmetic may be coming soon; almost IEEE FP compliant

The Potential of GPGP
- The power and flexibility of GPUs make them an attractive platform for general-purpose computation
- Example applications range from in-game physics simulation and geometric applications to conventional computational science
- Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
- Check out http://www.gpgpu.org

GPGP: Challenges
- GPUs are designed for and driven by video games: the programming model is unusual and tied to computer graphics; the programming environment is tightly constrained
- The underlying architectures are inherently parallel, rapidly evolving (even in the basic feature set!), and largely secret
- No clear standards (besides DirectX, imposed by Microsoft)
- You can't simply "port" code written for the CPU!
- Is there a formal class of problems that can be solved using current GPUs?

Importance of Data Parallelism
- GPUs are designed for the graphics and gaming industry: highly parallel tasks
- GPUs process independent vertices & fragments: temporary registers are zeroed; no shared or static data; no read-modify-write buffers
- Data-parallel processing: the GPU architecture is ALU-heavy (multiple vertex & pixel pipelines, multiple ALUs per pipe) and hides memory latency with more computation

Goals of this Course
- A detailed introduction to general-purpose computing on graphics hardware
- Emphasis includes core computational building blocks and strategies and tools for programming GPUs
- Cover many applications and explore new applications
- Highlight major research issues

Course Organization
- Survey lectures: instructors, other faculty, senior graduate students
- Breadth and depth coverage
- Student presentations

Course Contents
- Overview of GPUs: architecture and features
- Models of computation for GPU-based algorithms
- System issues: cache and data management; languages and compilers
- Numerical and scientific computations: linear algebra, optimization, FFT, rigid-body simulation, fluid dynamics
- Geometric computations: proximity computations; distance fields; motion planning and navigation
- Database computations: database queries (predicates, booleans, aggregates); streaming databases and data mining; sorting & searching
- GPU clusters: parallel computing environments for GPUs
- Rendering: ray tracing, photon mapping, shadows

Student Load
- Stay awake in classes!
- One class lecture
- Read a lot of papers
- 1-2 small assignments
- A MAJOR COURSE PROJECT WITH A RESEARCH COMPONENT

Course Projects
- Work by yourself or as part of a small team
- Develop new algorithms for simulation, geometric problems, or database computations
- Formal models for GPU algorithms, or GPU hacking
- Issues in developing GPU clusters for scientific computation
- Look into new architecture and parallel-programming trends

Course Projects: Importance
- If you are planning to take this course for credit, start thinking about the course project ASAP
- It is important that your project has some novelty to it: it shouldn't be just a GPU hack
- You need to work on a problem or application for which GPUs are a good candidate; GPUs are not a good solution for many problems
- It is OK to work in groups of 2 or 3 (for a large project)
- Periodic milestones to monitor progress
- Project proposals due by February 10
- Monthly progress reports (will count towards the final grade)

Course Projects: Possible Topics
- Comparing GPU capabilities with other emerging architectures (e.g. Cell, multi-core, other data-parallel processors)
- Numerical computations, some of the prime candidates for GPU acceleration: sparse matrix computations; numerical linear algebra (SVD, QR computations)
- Applications (like WWW search)
- Power efficiency of GPU algorithms
- Programming environments for GPUs (talk to Jan Prins)
- GPU clusters and high-performance computing using GPUs
- Scientific computations (possible collaboration with RENCI)
- Data mining algorithms (talk to Wei Wang or Jan Prins)
- Physically-based simulation, e.g. fluid simulation (talk to Ming Lin)
- Others ...

Course Topics & Lectures
- Focus on breadth
- Quite a few guest and student lectures:
  - Overview of OpenGL and GPU programming (Wendt, Jan. 22)
  - Cell processor (Stephen Olivier, Jan. 24)
  - NVIDIA G80 architecture (Steve Molnar, Jan. 29)
  - CUDA programming environment (Lars Nyland, Jan. 31)
  - Lectures on CTM (ATI)

Heterogeneous Computing Systems & GPUs

What are Heterogeneous Computing Systems?
- Develop computer systems and applications that are scalable from a system with a single homogeneous processor to a high-end computing platform with tens, or even hundreds, of thousands of heterogeneous processors

What are Heterogeneous Computing Systems?
- "Heterogeneous computing systems are those with a range of diverse computing resources that can be local to one another or geographically distributed. The pervasive use of networks and the internet by all segments of modern society means that the number of connected computing resources is growing tremendously." (from the International Workshop on Heterogeneous Computing, early 1990s)

Computing using Accelerators
- The GPU is one type of accelerator (commodity and easily available)
- Other accelerators: Cell processor, ClearSpeed

Organization
- Current architectures
- Use of accelerators
- Programming environments for accelerators

Current Architectures
- Multi-core architectures
- Processors lowering communication costs
- Heterogeneous processors

Multi-Core Processor

Multi-Core Architectures
- http://gamma.cs.unc.edu/EDGE/SLIDES/agarwal.pdf

Multi-Core: Motivation

Multi-Core: Growth Rate

Sun's Niagara Chip: Chip Multi-Threaded Processor
- http://gamma.cs.unc.edu/EDGE/SLIDES/shoaib.pdf

Efficient Processors
- Reduce communication costs [Dally '03]
- PCA architectures: http://www.darpa.mil/ipto/Programs/pca/index.htm
- GPUs
- Streaming processors: http://cva.stanford.edu/publications/2004/spqueue.pdf
- Other data-parallel processors (PPUs, ClearSpeed)
- FPGAs

Heterogeneous Processors
- Combining different types of processors on one chip
- Cell BE Processor
- AMD Fusion architecture

Cell BE Processor Overview
- IBM, SCEI/Sony, and Toshiba alliance formed in 2000; design center opened in March 2001, based in Austin, Texas; ~$400M investment
- February 7, 2005: first technical disclosures
- Designed for the Sony PlayStation 3: a commodity processor
- Cell is an extension to the IBM Power family of processors
- Sets new performance standards for computation & bandwidth
- High affinity to HPC workloads: seismic processing, FFT, BLAS, etc.
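The accelerator model these slides keep returning to (a general-purpose host driving specialized data-parallel workers, as in the PPE/SPE split) can be caricatured in a few lines of Python. The `offload` helper and its thread pool are purely illustrative stand-ins, not Cell or GPU code: the point is the shape of the pattern, where the host partitions the data, identical kernels run independently on each chunk, and the host gathers results in order.

```python
from concurrent.futures import ThreadPoolExecutor

def offload(kernel, data, num_workers=4):
    """Host side of a hypothetical accelerator: split `data` into chunks,
    run the same kernel on every chunk in parallel, gather results in order."""
    chunk = max(1, len(data) // num_workers)
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each "worker" applies the kernel elementwise to its own chunk;
        # no shared state, mirroring the independent-fragment model.
        results = pool.map(lambda c: [kernel(x) for x in c], chunks)
    out = []
    for part in results:   # map() preserves chunk order
        out.extend(part)
    return out

doubled = offload(lambda x: 2 * x, list(range(8)))
```

Because the kernel sees only its own inputs and writes only its own outputs, the same code shape maps onto any of the accelerators listed above.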
Cell BE Processor Features
- Heterogeneous multi-core system architecture: a Power Processor Element (PPE) for control tasks and Synergistic Processor Elements (SPEs) for data-intensive processing
- Each SPE consists of a Synergistic Processor Unit (SPU) and Synergistic Memory Flow Control (SMF): data movement and synchronization; interface to the high-performance Element Interconnect Bus
- (Block diagram: PPE with L2 (32B/cycle) and SPEs on the EIB (up to 96B/cycle, 16B/cycle per port); MIC to dual XDR memory and BIC to FlexIO; 64-bit Power Architecture with VMX)

Cell BE Architecture
- Combines multiple high-performance processors on one chip: 9 cores, 10 threads
- A 64-bit Power Architecture core (PPE) plus 8 Synergistic Processor Elements (SPEs) for data-intensive processing
- Current implementation: roughly 10 times the performance of a Pentium for computationally intensive tasks
- Clock: 3.2 GHz (measured at >4 GHz in the lab)

Peak GFLOPs (Cell SPEs only)
- (Chart comparing FreeScale DC 1.5 GHz, PPC 970 2.2 GHz, AMD DC 2.2 GHz, Intel SC 3.6 GHz, and Cell 3.0 GHz)

Cell BE Processor Can Support Many Systems
- Game console systems, blades, HDTV, home media servers, HPC, ...
- (Diagrams: one or more Cell BE processors with dual XDR memory, connected via IOIF and BIF interfaces, optionally through a switch)

AMD's Fusion Architecture
- (Figure slides)

Organization
- Current architectures
- Use of accelerators: single-workstation (real-world) applications; high-performance computing
- Programming environments for accelerators

NON-Graphics Pipeline Abstraction (GPGPU)
- (Pipeline diagram: data setup via programmable MIMD processing (fp32); SIMD "rasterization" over lists; data fetch via programmable SIMD processing (fp32); predicated write and fp16 blend to multiple output memories. Courtesy: David Kirk, NVIDIA)

Sorting and Searching
- "I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!" - Don Knuth

Massive Databases
- Terabyte data sets are common: Google sorts more than 100 billion terms in its index; >1 trillion records in the indexed web (unconfirmed sources)
- Database sizes are increasing rapidly! Maximum DB sizes increase 3x per year (http://www.wintercorp.com)
- Processor improvements are not matching the information explosion

General Sorting on GPUs
- Design sorting algorithms with deterministic memory accesses, mapped to "texturing" on GPUs: 86 GB/s peak memory bandwidth (NVIDIA 8800); can better hide the memory latency!
- Require only minimum and maximum computations, mapped to "blending" functionality on GPUs: low branching overhead; no data dependencies
- Utilize the high parallelism of GPUs

Sorting on GPU: Pipelining and Parallelism
- Input vertices → texturing, caching, and 2D quad comparisons → sequential writes

Comparison with Prior GPU-Based Algorithms
- 3-6x faster than prior GPU-based algorithms!

Sorting: GPU vs. Multi-Core CPUs
- 2-2.5x faster than Intel's high-end processors
- Single-GPU performance comparable to a high-end dual-core Athlon
- Hand-optimized CPU code from Intel Corporation!
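Sorting networks such as Batcher's bitonic network are a natural fit for the constraints listed above: every compare position is fixed in advance (deterministic memory accesses) and each step needs only min/max operations, with no data-dependent branching. The Python sketch below is a CPU-side illustration of a bitonic sort, not the actual GPUTeraSort implementation; note that the inner loop over `i` is fully parallel, which is what a GPU pass exploits.

```python
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two."""
    n = len(a)
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # compare-exchange distance within each pass
            # Every iteration of this loop is independent: on a GPU,
            # one pass computes all positions in parallel.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    lo = min(a[i], a[partner])
                    hi = max(a[i], a[partner])
                    if ascending:
                        a[i], a[partner] = lo, hi
                    else:
                        a[i], a[partner] = hi, lo
            j //= 2
        k *= 2
    return a
```

Because the only per-element work is a min and a max written to fixed positions, each pass maps directly onto the texturing and blending operations the slides describe.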
External Memory Sorting
- Performed on terabyte-scale databases with limited main memory
- Two-phase algorithm: the first phase partitions the input file into large data chunks and writes sorted chunks known as "runs"; the second phase merges the runs to generate the sorted file
- N. Govindaraju, J. Gray, R. Kumar and D. Manocha, Proc. of ACM SIGMOD 2006

External Memory Sorting using GPUs
- External memory sorting on CPUs can have low performance due to high memory latency and low I/O performance
- Our GPU-based algorithm sorts large data arrays on the GPU and performs I/O operations in parallel on the CPU

GPUTeraSort
- Govindaraju et al., SIGMOD 2006

Overall Performance
- Faster and more scalable than dual Xeon processors (3.6 GHz)!

Performance/$
- 1.8x faster than the current terabyte sorter
- World's best performance/$ system

GPUTeraSort: PennySort Winner 2006
- "These results paint a clear picture for progress on processor speeds. When you measure records-sorted-per-cpu-second, the speed plateaued in 1995 at about 200k records/second/cpu. This year saw a breakthrough with GpuTeraSort, which uses the GPU interface to drive the memory more efficiently (and uses the 10x greater memory bandwidth inside the GPU). GpuTeraSort gave a 3x records/second/cpu improvement. There is a lot of effort on multi-core processors, and comparatively little effort on addressing the 'core' problems: (1) the memory architecture, and (2) the way processors access memory. Sort demonstrates those problems very clearly." - Jim Gray (Microsoft) [NY Times, November 2006]

GPUFFTW (1D & 2D FFT)
- N. Govindaraju, S. Larsen, J. Gray and D. Manocha, SuperComputing 2006
- Download URL: http://gamma.cs.unc.edu/GPUFFTW
- 4x faster than IMKL on high-end quad cores
- SlashDot headlines, May 2006

Digital Breast Tomosynthesis (DBT)
- 100x reconstruction speed-up with an NVIDIA Quadro FX 4500 GPU: from hours to minutes
- Facilitates clinical use; improved diagnostic value: clearer images, fewer obstructions, earlier detection
- (Diagram: X-ray tube rotating about an axis above the compression paddle, compressed breast, and digital detector; 11 low-dose X-ray projections; extremely computationally intense reconstruction)
- Advanced Imaging Solution of the Year: "Mercury reduced reconstruction time from 5 hours to 5 minutes, making DBT clinically viable. ... among 70 women diagnosed with breast cancer, DBT pinpointed 7 cases not seen with mammography"
- Pioneering DBT work at Massachusetts General Hospital

Electromagnetic Simulation
- 3D finite-difference and finite-element modeling of: cell phone irradiation; MRI design/modeling; printed circuit boards; radar cross section (military)
- Computationally intensive!
- Large speedups with Quadro GPUs
- (Chart: pacemaker with transmit antenna; commercial, optimized, mature software; 1x on a single 3.x GHz CPU vs. 5x, 10x, and 18x with 1, 2, and 4 Quadro FX 4500 GPUs)

Havok FX Physics on NVIDIA GPUs
- Physics-based effects on a massive scale: 10,000s of objects at high frame rates
- Rigid bodies, particles, fluids, cloth, and more

Dedicated Performance for Physics
- Performance measurement: frame rate on a 15,000-boulder scene
- CPU physics (dual-core P4EE 955, 3.46 GHz; GeForce 7900 GTX SLI; CPU multi-threading enabled): 6.2 fps
- GPU physics (same system): 64.5 fps

GPUs: High Memory Throughput
- 50 GB/s peak on a single GPU (NVIDIA 7900)
- Effectively hide memory latency with 15 GOP/s

Microsoft Vista & GPUs
- Windows Vista is the first Windows operating system that directly utilizes the power of a dedicated GPU
- High-end GPUs are essential for accelerating the Windows Vista experience: an enriched 3D user interface, increased productivity, vibrant photos, smooth high-definition video, and realistic games
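The memory-throughput figures above imply a balance point between bandwidth and compute: a kernel must perform enough arithmetic per fetched word or it becomes bandwidth-bound and cannot exploit the 15 GOP/s. The helper below is just back-of-envelope arithmetic on the slide's stated peak numbers (50 GB/s, 15 GOP/s, 4-byte floats), not a measured quantity.

```python
def ops_per_element(compute_gops, bandwidth_gbs, bytes_per_element=4):
    """How many arithmetic ops per fetched element a kernel needs before
    compute, rather than memory bandwidth, becomes the bottleneck."""
    bytes_per_second = bandwidth_gbs * 1e9
    elements_per_second = bytes_per_second / bytes_per_element
    return compute_gops * 1e9 / elements_per_second

# With the slide's peak numbers: about 1.2 ops per 32-bit float fetched.
balance = ops_per_element(15, 50)
```

Kernels below this arithmetic intensity stream data at full bandwidth; kernels above it keep the ALUs busy while memory latency is hidden, which is the regime the slide is pointing at.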
GPUs as Accelerators
- GPUs are primarily designed for rasterization and are programmed using graphics APIs
- Specialized algorithms for different applications demonstrate higher performance
- In spite of these limitations, good speedups were demonstrated
- What if we had the right API and programming environment for GPUs?

Accelerators for HPC
- A recent trend is to use accelerators to achieve 100-1000 TFLOP performance
- RoadRunner (LANL): plans to use 16,000 Cell processors (expected petaflop performance)
- Tsubame cluster (Tokyo): 360 ClearSpeed accelerators (47 TFLOP performance)

Organization
- Current architectures
- Use of accelerators
- Programming environments for accelerators

Thread Parallelism is Upon Us (Smith '06)
- Uniprocessor performance is leveling off: instruction-level parallelism is nearing its limit; power per chip is painfully high for client systems
- Meanwhile, logic cost ($ per gate-Hz) continues to fall
- What are we going to do with all that hardware?
- Newer microprocessors are multi-core and/or multithreaded; so far, it's just "more of the same" architecturally
- Now we also have heterogeneous processors

Thread Parallelism
- We expect new "killer apps" will need more performance: semantic analysis and query; improved human-computer interfaces (e.g. speech, vision); games
- Which and how much thread parallelism can we exploit? This is a good question for both hardware and software

Programming the Accelerators
- Data-parallel processors
- Improved APIs and interfaces

Possible Approaches
- Extend existing high-level languages with new data-parallel array types: ease of programming
- Implement as a library so programmers can use it now; eventually fold into the base languages
- Build implementations with compelling performance: target GPUs and multi-core CPUs
- Create examples and applications: educate programmers, provide sample code

Challenges in Using GPUs
- Need a non-graphics interface: for more flexibility and less execution overhead
- Need native GPU support: replace the library with language built-ins
- Need to learn from users
- Retarget for multi-core

Research Issues
- Languages for mainstream parallel computing
- Compilation techniques for parallel programs
- Debugging and performance tuning of parallel programs
- Operating systems for parallel computing at all scales
- Computer architecture for mainstream parallel computing
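The library-first approach on the "Possible Approaches" slide can be made concrete with a toy data-parallel array type. `ParArray` and its methods are hypothetical names chosen for illustration, not any real system's API; the point is that programs express whole-array operations, so a backend could dispatch them to a GPU or multi-core CPU instead of the Python loops used here.

```python
class ParArray:
    """Toy data-parallel array type (illustrative only). Elementwise
    operations are whole-array, leaving the execution strategy to a
    backend rather than to the programmer's loops."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        # Elementwise apply; trivially parallel, no cross-element state.
        return ParArray(f(x) for x in self.data)

    def __add__(self, other):
        # Elementwise addition of two arrays of equal length.
        return ParArray(a + b for a, b in zip(self.data, other.data))

    def reduce(self, f, init):
        # Sequential here; an associative f admits a parallel tree reduction.
        acc = init
        for x in self.data:
            acc = f(acc, x)
        return acc

a = ParArray([1, 2, 3])
b = ParArray([10, 20, 30])
total = (a + b).reduce(lambda s, x: s + x, 0)
```

Shipping such a type as a library lets programmers adopt it immediately, and folding it into the base language later (as the slide suggests) changes only the spelling, not the programming model.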
