PPoPP2006


Published on September 18, 2007

Author: Malbern

Source: authorstream.com

Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links

Authors
- Ernie Chan and Robert van de Geijn, Department of Computer Sciences, The University of Texas at Austin
- William Gropp and Rajeev Thakur, Mathematics and Computer Science Division, Argonne National Laboratory

Outline
- Testbed Architecture
- Model of Parallel Computation
- Sending Simultaneously
- Collective Communication
- Generalized Algorithms
- Performance Results
- Conclusion

Testbed Architecture
- IBM Blue Gene/L
- 3D torus point-to-point interconnect network
- One rack: 1024 dual-processor nodes, organized as two 8 x 8 x 8 midplanes
- Special feature: a node can send simultaneously over multiple links, exercised through multiple calls to MPI_Isend

Model of Parallel Computation
- Target architectures: distributed-memory parallel architectures
- Indexing: p computational nodes, indexed 0 ... p - 1
- Logically fully connected: a node can send directly to any other node
- Topology: N-dimensional torus (figure: 16 numbered nodes on a 4 x 4 torus)
- Old model of communicating between nodes: unidirectional sending or receiving, simultaneous sending and receiving, or bidirectional exchange
- New model of communicating between nodes: a node can send to or receive from 2N other nodes simultaneously along its 2N different links, but it cannot perform a bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes
- Cost of communication: α + nβ, where α is the startup time (latency), n is the number of bytes communicated, and β is the transmission time per byte (the inverse of bandwidth)

Sending Simultaneously
- Old cost of communication with sends to multiple nodes: sending to m separate nodes costs (α + nβ) m
- New cost of communication with simultaneous sends: (α + nβ) m can be replaced with (α + nβ) + (α + nβ)(m - 1) τ, where (α + nβ) is the cost of one send, (α + nβ)(m - 1) τ is the cost of the extra sends, and 0 ≤ τ ≤ 1
- Equivalently, the cost of communication with simultaneous sends is (α + nβ)(1 + (m - 1) τ)
- Benchmarking sending simultaneously: log-log timing graphs on one midplane (512 nodes), sending simultaneously to 1 - 6 neighbors, message sizes from 8 bytes to 4 MB (figures: timing graphs; the MPI pattern behind the benchmark is sketched below)
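The measurement pattern can be sketched with standard MPI calls. This is a minimal illustration of the idea, not the authors' benchmark code: the function name, the neighbor array, and the reuse of one send buffer are our assumptions, and matching receives must be posted on the neighbor nodes.

    #include <mpi.h>

    /* Hypothetical helper: post m simultaneous sends with MPI_Isend,
       wait for all of them, and return the elapsed time in seconds.
       On Blue Gene/L's 3D torus, m ranges over 1..6 (the 2N links). */
    double timed_simultaneous_sends(void *buf, int n_bytes,
                                    const int *neighbors, int m,
                                    MPI_Comm comm)
    {
        MPI_Request reqs[6];
        double start = MPI_Wtime();
        for (int i = 0; i < m; i++)
            MPI_Isend(buf, n_bytes, MPI_BYTE, neighbors[i], 0,
                      comm, &reqs[i]);
        MPI_Waitall(m, reqs, MPI_STATUSES_IGNORE);
        return MPI_Wtime() - start;
    }

Fitting the measured ratio time(m) / time(1) to 1 + (m - 1) τ then gives an estimate of the extra-send penalty τ.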
Collective Communication
- Broadcast (Bcast), the motivating example (figure: before and after)
- Scatter (figure: before and after)
- Allgather (figure: before and after)
- Broadcast can be implemented as a Scatter followed by an Allgather (figure: Scatter, then Allgather)

Lower bounds, latency:
- Broadcast: log_{2N+1}(p) α
- Scatter: log_{2N+1}(p) α
- Allgather: log_{2N+1}(p) α

Lower bounds, bandwidth:
- Broadcast: nβ / (2N)
- Scatter: ((p - 1) / p) nβ / (2N)
- Allgather: ((p - 1) / p) nβ / (2N)

Generalized Algorithms
- Short-vector algorithms: Minimum-Spanning Tree (MST)
- Long-vector algorithms: Bucket algorithm

Minimum-Spanning Tree
- Recursively divide the network of nodes in half
- Cost of MST Bcast: log2(p) (α + nβ)
- What if a node can send to N nodes simultaneously? Divide the p nodes into N + 1 partitions (figure: disjoint partitions on an N-dimensional mesh, shown on the 4 x 4 torus)
- Divide the dimensions by a decrementing counter from N + 1 (figure: 4 x 4 torus)
- Now divide into 2N + 1 partitions (figure: 4 x 4 torus)
- Cost of the new generalized MST Bcast: log_{2N+1}(p) (α + nβ), which attains the lower bound for latency!
- MST Scatter: at each step, send only the data that must reside in the receiving partition
- Cost of the new generalized MST Scatter: log_{2N+1}(p) α + ((p - 1) / p) nβ / (2N), which attains the lower bounds for both latency and bandwidth! (A sketch of the MST recursion follows this list.)
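To make the recursion concrete, here is a minimal sketch of the classic two-way MST broadcast over the rank range [lo, hi]; the generalized algorithm replaces the two-way split with a (2N + 1)-way split so the root can feed 2N partition roots simultaneously. The function and variable names are ours, not the paper's.

    #include <mpi.h>

    /* Two-way MST broadcast over ranks lo..hi: at each step the current
       root sends the whole message to a root chosen for the other half,
       and every node then recurses into its own half. */
    void mst_bcast(void *buf, int n_bytes, int root,
                   int lo, int hi, MPI_Comm comm)
    {
        int me;
        MPI_Comm_rank(comm, &me);
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;                 /* halves [lo,mid], [mid+1,hi] */
            int new_root = (root <= mid) ? mid + 1 : lo;  /* root of the other half */
            if (me == root)
                MPI_Send(buf, n_bytes, MPI_BYTE, new_root, 0, comm);
            else if (me == new_root)
                MPI_Recv(buf, n_bytes, MPI_BYTE, root, 0, comm,
                         MPI_STATUS_IGNORE);
            if (me <= mid) {                              /* stay in the lower half */
                hi = mid;
                if (root > mid) root = new_root;
            } else {                                      /* stay in the upper half */
                lo = mid + 1;
                if (root <= mid) root = new_root;
            }
        }
    }

Each of the log2(p) steps costs α + nβ, reproducing the log2(p) (α + nβ) total quoted above; the (2N + 1)-way version reduces the step count to log_{2N+1}(p).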
Bucket Algorithm
- Pass data around a logical ring, sending messages of size n/p at each step (a one-dimensional sketch follows this list)
- Cost of the Bucket Allgather: (p - 1) α + ((p - 1) / p) nβ
- What if a node can send to N nodes simultaneously? Collect data around N buckets simultaneously (figure: 4 x 4 torus)
- However, a node cannot send to N neighbors at each step (figure: 4 x 4 torus)
- Assume that collecting data in the buckets is free in all but one dimension
- Let D be an N-ordered tuple representing the number of nodes in each dimension of the torus: |D| = N, the product D_1 D_2 ... D_N = p, and D_i > 1 for 0 < i ≤ N
- Cost of the new generalized Bucket Allgather: (d - N) α + ((D_j - 1) / D_j) nβ, where d = D_1 + ... + D_N and j is chosen so that D_j ≥ D_i for all i ≠ j
- A new generalized Bcast is derived as an MST Scatter followed by a Bucket Allgather
- Cost of the new long-vector Bcast: (log_{2N+1}(p) + d - N) α + ((p - 1) / (2Np) + (D_j - 1) / D_j) nβ
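The one-dimensional building block, a ring allgather, can be sketched as follows. The buffer layout and function name are our assumptions, not the paper's code. Note that each step sends to the right neighbor while receiving from the left one, which the communication model permits because the two transfers use different links.

    #include <mpi.h>

    /* Bucket (ring) allgather sketch: buf holds p blocks of blk bytes,
       with block `me` filled on entry; after p - 1 steps every node
       holds all p blocks. */
    void bucket_allgather(char *buf, int blk, MPI_Comm comm)
    {
        int me, p;
        MPI_Comm_rank(comm, &me);
        MPI_Comm_size(comm, &p);
        int right = (me + 1) % p, left = (me - 1 + p) % p;
        for (int s = 0; s < p - 1; s++) {
            int send_blk = (me - s + p) % p;       /* block forwarded this step */
            int recv_blk = (me - s - 1 + p) % p;   /* block arriving this step  */
            MPI_Sendrecv(buf + send_blk * blk, blk, MPI_BYTE, right, 0,
                         buf + recv_blk * blk, blk, MPI_BYTE, left, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }

The p - 1 steps of cost α + (n/p) β each reproduce the (p - 1) α + ((p - 1) / p) nβ total above.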

Performance Results
- Log-log timing graphs for the collective communication operations Broadcast, Scatter, and Allgather
- Algorithms compared: MST and Bucket
- Message sizes from 8 bytes to 4 MB
- (Figures: timing graphs, including single point-to-point communication and my-bcast-MST; the remaining result slides are untitled graphs)

Conclusion
- IBM Blue Gene/L supports the functionality of sending simultaneously; benchmarking along with model checking verifies this claim
- The new generalized algorithms show clear performance gains
- Future directions: room for optimization to reduce implementation overhead; what if the communicator is not MPI_COMM_WORLD?; a possible new algorithm for the Bucket Algorithm

Questions? echan@cs.utexas.edu
