Parallel and Distributed Computing on Low Latency Clusters


Published on November 19, 2009

Author: ProjectSymphony

Source: slideshare.net

Description

Slides from the thesis defence in Chicago by Vittorio Giovara.

Parallel and Distributed Computing on Low Latency Clusters. Vittorio Giovara, M.S. in Electrical Engineering and Computer Science, University of Illinois at Chicago, May 2009

Contents • Motivation • Application • Strategy • Technologies (OpenMP, MPI, Infiniband) • Compiler Optimizations • OpenMP and MPI over Infiniband • Results • Conclusions

Motivation

Motivation • The scaling trend has to stop for CMOS technology: ✓ Direct-tunneling limit in SiO2 at ~3 nm ✓ Distance between Si atoms ~0.3 nm ✓ Process variability • Fundamental reason: rising fabrication cost

Motivation • Building multi-core processors is easy • Modifying and adapting software for concurrency still requires human effort • A new classification for computer architectures is needed

Classification • SISD: single instruction stream, single data stream • SIMD: single instruction, multiple data • MISD: multiple instruction, single data • MIMD: multiple instruction, multiple data

[Diagram: abstraction levels, from algorithm level down through loop level to process management; the higher the abstraction level, the easier to parallelize]

Levels • algorithm level: recursion, memory management, profiling • loop level: data dependency, branching overhead, control flow • process management: SMP, multiprogramming, multithreading and scheduling

Backfire • Difficulty in fully exploiting the parallelism offered • Automatic tools are required to adapt software to parallelism • Compiler support for manual or semi-automatic enhancement

Applications • OpenMP and MPI are two popular tools that simplify parallelizing both new and old software • Mathematics and Physics • Computer Science • Biomedicine

Specific Problem and Background • Sally3D is a micromagnetism program suite for field analysis and modeling developed at Politecnico di Torino (Department of Electrical Engineering) • Computationally intensive (runs can take days of CPU time); speedup required • Previous work did not fully address the problem (no Infiniband or OpenMP+MPI solutions)

Strategy

Strategy • Install a Linux kernel with an ad-hoc configuration for scientific computation • Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards) • Add the Infiniband link among clusters, with proper drivers in kernel and user space • Select an MPI implementation library

Strategy • Verify the Infiniband network with some MPI test examples • Install the target software • Insert OpenMP and MPI directives into the code • Run test cases

OpenMP • standard • supported by most modern compilers • requires little knowledge of the software • very simple constructs

OpenMP - example [Diagram: fork-join model; the master thread forks Parallel Tasks 1-4 onto Thread A and Thread B, which join back into the master thread]

OpenMP Scheduler • Which scheduler is best suited to the hardware? - Static - Dynamic - Guided

OpenMP Scheduler [Chart: static scheduler, execution time in microseconds vs. number of threads (1-16), for chunk sizes 1, 10, 100, 1000, 10000]

OpenMP Scheduler [Chart: dynamic scheduler, same axes and chunk sizes]

OpenMP Scheduler [Chart: guided scheduler, same axes and chunk sizes]

OpenMP Scheduler [Comparison chart: static scheduler vs. dynamic scheduler vs. guided scheduler]

MPI • standard • widely used in cluster environments • many transport links supported • different implementations available - OpenMPI - MVAPICH

Infiniband • standard • widely used in cluster environments • very low latency for small packets • up to 16 Gb/s transfer speed

MPI over Infiniband [Chart: latency in µs (log scale) vs. message size from 1 kB into the GB range; OpenMPI vs. MVAPICH2]

MPI over Infiniband [Chart: latency in µs (log scale) vs. message size from 1 kB into the MB range; OpenMPI vs. MVAPICH2]

Optimizations • Enabled at compile time • Available only after porting the software to standard FORTRAN • Consistent documentation available • Unexpectedly large gains

Optimizations • -march=native • -O3 • -ffast-math • -Wl,-O1

Target Software

Target Software • Sally3D • micromagnetic equation solver • written in FORTRAN with some C libraries • the program uses a linear formulation of the mathematical models

Implementation Scheme [Diagram: a sequential loop under the standard programming model becomes a parallel loop across OpenMP threads, then a distributed loop in which Host 1 and Host 2 each run OpenMP threads and communicate via MPI]

Implementation Scheme • Data structure: not embarrassingly parallel • Three-dimensional matrix • Several temporary arrays – synchronization objects required ➡ send() and recv() mechanism ➡ critical regions using OpenMP directives ➡ function merging ➡ matrix conversion

Results

Results (* = feature enabled, - = disabled)

OMP  MPI  OPT   seconds
 *    *    *        133
 *    *    -        400
 *    -    *        186
 *    -    -        487
 -    *    *        200
 -    *    -        792
 -    -    *        246
 -    -    -       1062

Total Speed Increase: 87.52%
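The headline figure can be checked against the first and last rows of the table (the slide reports 87.52%):

```latex
\text{speedup} = \frac{T_{\text{baseline}}}{T_{\text{OMP+MPI+OPT}}}
             = \frac{1062\ \text{s}}{133\ \text{s}} \approx 8.0\times,
\qquad
1 - \frac{133}{1062} \approx 87.5\%\ \text{time reduction.}
```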

Actual Results (* = feature enabled, - = disabled)

OMP  MPI   seconds
 *    *         59
 *    -        129
 -    *        174
 -    -        249

Function Name    Normal   OpenMP   MPI      OpenMP+MPI
calc_intmudua    24.5 s   4.7 s    14.4 s   2.8 s
calc_hdmg_tet    16.9 s   3.0 s    10.8 s   1.7 s
calc_mudua       12.1 s   1.9 s    7.0 s    1.1 s
campo_effettivo  17.7 s   4.5 s    9.9 s    2.3 s

Actual Results • OpenMP: 6-8x • MPI: 2x • OpenMP + MPI: 14-16x • Total Raw Speed Increment: 76%

Conclusions

Conclusions and Future Work • Computational time has been significantly decreased • Speedup is consistent with expected results • Submitted to COMPUMAG ‘09 • Continue inserting OpenMP and MPI directives • Perform algorithm optimizations • Increase cluster size
