Systems Support for Many Task Computing


Published on July 23, 2009

Author: ericvh

Source: slideshare.net

Description

A look at using aggregation as a first-class construct within operating systems to enable scaling of applications and services.

Systems Support for Many Task Computing: Holistic Aggregate Resource Environment (HARE)

Eric Van Hensbergen (IBM) and Ron Minnich (Sandia National Labs)

Motivation

Overview of Approach

Targeting Blue Gene/P

Provide a complementary runtime environment

Using Plan 9 Research Operating System

“Right Weight Kernel” - balances simplicity and function

Built from the ground up as a distributed system

Leverage HPC interconnects for system services

Distribute system services among compute nodes

Leverage aggregation as a first-class systems construct to help manage complexity and provide a foundation for scalability, reliability, and efficiency.

Related Work

Default Blue Gene runtime

Linux on I/O nodes + CNK on compute nodes

High Throughput Computing (HTC) Mode

Compute Node Linux

ZeptoOS

Kittyhawk

Foundation: Plan 9 Distributed System

Right Weight Kernel

General purpose multi-thread, multi-user environment

Pleasantly portable

Relatively Lightweight (compared to Linux)

Core Principles

All resources are synthetic file hierarchies

Local & remote resources accessed via simple API

Each thread can dynamically organize local and remote resources via dynamic private namespace
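To make that last point concrete, the sketch below shows how a process assembles its own view of the system in ordinary Plan 9 C. It is an illustration of the namespace mechanism only, not HARE code; the server address and mount point are invented.

```c
/* Sketch of per-process namespace assembly in Plan 9 C.
 * The file server address and mount point below are examples only. */
#include <u.h>
#include <libc.h>

void
main(void)
{
	int fd;

	rfork(RFNAMEG);		/* give this process its own private copy of the namespace */

	/* import a remote 9P file server (TCP port 564) under /n/remote */
	fd = dial("tcp!fs.example.com!564", 0, 0, 0);
	if(fd < 0)
		sysfatal("dial: %r");
	if(mount(fd, -1, "/n/remote", MREPL, "") < 0)
		sysfatal("mount: %r");

	/* union the remote binaries after the local ones in /bin */
	if(bind("/n/remote/bin", "/bin", MAFTER) < 0)
		sysfatal("bind: %r");

	exits(nil);
}
```

Because the namespace is private, the same paths can be bound to different resources in every process, which is what lets services be composed per application rather than per machine.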

Everything Represented as File Systems

[Diagram: everything in the system is exposed as a synthetic file hierarchy. Hardware devices: disk (/dev/hda1, /dev/hda2) and network (/dev/eth0). System services: the TCP/IP stack under /net (/arp, /udp, /tcp with /clone, /stats, and per-connection directories /0, /1, ... holding /ctl, /data, /listen, /local, /remote, /status) and DNS (/net/cs, /net/dns). Application services: the GUI under /win (/clone and per-window directories with /ctl, /data, /refresh). Other examples include the console, audio, wiki, authentication, service control, and process control/debug interfaces.]
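The network portion of that hierarchy is the canonical example of the idiom: a TCP connection is created and driven entirely through the files under /net/tcp. The sketch below shows the usual clone/ctl/data sequence in Plan 9 C; the peer address and port are placeholders.

```c
/* Sketch of "network as files": dial a TCP peer by hand through /net.
 * The peer address 192.0.2.1!7 (echo) is a placeholder. */
#include <u.h>
#include <libc.h>

void
main(void)
{
	char num[16], path[64], buf[512];
	int cfd, dfd, n;

	cfd = open("/net/tcp/clone", ORDWR);	/* allocates a fresh connection directory */
	if(cfd < 0)
		sysfatal("clone: %r");
	n = read(cfd, num, sizeof num - 1);	/* its name is the connection number */
	if(n <= 0)
		sysfatal("read clone: %r");
	num[n] = '\0';
	if(num[n-1] == '\n')
		num[n-1] = '\0';

	if(fprint(cfd, "connect 192.0.2.1!7") < 0)	/* control message on the clone/ctl fd */
		sysfatal("connect: %r");

	snprint(path, sizeof path, "/net/tcp/%s/data", num);
	dfd = open(path, ORDWR);		/* the data file is the byte stream */
	if(dfd < 0)
		sysfatal("data: %r");
	write(dfd, "hello\n", 6);
	n = read(dfd, buf, sizeof buf);
	if(n > 0)
		write(1, buf, n);
	exits(nil);
}
```

Since the interface is just files, the same /net can be exported over 9P and mounted by another node, which is how system services get distributed across the machine in this design.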

Plan 9 Networks

[Diagram: a Plan 9 installation spanning the Internet. Content-addressable storage and a file server sit on a high-bandwidth (10 GB/s) network, CPU servers on a LAN (1 GB/s) network, and terminals, PDAs, smartphones, set-top boxes, and screen phones connect over WiFi/EDGE and cable/DSL.]

An Issue of Scale

Chip: BG/P, 4-way
Compute Card: 2 chips
Node Card (4x4x2): 32 compute, 0-2 I/O cards
Rack: 32 node cards
System: 72 racks
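Reading that hierarchy as 32 compute nodes per node card, 32 node cards per rack, and 72 racks, the full system works out to roughly 32 × 32 × 72 = 73,728 compute nodes, on the order of 295,000 cores with 4-way chips; that is the scale the aggregation constructs below are meant to address.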

Aggregation as a First Class Concept

[Diagram: clients see a single aggregate service, composed by a proxy service from a local service and multiple remote services.]

Issues of Topology

File Cache Example

Proxy Service

Monitors access to remote file server & local resources

Local cache mode

Collaborative cache mode

Designated cache server(s)

Integrate replication and redundancy

Explore write coherence via “territories” à la Envoy

Based on experiences with Xget deployment model

Leverage natural topology of machine where possible.
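To make the local cache mode concrete, here is a minimal read-through block cache in Plan 9 C. It is an illustration only, not the HARE proxy service: blocks read from a (remote) path are remembered in a small in-memory table and served locally on repeat access; collaboration between nodes, replication, and write coherence are exactly the parts left out.

```c
/* Illustrative read-through block cache; not the HARE proxy service. */
#include <u.h>
#include <libc.h>

enum { Blksz = 8192, Nslots = 256 };

typedef struct Block Block;
struct Block {
	char	path[128];	/* file this block belongs to */
	vlong	off;		/* block-aligned offset */
	long	len;		/* valid bytes */
	int	valid;
	uchar	data[Blksz];
};

static Block cache[Nslots];

static uint
slot(char *path, vlong off)
{
	uint h;
	char *p;

	h = 0;
	for(p = path; *p; p++)
		h = h*31 + (uchar)*p;
	return (h ^ (uint)(off/Blksz)) % Nslots;
}

/* Read one block at a block-aligned offset, preferring the cache. */
long
cacheread(char *path, vlong off, uchar *buf)
{
	Block *b;
	int fd;

	b = &cache[slot(path, off)];
	if(b->valid && b->off == off && strcmp(b->path, path) == 0){
		memmove(buf, b->data, b->len);		/* cache hit: no remote I/O */
		return b->len;
	}
	if((fd = open(path, OREAD)) < 0)		/* miss: fetch from the file server */
		return -1;
	b->len = pread(fd, b->data, Blksz, off);
	close(fd);
	if(b->len < 0){
		b->valid = 0;
		return -1;
	}
	snprint(b->path, sizeof b->path, "%s", path);
	b->off = off;
	b->valid = 1;
	memmove(buf, b->data, b->len);
	return b->len;
}

void
main(int argc, char *argv[])
{
	uchar buf[Blksz];
	long n;

	if(argc != 2)
		sysfatal("usage: cacheread file");
	n = cacheread(argv[1], 0, buf);	/* first read goes to the server */
	print("first read: %ld bytes\n", n);
	n = cacheread(argv[1], 0, buf);	/* second read is served from the table */
	print("second read (cached): %ld bytes\n", n);
	exits(nil);
}
```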

Monitoring Example

Distribute monitoring throughout the system

Use for system health monitoring and load balancing

Allow for application-specific monitoring agents

Distribute filtering & control agents at key points in topology

Allow for localized monitoring and control as well as high-level global reporting and control

Explore both push and pull methods of modeling

Based on experiences with supermon system.
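As an illustration of the pull model, the sketch below polls a set of children's exported status files and reports one aggregate figure upward. The /n/node<i>/load layout is hypothetical; the real hierarchy, metrics, and push-mode variants would differ.

```c
/* Illustrative pull-model aggregator; paths are hypothetical.
 * Assumes each child node i exports a one-line load figure at /n/node<i>/load. */
#include <u.h>
#include <libc.h>

void
main(int argc, char *argv[])
{
	char path[64], buf[32];
	int i, n, fd, nnodes;
	double total;

	nnodes = argc > 1 ? atoi(argv[1]) : 4;
	total = 0.0;
	for(i = 0; i < nnodes; i++){
		snprint(path, sizeof path, "/n/node%d/load", i);
		if((fd = open(path, OREAD)) < 0)
			continue;		/* node missing: skip rather than fail */
		n = read(fd, buf, sizeof buf - 1);
		close(fd);
		if(n <= 0)
			continue;
		buf[n] = '\0';
		total += atof(buf);		/* pull one sample per child */
	}
	print("aggregate load over %d nodes: %g\n", nnodes, total);
	exits(nil);
}
```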

Workload Management Example

Provide file system interface to job execution and scheduling.

Allows scheduling of new work from within the cluster, using localized as well as global scheduling controls.

Can allow for more organic growth of workloads as well as top-down and bottom-up models.

Can be extended to allow direct access from end-user workstations.

Based on experiences with Xcpu mechanism.
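In the same spirit as Xcpu, launching work then looks like ordinary file I/O rather than a special RPC. The sketch below is hypothetical: the /n/cpu0 session layout and the "exec" control message are invented for illustration, but the clone-then-control pattern mirrors how such a file interface would be used.

```c
/* Illustrative job launch through a file interface, loosely in the spirit of Xcpu.
 * The /n/cpu0 paths and the control command syntax are hypothetical. */
#include <u.h>
#include <libc.h>

void
main(void)
{
	char dir[16], path[64], buf[256];
	int cfd, ofd, n;

	/* clone a new execution session on a (hypothetical) remote cpu service */
	if((cfd = open("/n/cpu0/clone", ORDWR)) < 0)
		sysfatal("clone: %r");
	n = read(cfd, dir, sizeof dir - 1);
	if(n <= 0)
		sysfatal("read session id: %r");
	dir[n] = '\0';

	/* ask the service to run a command in that session */
	fprint(cfd, "exec date");

	/* collect the job's output from the session's output file */
	snprint(path, sizeof path, "/n/cpu0/%s/stdout", dir);
	if((ofd = open(path, OREAD)) >= 0){
		while((n = read(ofd, buf, sizeof buf)) > 0)
			write(1, buf, n);
		close(ofd);
	}
	close(cfd);
	exits(nil);
}
```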

Status

Initial Port to BG/P 90% Complete

Applications

Linux emulation environment

CNK emulation environment

Native ports of applications

Also have a port of Inferno Virtual Machine to BG/P

Runs on Kittyhawk as well as natively

Baseline boot & runtime infrastructure complete

HARE Team

David Eckhardt (Carnegie Mellon University)

Charles Forsyth (Vitanuova)

Jim McKie (Bell Labs)

Ron Minnich (Sandia National Labs)

Eric Van Hensbergen (IBM Research)

Thanks

Funding

This material is based upon work supported by the Department of Energy under Award Number DE-FG02-08ER25851.

Resources

This work is being conducted on resources provided by the Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.

Information

The authors would also like to thank the IBM Research Blue Gene Team along with the IBM Research Kittyhawk team for their assistance.

Questions? Discussion?

Links

FastOS Web Site

http://www.cs.unm.edu/~fastos/

Phase II CFP

http://www.sc.doe.gov/grants/FAPN07-23.html

BlueGene

http://www.research.ibm.com/bluegene/

Plan 9

http://plan9.bell-labs.com/plan9

LibraryOS

http://www.research.ibm.com/prose

 

Plan 9 Characteristics

Kernel Breakdown - Lines of Code

Architecture Specific Code

BG/L: ~10,000 lines of code

Portable Code

Port: ~25,000 lines of code

TCP/IP Stack: ~14,000 lines of code

Binary Sizes

415k Text + 140k Data + 107k BSS

Runtime Memory Footprint

~4 MB for compute node kernels – could be smaller or larger depending on application-specific tuning.
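For reference, those segments sum to 415k + 140k + 107k ≈ 662 KB of static kernel image, leaving the balance of the ~4 MB footprint for runtime allocations such as buffers and per-process state.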

Why not Linux?

Not a distributed system

Core systems inflexible

VM based on x86 MMU

Networking tightly tied to sockets & TCP/IP, with a long call path

Typical installations extremely overweight and noisy

Benefits of modularity and open source are outweighed by complexity, dependencies, and a rapid rate of change

Community has become conservative

Support for alternative interfaces waning

Support for large systems that hurts small systems is not acceptable

Ultimately a customer constraint

FastOS was developed to prevent OS monoculture in HPC

Few Linux projects were even invited to submit final proposals

FTQ on BG/L IO Node running Linux

FTQ on BG/L IO Node Running Plan 9

Right Weight Kernels Project (Phase I)

Motivation

OS Effect on Applications

Metric is based on OS interference as measured by the fixed work quantum (FWQ) and fixed time quantum (FTQ) benchmarks; a minimal FTQ sketch appears at the end of this slide.

AIX/Linux has more capability than many apps need

LWK and CNK have less capability than apps want

Approach

Customize the kernel to the application

Ongoing Challenges

Need to balance capability with overhead
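For reference, the FTQ measurement itself is simple: do trivial work in fixed time quanta and record how much fits in each one; OS interference shows up as dips in the per-quantum counts (as in the two FTQ plots referenced earlier). The sketch below is an unofficial, minimal version of that loop in Plan 9 C, not the benchmark used for the project's numbers.

```c
/* Minimal fixed-time-quantum (FTQ) style loop: count work per quantum.
 * Dips in the printed counts indicate interference from the OS. */
#include <u.h>
#include <libc.h>

enum { Nsamples = 1000 };

void
main(void)
{
	vlong quantum, start, next;
	vlong count[Nsamples];
	volatile int sink;
	int i;

	quantum = 1000000LL;			/* 1 ms, in nanoseconds */
	sink = 0;
	start = nsec();
	for(i = 0; i < Nsamples; i++){
		next = start + (i+1)*quantum;
		count[i] = 0;
		while(nsec() < next){		/* do trivial work until the quantum ends */
			sink++;
			count[i]++;
		}
	}
	for(i = 0; i < Nsamples; i++)
		print("%d %lld\n", i, count[i]);
	exits(nil);
}
```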

Why Blue Gene?

Readily available large-scale cluster

Minimum allocation is 37 nodes

Easy to get 512 and 1024 node configurations

Up to 8192 nodes available upon request internally

FastOS will make 64k configuration available

DOE interest – Blue Gene was a specified target

Variety of interconnects allows exploration of alternatives

Embedded core design provides a simple architecture that is quick to port to and doesn't require heavyweight systems software management, device drivers, or firmware

Department of Energy FastOS CFP

aka: Operating and Runtime System for Extreme Scale Scientific Computation (DE-PS02-07ER07-23)

Goal: Stimulate R&D related to operating and runtime systems for petascale systems in the 2010 to 2015 time frame.

Expected Output: A unified operating and runtime system that could fully support and exploit petascale and beyond systems.

Near Term Hardware Targets: Blue Gene, Cray XD3, and HPCS machines.

Blue Gene Interconnects

3-Dimensional Torus
Interconnects all compute nodes (65,536)
Virtual cut-through hardware routing
1.4 Gb/s on all 12 node links (2.1 GB/s per node)
1 µs latency between nearest neighbors, 5 µs to the farthest
4 µs latency for one hop with MPI, 10 µs to the farthest
Communications backbone for computations
0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth

Global Tree
One-to-all broadcast functionality
Reduction operations functionality
2.8 Gb/s of bandwidth per link
Latency of one-way tree traversal: 2.5 µs
~23 TB/s total binary tree bandwidth (64k machine)
Interconnects all compute and I/O nodes (1024)

Ethernet
Incorporated into every node ASIC
Active in the I/O nodes (1:64)
All external communication (file I/O, control, user interaction, etc.)

Low Latency Global Barrier and Interrupt
Latency of round trip: 1.3 µs

Control Network
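A quick consistency check on the torus figures: 12 links × 1.4 Gb/s = 16.8 Gb/s per node, which is the quoted 2.1 GB/s once converted from bits to bytes (16.8 ÷ 8 = 2.1).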
