Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems

Published on October 20, 2008

Author: mjf7419

Source: slideshare.net

G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti§, Yoshio Turner

HP Labs

§ Currently at Meiosys, Inc.

Broad Opportunity for Checkpoint-Restart in Server Management

Fault tolerance (minimize unplanned downtime)

Recover by restarting from checkpoint

Minimize planned downtime

Migrate application before hardware/OS maintenance

Resource management

Manage resource allocation in shared computing environments by migrating applications

Need for General-Purpose Checkpoint-Restart

Existing checkpoint-restart methods are too limited:

No support for many OS resources that commercial applications use (e.g., sockets)

Limited to applications using specific libraries

Require application source and recompilation

Require use of specialized operating systems

Need a practical checkpoint-restart mechanism that is capable of supporting a broad class of applications

Cruz: Our Solution for General-Purpose Checkpoint-Restart on Linux

Application-transparent: supports applications without modifications or recompilation

Supports a broad class of applications (e.g., databases, parallel MPI apps, desktop apps)

Comprehensive support for user-level state, kernel-level state, and distributed computation and communication state

Supported on unmodified Linux base kernel – checkpoint-restart integrated via a kernel module

Cruz Overview

Builds on Columbia Univ.’s Zap process migration

Our Key Extensions

Support for migrating networked applications, transparent to communicating peers

Enables role in managing servers running commercial applications (e.g., databases)

General method for checkpoint-restart of TCP/IP-based distributed applications

Also enables efficiencies compared to library-specific approaches

Outline

Zap (Background)

Migrating Networked Applications

Network Address Migration

Communication State Checkpoint and Restore

Checkpoint-Restart of Distributed Applications

Evaluation

Related Work

Future Work

Summary

Zap (Background)

[Figure: layered diagram with the labels Applications, Pods, Zap, System calls, and Linux.]

Process migration mechanism

Kernel module implementation

Virtualization layer groups processes into Pods with private virtual name space

Intercepts system calls to expose only virtual identifiers (e.g., vpid)

Preserves resource names and dependencies across migration

Mechanism to checkpoint and restart pods

User and kernel-level state

Primarily uses system call handlers

File system not saved or restored (assumes a network file system)

Migrating Networked Applications

Migration must be transparent to remote peers to be useful in server management scenarios

Peers, including unmodified clients, must not perceive any change in the IP address of the application

Communication state of live connections must be preserved

No prior solution for these (including original Zap)

Our Solution:

Provide unique IP address to each pod that persists across migration

Checkpoint and restore the socket control state and socket data buffer state of all live sockets

Network Address Migration

[Figure: a host interface eth0 [IP-1, MAC-h1] with the pod's virtual interface eth0:1, a DHCP client inside the pod, and a DHCP server on the network. Numbered steps: 1. ioctl(); 2. MAC-p1; 3. dhcprequest(MAC-p1); 4. dhcpack(IP-p1).]

Pod attached to a virtual interface with its own IP and MAC address

Implemented by using Linux’s virtual interfaces (VIFs)

IP address assigned statically or through a DHCP client running inside the pod (using pod’s MAC address)

Intercept bind() & connect() to ensure pod processes use pod’s IP address

Migration: delete VIF on source host & create on new host

Migration is limited to hosts on the same subnet
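
The bind() and connect() interception listed above can be pictured as an address-rewriting step. What follows is a user-space sketch only, not Cruz's kernel module. The helper names `rewrite_bind` and `source_for_connect`, the sample addresses, and the policy of narrowing a wildcard bind to the pod's address are all assumptions made for illustration.

```python
from typing import Optional, Tuple

Address = Tuple[str, int]

WILDCARD = "0.0.0.0"  # INADDR_ANY in dotted-quad form


def rewrite_bind(requested: Address, pod_ip: str) -> Address:
    """Return the address a pod process would actually be bound to.

    Assumed policy: a wildcard bind is narrowed to the pod's own IP, and a
    bind to any other local address is rejected, so the pod never listens
    on the host's address.
    """
    host, port = requested
    if host in ("", WILDCARD):
        return (pod_ip, port)
    if host != pod_ip:
        raise PermissionError(f"pod may only bind to {pod_ip}, not {host}")
    return requested


def source_for_connect(bound: Optional[Address], pod_ip: str) -> Address:
    """Pick the local (source) address for an outgoing connection.

    An unbound socket gets the pod's IP with port 0, which asks the stack
    for an ephemeral port; an already bound socket keeps its address.
    """
    return bound if bound is not None else (pod_ip, 0)
```

For example, `rewrite_bind(("0.0.0.0", 80), "10.0.0.5")` yields `("10.0.0.5", 80)`, so remote peers only ever see the pod's address.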

Communication State Checkpoint and Restore

Communication state:

Control: Socket data structure, TCP connection state

Data: contents of send and receive socket buffers

Challenges in communication state checkpoint and restore:

Network stack will continue to execute even after application processes are stopped

No system call interface to read or write control state

No system call interface to read send socket buffers

No system call interface to write receive socket buffers

Consistency of control state and socket buffer state

Communication State Checkpoint

Note: Checkpoint does not change live communication state

[Figure: state for one socket, live versus checkpointed. Labels: control state (copied_seq, rcv_nxt, snd_una, write_seq; timers, options, etc.), receive buffers Rh..Rt read through receive(), and send buffers Sh..St read by direct access. In the checkpointed control state, Rh and St+1 are crossed out and replaced, leaving Rt+1 on the receive side and Sh on the send side.]

Acquire network stack locks to freeze TCP processing

Save receive buffers using socket receive system call in peek mode

Save send buffers by walking kernel structures

Copy control state from kernel structures

Modify two sequence numbers in saved state to reflect empty socket buffers

Indicate current send buffers not yet written by application

Indicate current receive buffers all consumed by application
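
Two of the steps above lend themselves to a small user-space sketch: reading receive data in peek mode, and rewriting the two saved sequence numbers. The class `TcpSeq` and both function names are hypothetical. The field names follow the labels on the slide's figure, and the comments describe them as they are commonly understood in the Linux TCP stack. Cruz does this inside the kernel; the code below only mirrors the idea.

```python
import socket
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TcpSeq:
    copied_seq: int  # next receive byte the application will read
    rcv_nxt: int     # next byte expected from the peer
    snd_una: int     # oldest sent byte not yet acknowledged
    write_seq: int   # next byte the application will write


def peek_receive_buffer(sock: socket.socket, limit: int = 65536) -> bytes:
    """Copy pending receive data without consuming it (MSG_PEEK)."""
    return sock.recv(limit, socket.MSG_PEEK)


def as_empty_buffers(live: TcpSeq) -> TcpSeq:
    """Return the sequence numbers to store in the checkpoint.

    Receive side: copied_seq is advanced to rcv_nxt, as if everything
    received had been consumed. Send side: write_seq is pulled back to
    snd_una, as if the buffered data had not been written yet. The live
    state is left untouched because a new object is returned.
    """
    return replace(live, copied_seq=live.rcv_nxt, write_seq=live.snd_una)
```

Because the saved control state describes empty buffers, the buffer contents saved alongside it can be replayed at restart without being counted twice.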

Communication State Restore

[Figure: restore for one socket. The checkpointed control state is copied to the new socket by direct update, the saved send buffers Sh..St are rewritten with write(), and the saved receive data Rh..Rt is delivered to the application by the intercepted receive system call.]

Create a new socket

Copy control state in checkpoint to socket structure

Restore checkpointed send buffer data using the socket write call

Deliver checkpointed receive buffer data to application on demand

Copy checkpointed receive buffer data to a special buffer

Intercept receive system call to deliver data from special buffer until buffer is emptied
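
The on-demand delivery of saved receive data can be sketched as a thin wrapper around a socket. This is an illustration in user space under our own naming (`RestoredSocket` is hypothetical); in Cruz the interception happens at the system call layer in the kernel module.

```python
import socket


class RestoredSocket:
    """Wrap a socket so saved receive data is delivered before new data."""

    def __init__(self, sock: socket.socket, saved_recv: bytes) -> None:
        self._sock = sock
        self._saved = bytearray(saved_recv)  # the "special buffer"

    def recv(self, bufsize: int) -> bytes:
        if self._saved:
            # Serve checkpointed data first, at most bufsize bytes per call.
            chunk = bytes(self._saved[:bufsize])
            del self._saved[:bufsize]
            return chunk
        # Special buffer is empty: fall through to the real socket.
        return self._sock.recv(bufsize)
```

Once the special buffer is drained, every call goes straight to the underlying socket, so the wrapper adds no further cost.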

Checkpoint-Restart of Distributed Applications

[Figure: three nodes, each running processes on top of a library and TCP/IP, connected by a communication channel, with a checkpoint label.]

State of processes and messages in channel must be checkpointed and restored consistently

Prior approaches specific to particular library – e.g., modify library to capture and restore messages in channel

Cruz preserves TCP connection state and IP addresses of each pod, implicitly preserving global communication state

Transparently supports TCP/IP-based distributed applications

Enables efficiencies compared to library-based implementations

Checkpoint-Restart of Distributed Applications in Cruz

[Figure: three nodes, each running a pod (processes) on top of a library and TCP/IP, connected by a communication channel, with a checkpoint label.]

Global communication state saved and restored by saving and restoring TCP communication state for each pod

Messages in flight need not be saved since the TCP state will trigger retransmission of these messages at restart

Eliminates O(N²) step to flush channel for capturing messages in flight

Eliminates need to re-establish connections at restart

Preserving pod’s IP address across restart eliminates need to re-discover process locations in library at restart

Consistent Checkpoint Algorithm in Cruz (Illustrative)

[Figure: message sequence between the Coordinator and the Agent on each node. The Coordinator sends <checkpoint>; each Agent disables pod communication (using netfilter rules in Linux), saves the pod state, and replies <done>. The Coordinator then sends <continue>; each Agent enables pod communication, resumes the pod, and replies <continue-done>.]

Algorithm has O(N) complexity (blocking algorithm shown for simplicity)

Can be extended to improve robustness and performance, e.g.:

Tolerate Agent & Coordinator failures

Overlap computation and checkpointing using copy-on-write

Allow nodes to continue without blocking for all nodes to complete checkpoint

Reduce checkpoint size with incremental checkpoints
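
The blocking algorithm on this slide can be simulated in a few lines to show why its message count grows linearly with the number of nodes. This is a sketch under assumptions: the `Agent` class and `coordinate` function are hypothetical, pod operations are only logged, and agents are called one after another, whereas real agents would work in parallel.

```python
from typing import List


class Agent:
    """Per-node agent; pod operations are recorded rather than performed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[str] = []

    def on_checkpoint(self) -> str:
        # Communication is disabled first so the saved state stays consistent.
        self.log += ["disable pod comm", "save pod state"]
        return "done"

    def on_continue(self) -> str:
        self.log += ["enable pod comm", "resume pod"]
        return "continue-done"


def coordinate(agents: List[Agent]) -> int:
    """Run one blocking checkpoint round and return the message count."""
    messages = 0
    # Phase 1: every agent must report <done> before anyone continues.
    replies = []
    for agent in agents:
        replies.append(agent.on_checkpoint())
        messages += 2  # <checkpoint> out, <done> back
    if any(reply != "done" for reply in replies):
        raise RuntimeError("checkpoint phase did not complete on all nodes")
    # Phase 2: release all pods.
    for agent in agents:
        if agent.on_continue() != "continue-done":
            raise RuntimeError(f"{agent.name} failed to resume")
        messages += 2  # <continue> out, <continue-done> back
    return messages
```

In this sketch a round costs four messages per node, so 8 nodes exchange 32 messages. Marking every directed channel among 8 fully connected processes would instead take 8 × 7 = 56 messages.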

Evaluation

Cruz implemented for Linux 2.4.x on x86

Functionality verified on several applications, e.g., MySQL, K Desktop Environment, and a multi-node MPI benchmark

Cruz incurs negligible runtime overhead (less than 0.5%)

Initial study shows performance overhead of coordinating checkpoints is negligible, suggesting the scheme is scalable

Performance Result: Negligible Coordination Overhead

Checkpoint behavior for Semi-Lagrangian atmospheric model benchmark in configurations from 2 to 8 nodes

Negligible latency in coordinating checkpoints (time spent in non-local operations) suggests scheme is scalable

Coordination latency of 400-500 microseconds is a small fraction of the overall checkpoint latency of about 1 second
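
To make "a small fraction" concrete, the two figures quoted above can be turned into a percentage. The function name is ours, and the 1 second checkpoint latency is the approximate value from the slide.

```python
def coordination_share(coordination_us: float, checkpoint_s: float) -> float:
    """Return coordination latency as a percentage of checkpoint latency."""
    return coordination_us / (checkpoint_s * 1_000_000) * 100


low = coordination_share(400, 1.0)
high = coordination_share(500, 1.0)
print(f"{low:.2f}% to {high:.2f}%")  # prints "0.04% to 0.05%"
```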

Related Work

MetaCluster product from Meiosys

Capabilities similar to Cruz (e.g., checkpoint and restart of unmodified distributed applications)

Berkeley Lab Checkpoint/Restart (BLCR)

Kernel-module based checkpoint-restart for single node

No identifier virtualization – restart will fail in the event of an identifier (e.g., pid) conflict

No support for handling communication state – relies on application or library changes

MPVM, CoCheck, LAM-MPI

Library-specific implementations of parallel application checkpoint-restart with disadvantages described earlier

Future Work

Many areas for future work, e.g.,

Improve portability across kernel versions by minimizing direct access to kernel structures

Recommend additional kernel interfaces when advantageous (e.g., accessing socket attributes)

Implement performance optimizations to the coordinated checkpoint-restart algorithm

Evaluate performance on a wide range of applications and cluster configurations

Support systems with newer interconnects and newer communication abstractions (e.g., InfiniBand, RDMA)

Summary

Cruz, a practical checkpoint-restart system for Linux

No change to applications or to base OS kernel needed

Novel mechanisms to support checkpoint-restart of a broader class of applications

Migrating networked applications transparent to communicating peers

Consistent checkpoint-restart of general TCP/IP-based distributed applications

Cruz’s broad capabilities will drive its use in solutions for fault tolerance, online OS maintenance, and resource management

http://www.hpl.hp.com/research/dca

Zap Virtualization

Groups processes into a POD (Process Domain) that has a private virtual namespace

Uses system call interception to expose only virtual identifiers (e.g., virtual pids, virtual IPC identifiers)

Virtual identifiers eliminate conflicts with identifiers already in use within the OS on the restarting node

All dependent processes (e.g., forked child processes) are assigned to same pod

Checkpoint and restart operate on an entire pod, which preserves resource dependencies across checkpoint and restart
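
The identifier virtualization described above amounts to a per-pod translation table. The sketch below shows the idea for pids only. The `Pod` class and its methods are hypothetical names, and the real mechanism lives in system call interception rather than in a user-space object.

```python
import itertools
from typing import Dict


class Pod:
    """Private virtual pid namespace for one group of processes."""

    def __init__(self) -> None:
        self._next_vpid = itertools.count(1)
        self._v2r: Dict[int, int] = {}  # virtual pid -> real pid
        self._r2v: Dict[int, int] = {}  # real pid -> virtual pid

    def add(self, real_pid: int) -> int:
        """Register a process and hand out its virtual pid."""
        vpid = next(self._next_vpid)
        self._v2r[vpid] = real_pid
        self._r2v[real_pid] = vpid
        return vpid

    def getpid(self, real_pid: int) -> int:
        """What an intercepted getpid() would report to the process."""
        return self._r2v[real_pid]

    def rebind(self, vpid: int, new_real_pid: int) -> None:
        """After restart, point an existing vpid at a new real pid."""
        old_real_pid = self._v2r[vpid]
        del self._r2v[old_real_pid]
        self._v2r[vpid] = new_real_pid
        self._r2v[new_real_pid] = vpid
```

A process registered as vpid 1 keeps seeing pid 1 after `rebind`, whatever real pid the restarting node assigns, which is why a pid already in use on that node causes no conflict.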

Zap Checkpoint and Restart

Checkpoint:

Stops all processes in pod with SIGSTOP

Parent-child relationships saved from /proc

State of each process is captured by accessing system call handlers and kernel data structures

Restart:

Original forest of processes recreated in a new pod by forking recursively

Each process restores most of its resources using system calls (e.g., open files)

Kernel module restores sharing relationships (e.g., shared file descriptors) and other key resources (e.g., socket state)
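
Recreating the process forest by forking recursively can be sketched as a depth-first walk over the saved parent-child relationships. The function name and the dictionary layout are assumptions. No real fork() is issued here; each would-be fork is only recorded, so the order can be inspected.

```python
from typing import Dict, List, Optional, Tuple

Fork = Tuple[Optional[int], int]  # (parent vpid or None for a root, child vpid)


def recreate_forest(children: Dict[int, List[int]], roots: List[int]) -> List[Fork]:
    """Return the order in which processes would be forked at restart.

    `children` maps each saved vpid to the vpids of its children.
    """
    order: List[Fork] = []

    def spawn(parent: Optional[int], vpid: int) -> None:
        order.append((parent, vpid))      # stand-in for fork()
        for child in children.get(vpid, []):
            spawn(vpid, child)            # each process recreates its own children

    for root in roots:
        spawn(None, root)
    return order
```

With `children = {1: [2, 3], 2: [4]}` and `roots = [1]`, the result is `[(None, 1), (1, 2), (2, 4), (1, 3)]`: every process is created by its original parent, which preserves the saved relationships.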

Performance Result: Impact of Dropping Packets at Checkpoint

Benchmark streaming data at maximum rate over a GigE link between 2 nodes

Shows that TCP recovers peak throughput in 100 ms

Will be overshadowed by checkpoint latency in real applications

Optimizations can overlap TCP recovery entirely with checkpointing
