Information about PrelimII

Published on October 7, 2007

Author: Cannes


A non-blocking coordinated checkpoint protocol

PhD Candidate: Yijian Yang
Committee: Dr. Yuan Shi (Chairman), Dr. Peter Hu, Dr. Pei Wang, Dr. Henry Sendaula

Outline:
- Overview
- Current solutions and their problems
- SPP solution
- Performance comparison
- Conclusion and future work

Overview:
What is fault tolerance?
- A property that enables an application to continue operating in the event of multiple component failures in the multiprocessor system.
Why do we need fault tolerance?
- More processors means more points of failure.
- Low-cost, custom-assembled clusters imply a higher failure rate than custom multiprocessors.
- Application run-time requirements far exceed the mean time between failures (MTBF).

Fault Tolerance Categories:
Fault tolerance can be achieved in the following ways:
- Replicated system
- 2PC transaction
- Group communication
- Rollback recovery
  - Checkpoint based
  - Log based

Computation Models:
Current systems: MPI (based on message passing)
- Master and worker are contained in the same (stateful) program, which is distributed to all nodes.
- Application fault tolerance must involve all nodes.
Proposed system: SPP (based on dataflow)
- Master (stateful) and worker (stateless) are separate programs.
- Only workers are automatically distributed to all nodes.
- Worker fault tolerance is done via low-cost shadow-tuples.
- Application fault tolerance needs only to protect the (stateful) master(s).

Proposed solution for SPP:
A system-level, checkpoint-based, non-blocking coordinated protocol for multi-master protection.
- Checkpoint
  - No need for detecting, logging or replaying events
  - Doesn't rely on the piecewise deterministic (PWD) assumption
  - Low overhead and fast recovery
- Coordinated
  - Simplifies recovery
  - Not susceptible to the domino effect
  - One permanent checkpoint, no need for garbage collection
- Non-blocking
  - Low overhead
- System level
  - Transparent to the programmer and automatic

Current solution 1 - Blocking:
[figure]

Problem with blocking protocol:
- Assumes that the network is FIFO.
- Flushing the network before sending the checkpoint (CP) request takes time.
- Processes block while coordinating.
- On failure, all processes have to roll back.

Current solution 2 - Non-blocking (Chandy and Lamport algorithm):
[figure]

Problem with current non-blocking protocol:
- Message replay relies on the message sender, so all processes must roll back to their previous checkpoint when one process fails.

SPP Solution:
[figure]

SPP Solution - Synergy Implementation:
[figure]

SPP Solution - Synergy Implementation (cont):
[figure]

Improvement:
- Non-blocking.
- SPP enables single-process rollback.

Performance Study:
To examine the performance effects of blocking vs. non-blocking, we make the following assumptions:
- The time used for taking a local checkpoint is constant.
- The workload per processor stays the same across cluster sizes.
- Both the blocking and the non-blocking protocol are implemented using the same underlying library.

Experiment Environment:
Yoda cluster
- Sun Blade 100 workstations
- 550 MHz
- 512 MB SDRAM
- 100 Mbps Ethernet
Application
- Matrix multiplication
- G = 250 (near optimal)
- Each node gets 2 chunks (500) during the computation.

Near optimal G calculation:
[figure]

Result 1:
[figure]

Result 2:
[figure]

Conclusions:
- The proposed non-blocking protocol delivered much lower overhead than the blocking protocol.
- Better performance is expected during recovery (not tested yet).
- The overall fault tolerance overhead can be significantly lower than in MPI systems.

Future Work:
- Complete the implementation of SPP system-level checkpoint and recovery.
- Performance comparisons with MPI fault tolerance systems.
- Formal discussions.

Bibliography:
- Shi, Y. (2004). Stateless Parallel Processing.
- Chandy, K. (1985). Distributed snapshots: determining global states of distributed systems.
- Elnozahy, E. (2002). A survey of rollback-recovery protocols in message passing systems.
- Camille, C. (2006). Blocking vs. Non-blocking coordinated checkpointing for large-scale fault tolerant MPI.

Thanks!
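
For readers unfamiliar with the Chandy and Lamport algorithm named in "Current solution 2", the following is a minimal Python sketch of the textbook marker-based snapshot, not of the SPP or Synergy implementation described in the slides. It assumes reliable FIFO channels, and every name in it (`Process`, `MARKER`, `run_demo`, the account-balance state) is hypothetical and chosen only for illustration. The point it shows is that application messages keep flowing while the checkpoint is taken, and that messages still in flight are captured as channel state.

```python
from collections import deque

MARKER = "MARKER"  # control message separating pre- and post-checkpoint traffic


class Process:
    """One node holding a numeric state (a balance) and FIFO inbound channels."""

    def __init__(self, pid, balance, peers):
        self.pid = pid
        self.balance = balance
        self.inbox = {p: deque() for p in peers}  # one FIFO channel per sender
        self.saved_state = None   # local checkpoint, set once per snapshot
        self.recording = {}       # inbound channels still being recorded
        self.channel_state = {}   # finished recordings: in-flight messages

    def send(self, net, dest, amount):
        self.balance -= amount
        net[dest].inbox[self.pid].append(amount)

    def checkpoint(self, net):
        """Save local state, start recording inbound channels, send markers."""
        self.saved_state = self.balance
        self.recording = {p: [] for p in self.inbox}
        for p in self.inbox:
            net[p].inbox[self.pid].append(MARKER)

    def receive(self, net, src):
        msg = self.inbox[src].popleft()
        if msg == MARKER:
            if self.saved_state is None:
                self.checkpoint(net)  # first marker seen: checkpoint now
            # The marker closes the recording for the channel it arrived on.
            self.channel_state[src] = self.recording.pop(src)
        else:
            self.balance += msg  # application work continues during the snapshot
            if src in self.recording:
                self.recording[src].append(msg)  # message was in flight


def run_demo():
    pids = ["A", "B", "C"]
    net = {p: Process(p, 100, [q for q in pids if q != p]) for p in pids}
    net["A"].send(net, "B", 10)   # in flight before the snapshot starts
    net["A"].checkpoint(net)      # A initiates; no process is blocked
    net["C"].send(net, "A", 20)   # sent before C has seen any marker
    deliveries = [("B", "A"), ("B", "A"), ("A", "C"), ("C", "A"),
                  ("A", "B"), ("A", "C"), ("B", "C"), ("C", "B")]
    for dest, src in deliveries:
        net[dest].receive(net, src)
    return net
```

In this run the saved local states (90, 110 and 80) plus the one recorded in-flight message (20, on the channel from C to A) add up to the initial total of 300, which is what makes the snapshot consistent even though no process stopped sending.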
