Information about PrelimII

Published on October 7, 2007

Author: Cannes


A non-blocking coordinated checkpoint protocol
  PhD Candidate: Yijian Yang
  Committee: Dr. Yuan Shi (Chairman), Dr. Peter Hu, Dr. Pei Wang, Dr. Henry Sendaula

Outline:
  Overview
  Current solutions and their problems
  SPP solution
  Performance comparison
  Conclusion and future work

Overview:
  What is fault tolerance?
    A property that enables an application to continue operating in the event of multiple component failures in the multiprocessor system.
  Why do we need fault tolerance?
    More processors mean more points of failure.
    Low-cost, custom-assembled clusters imply a higher failure rate than custom multiprocessors.
    Application requirements far exceed the MTBF (mean time between failures).

Fault Tolerance Categories:
  Fault tolerance can be achieved in the following ways:
    Replicated system
    2PC transaction
    Group communication
    Rollback recovery
      Checkpoint based
      Log based

Computation Models:
  Current systems: MPI (based on message passing)
    Master and worker are contained in the same (stateful) program, which is distributed to all nodes.
    Application fault tolerance must involve all nodes.
  Proposed system: SPP (based on dataflow)
    Master (stateful) and worker (stateless) are separate programs.
    Only workers are automatically distributed to all nodes.
    Worker fault tolerance is done via low-cost shadow-tuples.
    Application fault tolerance needs only to protect the (stateful) master(s).

Proposed solution for SPP:
  A system-level, checkpoint-based, non-blocking coordinated protocol for multi-master protection.
  Checkpoint
    No need for detecting, logging or replaying events
    Doesn't rely on the PWD (piecewise deterministic) assumption
    Low overhead and fast recovery
  Coordinated
    Simplifies recovery
    Not susceptible to the domino effect
    One permanent checkpoint, no need for garbage collection
  Non-blocking
    Low overhead
  System level
    Transparent to the programmer and automatic

Current solution 1 - Blocking

Problem with blocking protocol:
  Assumes that the network is FIFO.
  Flushing the network before sending the CP (checkpoint) request takes time.
  Processes block while coordinating.
  On a failure, all processes have to roll back.

Current solution 2 - Non-blocking (Chandy and Lamport algorithm)

Problem with current non-blocking protocol:
  Message replay relies on the message sender, which requires all processes to roll back to their previous checkpoint when one process fails.

SPP Solution

SPP Solution - Synergy Implementation

SPP Solution - Synergy Implementation (cont)

Improvement:
  Non-blocking.
  SPP enables single-process rollback.

Performance Study:
  To examine the performance effects of blocking vs. non-blocking, we make the following assumptions:
    The time used for taking local checkpoints is constant.
    The workload of each processor remains the same regardless of cluster size.
    Both blocking and non-blocking protocols are implemented using the same underlying library.

Experiment Environment:
  Yoda cluster
    Sun Blade 100 workstations
    550 MHz
    512 MB SDRAM
    100 Mbps Ethernet
  Application
    Matrix multiplication
    G = 250 (near optimal)
    Each node will get 2 chunks (500) during the computation.

Near optimal G calculation

Result 1

Result 2

Conclusions:
  The proposed non-blocking protocol delivered much lower overhead than the blocking protocol.
  Better performance is expected during recovery (not yet tested).
  The overall fault tolerance overheads can be significantly lower than in MPI systems.

Future Work:
  Complete the implementation of SPP system-level checkpoint and recovery.
  Performance comparisons with MPI fault tolerance systems.
  Formal discussions.

Bibliography:
  Shi, Y. 2004. Stateless Parallel Processing.
  Chandy, K. 1985. Distributed snapshots: determining global states of distributed systems.
  Elnozahy, E. 2002. A survey of rollback-recovery protocols in message passing systems.
  Camille, C. 2006. Blocking vs. Non-blocking coordinated checkpointing for large-scale fault tolerant MPI.

Thanks!
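
Illustrative sketch (not from the slides): the slide for the Chandy and Lamport algorithm names it, but its diagram did not survive. The following is a minimal, single-threaded simulation of the textbook marker rule over FIFO channels, offered only as a sketch of how a non-blocking coordinated snapshot can work. It is not the SPP or Synergy implementation. The names Process, Network, MARKER and snapshot_total, and the balance-transfer scenario, are hypothetical and chosen for illustration.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical control message that delimits the snapshot


class Process:
    def __init__(self, pid, balance, peers):
        self.pid = pid
        self.balance = balance        # live application state
        self.peers = peers            # ids of all other processes
        self.recorded_state = None    # local checkpoint, once taken
        self.channel_state = {}       # sender id -> messages caught in flight
        self.recording = set()        # incoming channels still being recorded


class Network:
    """Fully connected processes joined by FIFO channels (an assumption)."""

    def __init__(self, balances):
        ids = list(balances)
        self.procs = {p: Process(p, b, [q for q in ids if q != p])
                      for p, b in balances.items()}
        self.channels = {(s, d): deque() for s in ids for d in ids if s != d}

    def send(self, src, dst, amount):
        # Application message: processes keep sending during a snapshot.
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, pid):
        self._record(pid)

    def _record(self, pid):
        # Save local state, then send a marker on every outgoing channel.
        p = self.procs[pid]
        p.recorded_state = p.balance
        p.recording = set(p.peers)
        p.channel_state = {q: [] for q in p.peers}
        for q in p.peers:
            self.channels[(pid, q)].append(MARKER)

    def deliver(self, src, dst):
        # Deliver the oldest message on the FIFO channel src -> dst.
        msg = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self._record(dst)      # first marker seen: checkpoint now
            p.recording.discard(src)   # this channel's state is now final
        else:
            p.balance += msg
            if src in p.recording:
                # Message was in flight when the snapshot cut was taken.
                p.channel_state[src].append(msg)


def snapshot_total(net):
    # Sum of recorded local states plus recorded in-flight messages.
    return sum(p.recorded_state + sum(sum(m) for m in p.channel_state.values())
               for p in net.procs.values())


net = Network({"A": 100, "B": 50})
net.send("A", "B", 10)       # 10 units are in flight from A to B
net.start_snapshot("B")      # B checkpoints without waiting for anyone
net.deliver("A", "B")        # in-flight message is recorded as channel state
net.deliver("B", "A")        # A sees B's marker and checkpoints
net.deliver("A", "B")        # B sees A's marker; snapshot is complete
print(snapshot_total(net))
```

In this scenario no process stops sending or receiving while the snapshot is taken, and the transfer that was in flight is captured as channel state rather than lost, so the recorded global state stays consistent with the total held by the live processes.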
