The Paxos Commit Algorithm

Information about the Paxos Commit algorithm

Published on May 25, 2007

Author: paolos84

Source: slideshare.net

Description

This is the presentation I used to give a seminar about the "Paxos Commit" algorithm. It is one of Leslie Lamport's works (in this case, a joint work between him and Jim Gray). You can find the original paper here:
http://research.microsoft.com/users/lamport/pubs/pubs.html#paxos-commit

Feel free to post comments ;)
Enjoy.

 

Agenda

Paxos Commit Algorithm: Overview

The participating processes

The resource managers

The leader

The acceptors

Paxos Commit Algorithm: the base version

Failure scenarios

Optimizations for Paxos Commit

Performance

Paxos Commit vs. Two-Phase Commit

Using a dynamic set of resource managers

Paxos Commit Algorithm: Overview

Paxos was applied to Transaction Commit by L. Lamport and Jim Gray in “Consensus on Transaction Commit”

One instance of Paxos (consensus algorithm) is executed for each resource manager, in order to agree upon a value (Prepared/Aborted) proposed by it

“Not-synchronous” commit algorithm

Fault-tolerant (unlike 2PC)

Intended to be used in systems where failures are fail-stop only, for both processes and network

Safety is guaranteed (unlike 3PC)

Formally specified and checked

Can be optimized to the theoretically best performance
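As a toy illustration of the overview above (my own sketch, not the paper's pseudocode): one Paxos instance runs per resource manager, and the transaction commits only if every instance decides Prepared.

```python
# Illustrative sketch: the global outcome of Paxos Commit, given the value
# decided by each resource manager's own Paxos instance.
# ('p' = Prepared, 'a' = Aborted; names here are mine, not the paper's.)

def global_decision(instance_outcomes):
    """instance_outcomes: dict mapping each RM to the value ('p' or 'a')
    decided by that RM's Paxos instance."""
    if all(v == "p" for v in instance_outcomes.values()):
        return "commit"
    return "abort"

print(global_decision({"RM1": "p", "RM2": "p", "RM3": "p"}))  # commit
print(global_decision({"RM1": "p", "RM2": "a", "RM3": "p"}))  # abort
```

A single Aborted (or undecided) instance is enough to abort the whole transaction, which is why the failure scenarios below matter.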

Participants: the resource managers

N resource managers (“RM”) execute the distributed transaction, then choose a value (“locally chosen value” or “LCV”; ‘p’ for prepared iff it is willing to commit)

Every RM tries to get its LCV accepted by a majority set of acceptors (“MS”: any subset with a cardinality strictly greater than half of the total).

Each RM is the first proposer in its own instance of Paxos

Participants: the leader

Coordinates the commit algorithm

All the instances of Paxos share the same leader

It is not a single point of failure (unlike 2PC)

Assumed to be always defined (realistic, since many leader-(s)election algorithms exist) and unique (not necessarily true; but unlike in 3PC, safety does not rely on it)

Participants: the acceptors

[Figure: five acceptors Acc1–Acc5; a majority set MS acts as a “consensus box”, running one Paxos instance per RM (RM1 proposes ‘a’, RM2 and RM3 propose ‘p’); each acceptor’s N×1 progress vector is merged into the aState table.]

A denotes the set of acceptors

All the instances of Paxos share the same set A of acceptors

2F+1 acceptors are involved in order to tolerate F failures

We will consider only F+1 acceptors, keeping F more as “spares” (less communication overhead)

Each acceptor keeps track of its own progress in an N×1 vector

Vectors need to be merged into an N×|MS| table, called aState, in order to take the global decision (we want “many” p’s)
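The merge step above can be sketched as follows (a hypothetical illustration; the function names and data layout are mine): a majority of acceptors report their N×1 vectors, and the columns are assembled into the aState table, one row per RM.

```python
# Sketch of assembling aState from the acceptors' N x 1 progress vectors.
# ('p' = Prepared, 'a' = Aborted.)

def is_majority(subset_size, total_acceptors):
    # A majority set MS is any subset strictly larger than half of all acceptors.
    return subset_size > total_acceptors / 2

def build_astate(acceptor_vectors):
    """acceptor_vectors: dict acceptor_id -> list of N entries, one per RM.
    Returns aState as N rows, each row holding one entry per reporting acceptor."""
    n = len(next(iter(acceptor_vectors.values())))
    return [[vec[rm] for vec in acceptor_vectors.values()] for rm in range(n)]

vectors = {"Acc1": ["p", "p", "a"], "Acc2": ["p", "p", "a"]}
assert is_majority(len(vectors), 3)   # 2 of 3 acceptors responded
astate = build_astate(vectors)        # rows indexed by RM
```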

Paxos Commit (base)

[Figure: message flow for N = 5 RMs and F = 2, with acceptors AC0–AC2, leader L, and writes to log marked at each step: RM0 sends 1 BeginCommit message to the leader; the leader sends (N−1) prepare messages to the other RMs; each RM sends its phase2a message ⟨rm, 0, v(rm)⟩ to the acceptors (N(F+1)−1 messages, plus optionally F more); the acceptors send their phase2b messages ⟨acc, rm, 0, v(rm)⟩ to the leader; if the Global Commit condition holds, the leader sends commit, else abort, as N phase 3 messages. The protocol is not blocked as long as a majority of the acceptors respond.]

Global Commit Condition

[The condition is evaluated over the phase2b messages ⟨acc, rm, b, p⟩ collected in the aState table.]

That is: there must be exactly one row for each RM involved in the commit; in each of those rows there must be at least F+1 entries that have ‘p’ as a value and refer to the same ballot
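The condition can be sketched directly from that statement (a minimal illustration, assuming each aState entry is a (value, ballot) pair taken from a phase2b message; the function names are mine):

```python
from collections import Counter

# Sketch of the Global Commit Condition: every RM's row in aState must
# contain at least F+1 entries that are 'p' for one and the same ballot.

def row_ok(row, F):
    """row: list of (value, ballot) phase2b entries for one RM."""
    ballots_with_p = Counter(b for v, b in row if v == "p")
    return any(count >= F + 1 for count in ballots_with_p.values())

def global_commit(astate, F):
    return all(row_ok(row, F) for row in astate)

F = 1
astate = [
    [("p", 0), ("p", 0), ("a", 0)],  # RM1: two 'p' at ballot 0 -> ok
    [("p", 0), ("p", 1), ("p", 1)],  # RM2: two 'p' at ballot 1 -> ok
]
assert global_commit(astate, F)
```

Note that ‘p’ entries for different ballots do not add up: the F+1 matching entries must refer to the same ballot.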

[T1] What if some RMs do not submit their LCV?

The leader runs phase 1 of the missing RM j’s instance with one majority of acceptors, using a ballot b_L1 > 0:

“prepare?” (p1a) — Leader: «Has resource manager j ever proposed you a value?» (the acceptors promise not to answer any lower ballot b_L2 < b_L1)

“promise” (p1b) — Acceptor i, case (1): «Yes, in my last session (ballot) b_i with it I accepted its proposal v_i»; case (2): «No, never»

If at least |MS| acceptors answered: if case (2) holds for ALL of them, then V = ‘a’ [FREE]; else V = v(maximum({b_i})) [FORCED]

“accept?” (p2a) — Leader: «I am j, I propose V»

[T2] What if the leader fails?

[Figure: two leaders L1 and L2 are alternately trusted and ignored by the majority set MS, issuing ever-increasing ballots b_1 > 0, b_2 > b_1, b_3 > b_2, b_4 > b_3 after each timeout T.]

If the leader fails, some leader-(s)election algorithm is executed. A faulty election (2 or more leaders) does not preclude safety (unlike in 3PC), but can impede progress…

Non-terminating example: infinite sequence of p1a-p1b-p2a messages from 2 leaders

Not really likely to happen

It can be avoided (e.g., with a random timeout T)

Optimizations for Paxos Commit (1)

[Figure: with co-location, acceptor ACi runs on the same node as RMi and the leader L on the same node as the initiating RM0, so the BeginCommit message and part of the traffic stay local; with the real-time assumption, the (N−1) prepare messages are not needed anymore.]

Co-Location: each acceptor is on the same node as a RM and the initiating RM is on the same node as the initial leader

Saves 1 message phase (BeginCommit) and F+2 messages

“Real-Time assumptions”: RMs can prepare spontaneously. The prepare phase is no longer needed; RMs just “know” they have to prepare within some amount of time

Saves 1 message phase (Prepare) and N−1 messages

Optimizations for Paxos Commit (2)

[Figure: the acceptors send their phase2b messages directly to the RMs, eliminating the leader’s phase 3 messages.]

Phase 3 elimination: the acceptors send their phase2b messages (the columns of aState) directly to the RMs, which evaluate the global commit condition

Paxos Commit + Phase 3 Elimination = Faster Paxos Commit (FPC)

FPC + Co-location + R.T.A. = Optimal Consensus Algorithm

Performance

                    |       2PC        |     Paxos Commit      |  Faster Paxos Commit
                    | No coloc.| Coloc.| No coloc.  |  Coloc.  | No coloc. |   Coloc.
Message delays*     |    4     |   3   |     5      |    4     |     4     |     3
Messages*           |   3N-1   |  3N-3 | NF+F+3N-1  | NF+3N-3  | 2NF+3N-1  | 2NF-2F+3N-3
Stable storage
  write delays**    |        2         |           2           |          2
Stable storage
  writes**          |       N+1        |         N+F+1         |        N+F+1

* Not assuming RMs’ concurrent preparation (slides-like scenario)
** Assuming RMs’ concurrent preparation (real-time constraints needed)

If we deploy only one acceptor for Paxos Commit (F=0), its fault tolerance and cost are the same as 2PC’s. Are they exactly the same protocol in that case?
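The message-count formulas from the table above make the F = 0 comparison concrete (a quick check, using the table’s formulas; the function names are mine):

```python
# With F = 0 (a single acceptor), Paxos Commit's message count collapses
# to 2PC's, both with and without co-location, per the table's formulas.

def msgs_2pc(N, coloc):
    return 3 * N - 3 if coloc else 3 * N - 1

def msgs_paxos_commit(N, F, coloc):
    return N * F + 3 * N - 3 if coloc else N * F + F + 3 * N - 1

for N in range(2, 10):
    assert msgs_paxos_commit(N, F=0, coloc=True) == msgs_2pc(N, coloc=True)
    assert msgs_paxos_commit(N, F=0, coloc=False) == msgs_2pc(N, coloc=False)
```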

Paxos Commit vs. 2PC

[Figure: two message timelines T1 and T2 between the TM, RM1 and the other RMs: the 2PC from Lamport and Gray’s paper vs. the 2PC from the slides of the course.]

Yes, but…

… two slightly different versions of 2PC!

Using a dynamic set of RMs

[Figure: RM1, RM2 and RM3 send join messages to the registrar REG; the registrar proposes the set RM1;RM2;RM3 in its own instance of Paxos, alongside the RMs’ instances (RM1 proposes ‘a’, RM2 and RM3 propose ‘p’), and the same majority set of acceptors decides both. The per-row condition is still evaluated over the phase2b messages ⟨acc, rm, b, p⟩.]

You add one process, the registrar, that acts just like another resource manager, except for the following:


RMs can join the transaction until the Commit Protocol begins

The global commit condition now holds on the set of resource managers proposed by the registrar and decided in its own instance of Paxos:
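With a registrar, the decision can be sketched in two steps (a hypothetical illustration; the names are mine): first fix the participant set, as decided in the registrar’s own Paxos instance, then require a satisfied row for exactly that set of RMs.

```python
# Sketch of the global commit condition with a dynamic participant set:
# the set of RMs is itself decided by the registrar's Paxos instance.

def global_commit_dynamic(decided_rms, prepared_rows):
    """decided_rms: set of RM ids decided in the registrar's instance.
    prepared_rows: dict rm_id -> True if that RM's aState row satisfies
    the per-row condition (>= F+1 'p' entries, same ballot)."""
    return all(prepared_rows.get(rm, False) for rm in decided_rms)

# Both registered RMs prepared -> commit:
assert global_commit_dynamic({"RM1", "RM2"}, {"RM1": True, "RM2": True})
# A registered RM with no satisfied row -> abort:
assert not global_commit_dynamic({"RM1", "RM2", "RM3"},
                                 {"RM1": True, "RM2": True})
```

Rows for RMs that joined too late (after the registrar’s set was decided) are simply ignored.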

Thank You! Questions?
