NFSv4 Replication for Grid Computing


Published on May 15, 2007

Author: peterhoneyman

Source: slideshare.net

Description

We develop a consistent mutable replication extension for NFSv4 tuned to meet the rigorous demands of large-scale data sharing in global collaborations. The system uses a hierarchical replication control protocol that dynamically elects a primary server at various granularities. Experimental evaluation indicates a substantial performance advantage over a single server system. With the introduction of the hierarchical replication control, the overhead of replication is negligible even when applications mostly write and replication servers are widely distributed.

NFSv4 Replication for Grid Computing

Peter Honeyman

Center for Information Technology Integration

University of Michigan, Ann Arbor

Acknowledgements

Joint work with Jiaying Zhang

UM CSE doctoral candidate

Defending later this month

Partially supported by

NSF/NMI GridNFS

DOE/SciDAC Petascale Data Storage Institute

Network Appliance, Inc.

IBM ARC

Outline

Background

Consistent replication

Fine-grained replication control

Hierarchical replication control

Evaluation

Durability revisited NEW!

Conclusion

Grid computing

Emerging global scientific collaborations require access to widely distributed data that is reliable, efficient, and convenient

GridFTP

Advantages

Automatic negotiation of TCP options

Parallel data transfer

Integrated Grid security

Easy to install and support across a broad range of platforms

Drawbacks

Data sharing requires manual synchronization

NFSv4

Advantages

Traditional, well-understood file system semantics

Supports multiple security mechanisms

Close-to-open consistency

Reader is guaranteed to see data written by the last writer to close the file

Drawbacks

Wide-area performance

NFSv4.r

Research prototype developed at CITI

Replicated file system built on NFSv4

Server-to-server replication control protocol

High performance data access

Conventional file system semantics

Replication in practice

Read-only replication

Clumsy manual release model

Lacks complex data sharing (concurrent writes)

Optimistic replication

Inconsistent consistency

Consistent replication

Problem: state of the practice in file system replication does not satisfy the requirements of global scientific collaborations

How can we provide Grid applications efficient and reliable data access?

Consistent replication

Design principles

Optimal read-only behavior

Performance must be identical to un-replicated local system

Concurrent write behavior

Ordered writes, i.e., one-copy serializability

Close-to-open semantics

Fine-grained replication control

The granularity of replication control is a single file or directory

Replication control: write open

When a client opens a file for writing, the selected server temporarily becomes the primary for that file

Other replication servers are instructed to forward client requests for that file to the primary if concurrent writes occur

Replication control: write

The primary server asynchronously distributes updates to the other servers during file modification

Replication control: close

When the file is closed and all replication servers are synchronized, the primary server notifies the other replication servers that it is no longer the primary server for the file
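To make the three slides above concrete, here is a minimal single-process sketch of the per-file replication control they describe. All class and method names (ReplicaServer, open_for_write, and so on) are illustrative assumptions, not identifiers from the NFSv4.r code, and peer communication is modeled as direct method calls rather than server-to-server RPCs.

```python
# Minimal, single-process sketch of the per-file replication control above.
# Names are illustrative (not from the NFSv4.r sources).

class FileState:
    def __init__(self):
        self.primary = None          # server currently controlling the file
        self.pending = []            # updates not yet flushed everywhere

class ReplicaServer:
    def __init__(self, name, peers=None):
        self.name = name
        self.peers = peers or []     # other replication servers
        self.files = {}              # path -> FileState

    def _state(self, path):
        return self.files.setdefault(path, FileState())

    def open_for_write(self, path):
        """Write-open: this server tries to become primary for the file."""
        state = self._state(path)
        if state.primary not in (None, self.name):
            return False             # forward to the existing primary instead
        state.primary = self.name
        for peer in self.peers:      # peers now forward writes for `path` here
            peer._state(path).primary = self.name
        return True

    def write(self, path, data):
        """Primary applies the write locally, then distributes it asynchronously."""
        state = self._state(path)
        assert state.primary == self.name, "forward this write to the primary"
        state.pending.append(data)
        for peer in self.peers:      # asynchronous in the real system
            peer._state(path).pending.append(data)

    def close(self, path):
        """After all replicas are synchronized, give up the primary role."""
        for server in [self] + self.peers:
            st = server._state(path)
            st.pending.clear()
            st.primary = None
```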

Directory updates

Prohibit concurrent updates

A replication server waits for the primary to relinquish its role

Atomicity for updates that involve multiple objects (e.g. rename)

A server must become primary for all objects

Updates are grouped and processed together
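As a rough illustration of the multi-object case, the hypothetical sketch below reuses the per-file control sketch above: a rename first becomes primary for every directory it touches, applies the grouped update, and releases control afterward. The function and helper names are assumptions, not NFSv4.r code.

```python
# Hypothetical sketch of an atomic multi-object update: become primary for
# every directory a rename touches, apply the grouped update, then release.

def rename(server, src_dir, dst_dir, name):
    targets = [src_dir, dst_dir]              # both directories are modified
    acquired = []
    try:
        for d in targets:
            if not server.open_for_write(d):  # per-object election, as sketched above
                raise RuntimeError(f"{d} is controlled by another server")
            acquired.append(d)
        # Grouped update: both directory changes are distributed together.
        server.write(src_dir, ("remove", name))
        server.write(dst_dir, ("add", name))
    finally:
        for d in acquired:
            server.close(d)
```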

Close-to-open semantics

Server becomes primary after it collects votes from a majority of replication servers

Use a majority consensus algorithm

Cost is dominated by the median RTT from the primary server to other replication servers

Primary server must ensure that every replication server has acknowledged its election when a written file is closed

Guarantees close-to-open semantics

Heuristic for file creation: allow a new file to inherit the primary server that controls its parent directory
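The majority-vote step described above can be pictured with a toy function like the one below. request_vote is an assumed stand-in for the real server-to-server RPC; in the actual protocol the requests go out in parallel, which is why the cost is dominated by the median round-trip time to the other replication servers.

```python
# Toy majority-consensus election for a single file or directory.
# `request_vote` is an assumed name, not an NFSv4.r call; the real system
# issues the requests in parallel rather than in a loop.

def elect_primary(candidate, servers, path, request_vote):
    votes = 1                                # the candidate votes for itself
    needed = len(servers) // 2 + 1           # strict majority of all replicas
    for server in servers:
        if server is candidate:
            continue
        if request_vote(server, candidate, path):
            votes += 1
        if votes >= needed:
            return True                      # candidate is now primary for `path`
    return False                             # conflict: back off and retry
```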

Durability guarantee

“Active view” update policy

Every server keeps track of the liveness of other servers (active view)

Primary server removes from its active view any server that fails to respond to its request

Primary server distributes updates synchronously and in parallel

Primary server acknowledges a client write after a majority of replication servers reply

Primary sends the other servers its active view when the file is closed

A failed replication server must synchronize with the up-to-date copy before it can rejoin the active group

I suppose this is expensive
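A sketch of the active-view write path described above, under assumed names: the primary counts acknowledgements from its active view, drops any server that fails to respond, and acknowledges the client only after a majority has replied. send_update is a stand-in for the real synchronous, parallel update RPC.

```python
# Sketch of the "active view" write path. `send_update` stands in for the
# synchronous, parallel update RPC; returning True means the server replied.

def replicate_write(primary, active_view, update, send_update):
    acks = 1                                  # the primary holds the update locally
    needed = len(active_view) // 2 + 1        # majority of the current active view
    for server in list(active_view):
        if server is primary:
            continue
        if send_update(server, update):
            acks += 1
        else:
            # A failed server is dropped from the active view and must
            # resynchronize with an up-to-date copy before rejoining.
            active_view.remove(server)
    return acks >= needed                     # acknowledge the client only if True
```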

What I skipped

Not the Right Stuff

GridFTP: manual synchronization

NFSv4.r: write-mostly WAN performance

AFS, Coda, et al.: sharing semantics

Consistent replication for Grid computing

Ordered writes too weak

Strict consistency too strong

Open-to-close just right

NFSv4.r in brief

View-based replication control protocol

Based on the (provably correct) protocol of El-Abbadi, Skeen, and Cristian

Dynamic election of primary server

At the granularity of a single file or directory

Majority consensus on open (for synchronization)

Synchronous updates to a majority (for durability)

Total consensus on close (for close-to-open)

Write-mostly WAN performance

Durability overhead

Synchronous updates

Synchronization overhead

Consensus management

Asynchronous updates

Consensus requirement delays client updates

Median RTT between the primary server and other replication servers is costly

Synchronous write performance is worse

Solution: asynchronous update

Let the application decide whether to wait for server recovery or regenerate the computation results

OK for Grid computations that checkpoint

Revisit at end with new ideas

Hierarchical replication control

Synchronization is costly over WAN

Hierarchical replication control

Amortizes consensus management

A primary server can assert control at different granularities

Shallow & deep control

A server with a shallow control on a file or directory is the primary server for that single object

A server with a deep control on a directory is the primary server for everything in the subtree rooted at that directory

Primary server election

Allow deep control for a directory D only if no descendant of D is controlled by another server

Grant a shallow control request for object L from peer server P if

L is not controlled by a server other than P

Grant a deep control request for directory D from peer server P if

D is not controlled by a server other than P

No descendant of D is controlled by a server other than P

Ancestry table

An ancestry entry has the following attributes:

id = unique identifier of the directory

array of counters = a set of counters recording which servers control the directory's descendants

(Diagram: a directory tree rooted at /root containing a, b, c, d1, d2, f1, and f2, with objects controlled by servers S0, S1, and S2, and the corresponding ancestry table of per-server counters.)
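One plausible reading of this slide is the data-structure sketch below: each directory keeps an array of counters, one per server, counting controlled descendants, so the shallow and deep grant rules from the previous slide reduce to checks on those counters. This is a guess at the structure, not the CITI implementation.

```python
# Guessed ancestry-table structure: per-directory counters of how many
# descendants each server controls, plus the shallow/deep grant checks.

class AncestryTable:
    def __init__(self, servers):
        self.servers = list(servers)
        self.counters = {}    # directory id -> {server: controlled-descendant count}
        self.owner = {}       # object id -> server holding (shallow or deep) control

    def _bump(self, ancestors, server, delta):
        for d in ancestors:   # ancestors: directory ids from the root to the parent
            row = self.counters.setdefault(d, {s: 0 for s in self.servers})
            row[server] += delta

    def grant_shallow(self, obj, ancestors, server):
        """Grant control of a single object unless another server holds it."""
        if self.owner.get(obj) not in (None, server):
            return False
        self.owner[obj] = server
        self._bump(ancestors, server, +1)
        return True

    def grant_deep(self, directory, ancestors, server):
        """Grant control of a subtree: no other server may control the
        directory itself or any of its descendants."""
        row = self.counters.get(directory, {})
        if any(n > 0 for s, n in row.items() if s != server):
            return False
        if self.owner.get(directory) not in (None, server):
            return False
        self.owner[directory] = server
        self._bump(ancestors, server, +1)
        return True

    def release(self, obj, ancestors, server):
        if self.owner.get(obj) == server:
            del self.owner[obj]
            self._bump(ancestors, server, -1)
```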

Primary election

(Diagram: servers S0, S1, and S2 concurrently request shallow and deep control of directories a, b, and c.)

S0 and S1 succeed in their primary server elections

S2’s election fails due to conflicts

Solution: S2 then retries by asking for shallow control of a

Performance vs. concurrency

Associate a timer with deep control

Reset the timer with subsequent updates

Release deep control when timer expires

A small timer value captures bursty updates

Issue a separate shallow control for a file written under a deep-controlled directory

Still process the write request immediately

Subsequent writes on the file do not reset the timer of the deep-controlled directory
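One way to picture the timer heuristic is the small sketch below; the timeout value and class name are assumptions, not tuned NFSv4.r parameters.

```python
import time

# Assumed timer heuristic for releasing deep control; the timeout value is
# illustrative, not a tuned NFSv4.r parameter.

DEEP_CONTROL_TIMEOUT = 0.5     # seconds: small enough to capture a burst of updates

class DeepControl:
    def __init__(self):
        self.expires = time.monotonic() + DEEP_CONTROL_TIMEOUT

    def touch(self):
        """Directory updates under the subtree push the expiry forward.
        Writes to a file get their own shallow control and do NOT call this,
        so they leave the directory timer alone."""
        self.expires = time.monotonic() + DEEP_CONTROL_TIMEOUT

    def expired(self):
        """When this turns True, the primary releases the deep control."""
        return time.monotonic() >= self.expires
```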

Performance vs. concurrency

Increase concurrency when the system consists of multiple writers

Send a revoke request upon concurrent writes

The primary server shortens the release timer

In single-writer cases, it is optimal to issue a deep control request for a directory that receives many updates

Single remote NFS (performance chart; N.B.: log scale)

Deep vs. shallow (chart: shallow controls vs. deep + shallow controls)

Deep control timer

Durability revisited

Synchronization is expensive, but …

When we abandon the durability guarantee, we risk losing the results of the computation

And may be forced to rerun it

But it might be worth it

Goal: maximize utilization

Utilization tradeoffs

Adding synchronous replication servers enhances durability

Which reduces the risk that results are lost

And that the computation must be restarted

Which benefits utilization

But increases run time

Which reduces utilization

Placement tradeoffs

Nearby replication servers reduce the replication penalty

Which benefits utilization

Nearby replication servers are more vulnerable to correlated failure

Which reduces utilization

Run-time model

Parameters

F: failure free, single server run time

C: replication overhead

R: recovery time

p_fail: probability of server failure

p_recover: probability of successful recovery

F: run time

Failure-free, single server run time

Can be estimated or measured

Our focus is on 1 to 10 days

C: replication overhead

Penalty associated with replication to backup servers

Proportional to RTT

Ratio can be measured by running with a backup server a few msec away

R: recovery time

Time to detect failure of the primary server and switch to a backup server

We assume R << F

Arbitrary realistic value: 10 minutes

Failure distributions

Estimated by analyzing PlanetLab ping data

716 nodes, 349 sites, 25 countries

All-pairs, 15 minute interval

From January 2004 to June 2005

692 nodes were alive throughout

We ascribe missing pings to node failure and network partition

PlanetLab failure CDF

Same-site correlated failures (table from the slide: probabilities of correlated failure for sites with 2 to 5 nodes)

Different-site correlated failures

Run-time model

Discrete event simulation yields expected run time E and utilization (F ÷ E)
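A rough Monte Carlo stand-in for the run-time model is sketched below, using the parameters F, C, R, and p_fail defined earlier. The failure and recovery semantics are simplified assumptions (no checkpointing, C treated as a fractional overhead, results surviving as long as one replica remains), so it approximates rather than reproduces the simulator behind the following plots.

```python
import random

# Rough Monte Carlo stand-in for the run-time model (assumed semantics, not
# the simulator behind the plots). Time advances in fixed steps; the primary
# may fail at each step. With a live backup the job fails over (cost R); with
# no replicas left, the results are lost and the job restarts from scratch.

def expected_run_time(F, C, R, p_fail, n_backups, trials=10_000, dt=1.0):
    """F: failure-free single-server run time; C: fractional replication
    overhead (assumed interpretation of the slides' C); R: recovery time;
    p_fail: per-time-unit failure probability. Returns (E, utilization)."""
    work_needed = F * (1 + C)
    total = 0.0
    for _ in range(trials):
        elapsed, done, backups = 0.0, 0.0, n_backups
        while done < work_needed:
            if random.random() < p_fail * dt:        # the current primary fails
                elapsed += R
                if backups > 0:
                    backups -= 1                     # fail over; results survive
                else:
                    done, backups = 0.0, n_backups   # results lost; rerun
            else:
                done += dt
            elapsed += dt
        total += elapsed
    E = total / trials
    return E, F / E                                  # utilization = F / E

# Example with made-up numbers: a one-day job (in hours), 5% replication
# penalty, 10-minute recovery, 0.1% failure probability per hour, one backup.
E, utilization = expected_run_time(F=24.0, C=0.05, R=10 / 60, p_fail=0.001, n_backups=1)
```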

Simulated utilization, F = one hour (charts: one backup server; four backup servers)

Simulation results, F = one day (charts: one backup server; four backup servers)

Simulation results, F = ten days (charts: one backup server; four backup servers)

Simulation results discussion

For long-running jobs

Replication improves utilization

Distant servers improve utilization

For short jobs

Replication does not improve utilization

In general, multiple backup servers don’t help much

Implications for checkpoint interval …

Checkpoint interval (charts: F = one day, one backup server, 20% checkpoint overhead; F = ten days, 2% checkpoint overhead, one backup server and four backup servers)

Next steps

Checkpoint overhead?

Replication overhead?

Depends on amount of computation

We measure < 10% for NAS Grid Benchmarks, which do no computation

Refine model

Account for other failures

Because they are common

Other model improvements

Conclusions

Conventional wisdom holds that consistent mutable replication in large-scale distributed systems is too expensive to consider

Our study proves otherwise

Conclusions

Consistent replication in large-scale distributed storage systems is feasible and practical

Superior performance

Rigorous adherence to conventional file system semantics

Improves cluster utilization

Thank you for your attention! www.citi.umich.edu Questions?
