Performance and Availability Tradeoffs in Replicated File Systems

Published on May 23, 2008

Author: peterhoneyman

Source: slideshare.net

Description

Presented at Resilience 2008: Workshop on Resiliency in High Performance Computing (Lyon, May 22, 2008)

Performance and Availability Tradeoffs in Replicated File Systems
Peter Honeyman
Center for Information Technology Integration
University of Michigan, Ann Arbor

Acknowledgements
• Joint work with Dr. Jiaying Zhang (now at Google); this was a chapter of her dissertation
• Partially supported by:
  • NSF/NMI GridNFS
  • DOE/SciDAC Petascale Data Storage Institute
  • NetApp
  • IBM ARC

Storage replication: advantages ☺
• Scalability
• Reliability
• Read performance

Storage replication: disadvantages ☹
• Complex synchronization protocols
• Concurrency
• Durability
• Write performance

Durability
• If we weaken the durability guarantee, we may lose data ...
• And be forced to restart the computation
• But it might be worth it

Utilization tradeoffs
• Adding replication servers enhances durability
  • Reduces the risk that the computation must be restarted
  • Increases utilization ☺
• Replication increases run time
  • Reduces utilization ☹

Placement tradeoffs
• Nearby replication servers reduce the replication penalty
  • Increases utilization ☺
• Nearby replication servers are vulnerable to correlated failure
  • Reduces utilization ☹

Run-time model
[State diagram: start → run → end; a failure during the run triggers recovery, which either succeeds (ok, resume run) or fails]

Parameters
• Failure-free, single-server run time
  • Can be estimated or measured
  • Our focus is on 1 to 10 days

Parameters
• Replication overhead (see the sketch below)
  • Penalty associated with replication to backup servers
  • Proportional to RTT
  • Ratio can be measured by running with a backup server a few msec away
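A minimal sketch of this parameter as the simulation might consume it: the failure-free run time is inflated by a penalty proportional to the RTT to the backup and to how write-heavy the job is. The function name and the constant `alpha` are illustrative assumptions; the talk obtains the ratio by measurement, not from a formula.

```python
# Hypothetical model of the replication penalty, assuming overhead grows
# linearly with RTT and with write intensity.  'alpha' is an assumed
# proportionality constant; in the talk it is measured by running with a
# backup server a few msec away.

def replicated_run_time(base_run_time: float, rtt_ms: float,
                        write_intensity: float, alpha: float = 0.01) -> float:
    """Failure-free run time when writes are replicated synchronously.

    base_run_time   -- failure-free, single-server run time
    rtt_ms          -- round-trip time to the backup server (ms)
    write_intensity -- fraction of run time spent writing
    """
    return base_run_time * (1.0 + alpha * rtt_ms * write_intensity)
```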

Parameters
• Recovery time
  • Time to detect failure of the primary server and switch to a backup server
  • Not a sensitive parameter

Parameters
• Probability distribution functions
  • Server failure
  • Successful recovery

Server failure
• Estimated by analyzing PlanetLab ping data
  • 716 nodes, 349 sites, 25 countries
  • All-pairs, 15-minute interval, 1/04 to 6/05
  • 692 nodes were alive throughout
• We ascribe missing pings to node failure and network partition
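A sketch of how failure intervals could be extracted from such a trace, assuming per-node samples of (timestamp, reachable) at 15-minute spacing. The data layout and function are illustrative, not the authors' actual pipeline; as in the talk, missing pings are ascribed to node failure or network partition.

```python
# Illustrative extraction of downtime intervals from a ping trace.
from datetime import timedelta

SAMPLE_INTERVAL = timedelta(minutes=15)

def downtime_intervals(samples):
    """Yield (start, end) spans where a node missed consecutive pings.

    samples -- list of (timestamp, reachable) tuples, 15 minutes apart
    """
    down_start = None
    for ts, reachable in samples:
        if not reachable and down_start is None:
            down_start = ts                     # failure (or partition) begins
        elif reachable and down_start is not None:
            yield (down_start, ts)              # node answers again
            down_start = None
    if down_start is not None:                  # still down at end of trace
        yield (down_start, samples[-1][0] + SAMPLE_INTERVAL)
```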

PlanetLab failure
[Figure: cumulative failure distribution, log-linear scale]

Correlated failures: P(n nodes down | 1 node down)

failed nodes (n)   2 nodes/site   3 nodes/site   4 nodes/site   5 nodes/site
        2             0.526          0.593          0.552          0.561
        3                            0.546          0.440          0.538
        4                                           0.378          0.488
        5                                                          0.488
number of sites        259            65             21             11
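For concreteness, a sketch of how a conditional probability like the entries above could be computed from the trace, assuming per-timestep snapshots of which nodes at a site are down. The snapshot format is an assumption for illustration.

```python
# Illustrative estimate of P(at least n nodes down | at least 1 node down)
# for one site, from a list of snapshots, each a set of failed node ids.

def cond_failure_prob(snapshots, n: int) -> float:
    any_down = sum(1 for s in snapshots if len(s) >= 1)
    n_down = sum(1 for s in snapshots if len(s) >= n)
    return n_down / any_down if any_down else 0.0
```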

Correlated failures
[Figure: average failure correlation vs. RTT (ms), RTT 25–175 ms, correlation 0–0.25, with linear fits]

nodes   slope           y-intercept
  2     -2.4 × 10⁻⁴     0.195
  3     -2.3 × 10⁻⁴     0.155
  4     -2.3 × 10⁻⁴     0.134
  5     -2.4 × 10⁻⁴     0.119
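The fitted lines above give failure correlation as a linear function of RTT; a small sketch that just evaluates them (the fits are from the talk, the code is illustrative):

```python
# correlation(RTT) = slope * RTT + y_intercept, per the table above.
FITS = {  # nodes per site: (slope per ms, y-intercept)
    2: (-2.4e-4, 0.195),
    3: (-2.3e-4, 0.155),
    4: (-2.3e-4, 0.134),
    5: (-2.4e-4, 0.119),
}

def failure_correlation(nodes: int, rtt_ms: float) -> float:
    slope, intercept = FITS[nodes]
    return max(0.0, slope * rtt_ms + intercept)  # floor at zero for large RTT

# e.g. failure_correlation(2, 25) ≈ 0.189: correlation fades with distance,
# which is why distant backups are less vulnerable to correlated failure.
```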

Run-time model
• Discrete event simulation for expected run time and utilization (sketched below)
[State diagram as before: start → run → end, with fail → recover → ok/fail transitions]
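A minimal sketch of such a simulation, under simplifying assumptions of my own: exponential failures (the talk uses PlanetLab-derived distributions), a fixed recovery time, no repair of failed backups within a run, and restart-from-scratch once no backup remains. All parameter values are illustrative.

```python
import random

def simulate(run_time, overhead, mtbf, recovery, backups, trials=10_000):
    """Return (expected completion time, utilization).

    Utilization is the failure-free single-server run time divided by the
    expected elapsed time, matching the tradeoff framing in the talk.
    """
    total = 0.0
    for _ in range(trials):
        remaining = run_time * (1.0 + overhead)   # replication penalty
        elapsed = 0.0
        live_backups = backups
        while remaining > 0:
            ttf = random.expovariate(1.0 / mtbf)  # time to next failure
            if ttf >= remaining:
                elapsed += remaining              # finish before failing
                remaining = 0.0
            elif live_backups > 0:
                elapsed += ttf + recovery         # fail over; work survives
                remaining -= ttf
                live_backups -= 1
            else:
                elapsed += ttf + recovery         # no backup: restart job
                remaining = run_time * (1.0 + overhead)
        total += elapsed
    expected = total / trials
    return expected, run_time / expected
```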

Simulation results: one hour
• No replication: utilization = .995
[Figure: utilization vs. write intensity (0.0001 to 1.0) and RTT; left panel one backup, right panel four backups]

Simulation results: one day
• No replication: utilization = .934
[Figure: utilization vs. write intensity (0.0001 to 1.0) and RTT; left panel one backup, right panel four backups]

Simulation results: ten days
• No replication: utilization = .668
[Figure: utilization vs. write intensity and RTT; left panel one backup, right panel four backups]

Simulation discussion
• Replication improves utilization for long-running jobs
• Multiple backup servers do not improve utilization (due to low PlanetLab failure rates)

Simulation discussion
• Distant backup servers improve utilization for light writers
• Distant backup servers do not improve utilization for heavy writers
• Implications for checkpoint interval …

Checkpoint interval
• Calculated on the back of a napkin (a sketch follows)
[Figure: one-day job with 20% checkpoint overhead vs. ten-day job with 2% checkpoint overhead; panels for one backup server and four backup servers]
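One way to do the napkin math is Young's classic approximation for the optimal checkpoint interval; the talk does not name its formula, so this is an assumption, not the authors' calculation:

```python
# Young's approximation (an assumption here): the roughly optimal interval
# between checkpoints is sqrt(2 * checkpoint_cost * MTBF).
from math import sqrt

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximately optimal checkpoint interval, in seconds."""
    return sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative numbers: a 600 s checkpoint against a 7-day MTBF gives
# young_interval(600, 7 * 24 * 3600) ≈ 26,940 s, i.e. about 7.5 hours.
```

Lower failure rates (longer MTBF) stretch the interval, which is why cheap, reliable replication shifts the checkpoint tradeoff.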

Work in progress
• Realistic failure data
  • Storage and processor failure
  • PDSI failure data repository
• Realistic checkpoint costs — help!
• Realistic replication overhead
  • Depends on amount of computation
  • Less than 10% for NAS Grid Benchmarks

Conclusions
• Conventional wisdom holds that consistent mutable replication in large-scale distributed systems is too expensive to consider
• Our study suggests otherwise

Conclusions
• Consistent replication in large-scale distributed storage systems is feasible and practical
  • Superior performance
  • Rigorous adherence to conventional file system semantics
  • Improved utilization

Thank you for your attention!
www.citi.umich.edu
Questions?
