Published on March 11, 2014
Building Tomorrow's Ceph Sage Weil
Research beginnings
UCSC research grant “Petascale object storage” US Dept of Energy: LANL, LLNL, Sandia Scalability Reliability Performance Raw IO bandwidth, metadata ops/sec HPC file system workloads Thousands of clients writing to same file, directory
Distributed metadata management Innovative design Subtree-based partitioning for locality, efficiency Dynamically adapt to current workload Embedded inodes Prototype simulator in Java (2004) First line of Ceph code Summer internship at LLNL High security national lab environment Could write anything, as long as it was OSS
The rest of Ceph RADOS – distributed object storage cluster (2005) EBOFS – local object storage (2004/2006) CRUSH – hashing for the real world (2005) Paxos monitors – cluster consensus (2006) → emphasis on consistent, reliable storage → scale by pushing intelligence to the edges → a different but compelling architecture
Industry black hole Many large storage vendors Proprietary solutions that don't scale well Few open source alternatives (2006) Very limited scale, or Limited community and architecture (Lustre) No enterprise feature sets (snapshots, quotas) PhD grads all built interesting systems... ...and then went to work for NetApp, DDN, EMC, Veritas. They want you, not your project
A different path Change the world with open source Do what Linux did to Solaris, Irix, Ultrix, etc. What could go wrong? License GPL, BSD... LGPL: share changes, okay to link to proprietary code Avoid community-unfriendly practices No dual licensing No copyright assignment
DreamHost! Move back to Los Angeles, continue hacking Hired a few developers Pure development No deliverables
Ambitious feature set Native Linux kernel client (2007-) Per-directory snapshots (2008) Recursive accounting (2008) Object classes (2009) librados (2009) radosgw (2009) strong authentication (2009) RBD: rados block device (2010)
The kernel client ceph-fuse was limited, not very fast Build native Linux kernel implementation Began attending Linux file system developer events (LSF) Early words of encouragement from ex-Lustre devs Engage Linux fs developer community as peer Eventually merged CephFS client for v2.6.34 (early 2010) RBD client merged in 2011
Part of a larger ecosystem Ceph need not solve all problems as a monolithic stack Replaced EBOFS object file system with btrfs Same design goals Robust, well optimized Kernel-level cache management Copy-on-write, checksumming, other goodness Contributed some early functionality Cloning files Async snapshots
Budding community #ceph on irc.oftc.net, email@example.com Many interested users A few developers Many fans Too unstable for any real deployments Still mostly focused on right architecture and technical solutions
Road to product DreamHost decides to build an S3-compatible object storage service with Ceph Stability Focus on core RADOS, RBD, radosgw Paying back some technical debt Build testing automation Code review! Expand engineering team
The reality Growing incoming commercial interest Early attempts from organizations large and small Difficult to engage with a web hosting company No means to support commercial deployments Project needed a company to back it Fund the engineering effort Build and test a product Support users Bryan built a framework to spin out of DreamHost
Do it right How do we build a strong open source company? How do we build a strong open source community? Models? Red Hat, Cloudera, MySQL, Canonical, … Initial funding from DreamHost, Mark Shuttleworth
Goals A stable Ceph release for production deployment DreamObjects Lay foundation for widespread adoption Platform support (Ubuntu, Red Hat, SUSE) Documentation Build and test infrastructure Build a sales and support organization Expand engineering organization
Branding Early decision to engage professional agency MetaDesign Terms like “Brand core” “Design system” Keep project and company independent Inktank != Ceph The Future of Storage
Slick graphics; broken PowerPoint template
Today: adoption
Traction Too many production deployments to count We don't know about most of them! Too many customers (for me) to count Expansive partner list Lots of inbound Lots of press and buzz
Quality Increased adoption means increased demands on robust testing Across multiple platforms Upgrades Rolling upgrades Inter-version compatibility
Developer community Significant external contributors Many full-time contributors outside of Inktank First-class feature contributions from contributors Non-Inktank participants in daily stand-ups External access to build/test lab infrastructure Common toolset GitHub Email (kernel.org) IRC (oftc.net) Linux distros
CDS: Ceph Developer Summit Community process for building project roadmap 100% online Google hangouts Wikis Etherpad Quarterly Our 4th CDS next week Great participation Ongoing indoctrination of Inktank engineers to open development model
Erasure coding Replication for redundancy is flexible and fast For larger clusters, it can be expensive Erasure coded data is hard to modify, but ideal for cold or read-only objects Will be used directly by radosgw Coexists with new tiering capability

                   Storage overhead   Repair traffic   MTTDL (days)
3x replication     3x                 1x               2.3 E10
RS (10, 4)         1.4x               10x              3.3 E13
LRC (10, 6, 5)     1.6x               5x               1.2 E15
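The storage overhead figures in the table follow directly from the chunk counts. A minimal sketch in Python, assuming each scheme stores k data chunks plus m redundant chunks, and treating 3x replication as one data chunk plus two copies (a modeling convenience, not Ceph terminology). The repair-read counts are copied from the slide; reading them as "chunks read to rebuild one lost chunk" is an interpretation, not something the slide states.

```python
def storage_overhead(data_chunks, parity_chunks):
    """Raw bytes stored per byte of user data for a k+m layout."""
    return (data_chunks + parity_chunks) / data_chunks

# (k, m, repair reads). The repair-read values are taken from the slide's
# "Repair traffic" column; they are not computed here.
schemes = {
    "3x replication": (1, 2, 1),
    "RS (10, 4)": (10, 4, 10),
    "LRC (10, 6, 5)": (10, 6, 5),
}

for name, (k, m, repair_reads) in schemes.items():
    overhead = storage_overhead(k, m)
    print(f"{name}: overhead {overhead:.1f}x, repair reads {repair_reads} chunk(s)")
```

The trade-off the slide points at is visible in the numbers: the erasure-coded layouts store far less raw data than replication, but rebuilding a lost chunk touches more of the cluster.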
Tiering Client-side caches are great, but only buy so much. Separate hot and cold data onto different storage devices Promote hot objects into a faster (e.g., flash-backed) cache pool Push cold objects back into slower (e.g., erasure-coded) base pool Use bloom filters to track temperature Common in enterprise solutions; not found in open source scale-out systems → new (with erasure coding) in Firefly release
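One way to picture "use bloom filters to track temperature" is a short history of per-interval Bloom filters, where an object counts as hot if it shows up in enough recent intervals. The sketch below is illustrative only and is not Ceph's implementation; the class names, filter size, interval count, and promotion threshold are all assumptions chosen for the example.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class TemperatureTracker:
    """Keeps one filter per time interval (hypothetical policy)."""

    def __init__(self, intervals=4, promote_after=2):
        self.history = deque([BloomFilter()], maxlen=intervals)
        self.promote_after = promote_after

    def record_access(self, name):
        self.history[-1].add(name)

    def rotate(self):
        # Start a new interval; the oldest filter falls off the deque.
        self.history.append(BloomFilter())

    def temperature(self, name):
        # Number of recent intervals in which the object was (probably) seen.
        return sum(1 for f in self.history if name in f)

    def should_promote(self, name):
        return self.temperature(name) >= self.promote_after


tracker = TemperatureTracker()
tracker.record_access("obj-a")
tracker.record_access("obj-b")
tracker.rotate()
tracker.record_access("obj-a")
print(tracker.should_promote("obj-a"))  # seen in two intervals
```

The appeal of a Bloom filter here is that memory stays fixed per interval no matter how many objects are touched, at the cost of occasionally overestimating an object's temperature.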
The Future
Technical roadmap How do we reach new use-cases and users? How do we better satisfy existing users? How do we ensure Ceph can succeed in enough markets for supporting organizations to thrive? Enough breadth to expand and grow the community Enough focus to do well
Multi-datacenter, geo-replication Ceph was originally designed for single DC clusters Synchronous replication Strong consistency Growing demand Enterprise: disaster recovery ISPs: replicate data across sites for locality Two strategies: use-case specific: radosgw, RBD low-level capability in RADOS
RGW: Multi-site and async replication Multi-site, multi-cluster Regions: east coast, west coast, etc. Zones: radosgw sub-cluster(s) within a region Can federate across same or multiple Ceph clusters Sync user and bucket metadata across regions Global bucket/user namespace, like S3 Synchronize objects across zones Within the same region Across regions Admin control over which zones are master/slave
RBD: block devices Today: backup capability Based on block device snapshots Efficiently mirror changes between consecutive snapshots across clusters Now supported/orchestrated by OpenStack Good for coarse synchronization (e.g., hours or days) Tomorrow: data journaling for async mirroring Pending blueprint at next week's CDS Mirror active block device to remote cluster Possibly with some configurable delay
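The snapshot-based backup idea can be sketched as: compute what changed between two consecutive snapshots, ship only that delta, and apply it to the remote copy. The Python below is a toy model under stated assumptions: an image is a dict mapping block index to bytes, and `snapshot_diff` and `apply_diff` are hypothetical helper names, not part of any Ceph API.

```python
def snapshot_diff(old_snap, new_snap):
    """Return (changed, removed) blocks between two snapshots.

    Each snapshot is modeled as {block_index: bytes}.
    """
    changed = {
        idx: data for idx, data in new_snap.items() if old_snap.get(idx) != data
    }
    removed = [idx for idx in old_snap if idx not in new_snap]
    return changed, removed


def apply_diff(image, changed, removed):
    """Bring a remote image that matches the old snapshot up to the new one."""
    image.update(changed)
    for idx in removed:
        image.pop(idx, None)


snap1 = {0: b"a", 1: b"b", 2: b"c"}
snap2 = {0: b"a", 1: b"B", 3: b"d"}

remote = dict(snap1)  # the remote cluster already holds snap1
changed, removed = snapshot_diff(snap1, snap2)
apply_diff(remote, changed, removed)
print(remote == snap2)
```

Only the delta crosses the wire, which is why this approach suits coarse synchronization: the cost scales with how much changed between snapshots, not with image size.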
Async replication in RADOS One implementation to capture multiple use-cases RBD, CephFS, RGW, … RADOS A harder problem Scalable: 1000s OSDs → 1000s of OSDs Point-in-time consistency Challenging research problem → Ongoing design discussion among developers
CephFS → This is where it all started – let's get there Today Stabilization of multi-MDS, directory fragmentation, QA NFS, CIFS, Hadoop/HDFS bindings complete but not productized Need Greater QA investment Fsck Snapshots Amazing community effort (Intel, NUDT and Kylin) 2014 is the year
Governance How do we strengthen the project community? 2014 is the year Recognized project leads RBD, RGW, RADOS, CephFS, ... Formalize emerging processes around CDS, community roadmap External foundation?
The larger ecosystem
The enterprise How do we pay for all of this? Support legacy and transitional client/server interfaces iSCSI, NFS, pNFS, CIFS, S3/Swift VMware, Hyper-V Identify the beachhead use-cases Earn others later Single platform – shared storage resource Bottom-up: earn respect of engineers and admins Top-down: strong brand and compelling product
Why Ceph is the Future of Storage It is hard to compete with free and open source software Unbeatable value proposition Ultimately a more efficient development model It is hard to manufacture community Strong foundational architecture Next-generation protocols, Linux kernel support Unencumbered by legacy protocols like NFS Move from client/server to client/cluster Ongoing paradigm shift Software defined infrastructure, data center Widespread demand for open platforms
Thank you, and Welcome!