Published on March 11, 2014
Scaling Ceph at CERN
Dan van der Ster (firstname.lastname@example.org)
Data and Storage Service Group | CERN IT Department
CERN’s Mission and Tools
● CERN studies the fundamental laws of nature
  ○ Why do particles have mass?
  ○ What is our universe made of?
  ○ Why is there no antimatter left?
  ○ What was matter like right after the “Big Bang”?
  ○ …
● The Large Hadron Collider (LHC)
  ○ Built in a 27km long tunnel, ~200m underground
  ○ Dipole magnets operated at -271°C (1.9K)
  ○ Particles do ~11’000 turns/sec, 600 million collisions/sec
  ○ …
● Detectors
  ○ Four main experiments, each the size of a cathedral
  ○ DAQ systems processing petabytes/sec
Scaling Ceph at CERN - D. van der Ster 3
Big Data at CERN
Physics Data on CASTOR/EOS
● LHC experiments produce ~10GB/s, 25PB/year
User Data on OpenAFS & DFS
● Home directories for 30k users
● Physics analysis development
● Project spaces for applications
Service Data on AFS/NFS
● Databases, admin applications
Tape archival with CASTOR/TSM
● RAW physics outputs
● Desktop/Server backups

Service   Size     Files
OpenAFS   290TB    2.3B
CASTOR    89.0PB   325M
EOS       20.1PB   160M
IT Evolution at CERN
Cloudifying CERN’s IT infrastructure ...
● Centrally-managed and uniform hardware
  ○ No more service-specific storage boxes
● OpenStack VMs for most services
  ○ Building for 100k nodes (mostly for batch processing)
● Attractive desktop storage services
  ○ Huge demand for a local Dropbox, Google Drive …
● Remote data centre in Budapest
  ○ More rack space and power, plus disaster recovery
… brings new storage requirements
● Block storage for OpenStack VMs
  ○ Images and volumes
● Backend storage for existing and new services
  ○ AFS, NFS, OwnCloud, Data Preservation, ...
● Regional storage
  ○ Use of our new data centre in Hungary
● Failure tolerance, data checksumming, easy to operate, security, ...
Ceph at CERN
12 racks of disk server quads Wiebalck / van der Ster -- Building an organic block storage service at CERN with Ceph
Our 3PB Ceph Cluster
5 monitors
● Dual Intel Xeon L5640 (24 threads incl. HT)
● Dual 1Gig-E NICs (only one connected)
● 2x 2TB Hitachi system disks (RAID-1 mirror)
● 1x 240GB OCZ Deneva 2 (/var/lib/ceph/mon)
● 48GB RAM
47 disk servers / 1128 OSDs
● Dual Intel Xeon E5-2650 (32 threads incl. HT)
● Dual 10Gig-E NICs (only one connected)
● 24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
● 3x 2TB Hitachi system disks (triple mirror)
● 64GB RAM

# df -h /mnt/ceph
Filesystem  Size  Used  Avail  Use%  Mounted on
xxx:6789:/  3.1P  173T  2.9P   6%    /mnt/ceph
Use-Cases Being Evaluated
1. Images and Volumes for OpenStack
2. S3 Storage for Data Preservation / Public Dissemination
3. Physics data storage for archival and/or analysis
#1 is moving into production. #2 and #3 are more exploratory at the moment.
OpenStack Volumes & Images
• Glance: using RBD for ~3 months now.
  • Only issue was to increase ulimit -n above 1024 (10k is good).
• Cinder: testing with close colleagues.
  • 126 Cinder Volumes attached today, 56TB used
• Growing # of volumes/images
• Usual traffic is ~50-100MB/s with current usage. (~idle)
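The open-files limit mentioned for Glance can be checked from inside a client process before it connects to the cluster. The following is a minimal sketch, assuming a Unix host; the threshold is the slide's "10k is good" figure, and `nofile_limit_ok` is a hypothetical helper name, not part of any Ceph or OpenStack tool.

```python
import resource

# Threshold taken from the slide ("10k is good"); adjust for your own cluster.
RECOMMENDED_NOFILE = 10_000


def nofile_limit_ok(soft_limit, needed=RECOMMENDED_NOFILE):
    """Return True if the soft open-files limit is at least `needed`."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= needed


if __name__ == "__main__":
    # Read the limits of the current process (equivalent to `ulimit -n`).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} hard={hard} ok={nofile_limit_ok(soft)}")
```

Running this under the same user and init environment as the service is what matters, since limits are inherited per process.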
RBD for OpenStack Volumes
• Before general availability, we need to test and enable qemu iops/bps throttling
  • Otherwise VMs with many IOs can disrupt other users.
• One ongoing issue is that a few clients are getting an (infrequent) segfault of qemu during a VM reboot.
  • Happens on VMs with many attached RBDs.
  • Difficult to get a complete (16GB) core dump.
CASTOR & XRootD/EOS
• Exploring RADOS backend for these two HEP-developed file systems
  • Gateway model, similar to S3 via RADOSGW
• CASTOR needs raw throughput performance (to feed many tape drives at 250MBps each).
  • Striped RWs across many OSDs are important.
• XRootD/EOS may benefit from the highly scalable namespace to store O(billion) objects
  • Bonus: XRootD also offers http/webdav with X509/kerberos, possibly even fuse mountable.
• Developments are in early stages.
Operations & Lessons Learned
Configuration and Deployment
• Dumpling 0.67.7
• Fully Puppet-ized
• Automated server deployment, automated OSD replacement
• Very few custom ceph.conf options (listed below)
• Experimenting with the filestore wbthrottle
  • We find that disabling it completely gives better IOps performance
  • But don’t do this!!!

mon osd down out interval = 900
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd pool default flag hashpspool = true
osd max backfills = 1
osd recovery max active = 1
Cluster Activity
General Comments…
• In these ~7 months of running the cluster, there have been very few problems
  • No outages
  • No data losses/corruptions
  • No unfixable performance issues
  • Behaves well during stress tests
• But now we’re starting to get real/varied/creative users, and this brings up many interesting issues...
  • “No amount of stress testing can prepare you for real users” - Unknown
  • (Point being, don’t take the next slides to be too negative. I’m just trying to give helpful advice ;)
Latency & Slow Requests
• Best latency we can achieve is 20-40ms
  • Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster, but could in a smaller limited use-case cluster (e.g. for Cinder-only)
• Latency can increase dramatically with heavy usage
  • Don’t mix latency-bound and throughput-bound users on the same OSDs
• Local processes scanning the disks can hurt performance
  • Add /var/lib/ceph to the updatedb PRUNEPATHS
• If you have slow disks like us, you need to understand your disk IO scheduler, e.g. deadline prefers reads over writes: writes are given a 5 second deadline vs. 500ms for reads!
• Scrubbing!
• Kernel tuning: vm.* sysctl, dirty page flushing, memory reclaiming…
  • “Something is flushing the buffers, blocking the OSD processes”
• Slow requests: monitor them, eliminate them.
Life with 250 million objects
• Recently, a user decided to write 250 million 1kB objects
  • Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster being full of RBD images, at least in terms of # objects
• It worked: no big problems from holding this many objects.
• Tested single OSD failure: ~7 hours to backfill, including a double-backfill glitch that we’re trying to understand.
• But now we want to clean up, and it is not trivial to remove 250M objects!
  • rados rmpool generated quite a load when we rm’d a 3 million object pool (some OSDs were temporarily marked down).
  • Probably due to a mistake in our wbthrottle tuning
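The object-count argument above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1MB = 10^6 bytes, 1PB = 10^15 bytes); the variable names are illustrative only.

```python
OBJECT_COUNT = 250_000_000       # objects written by the user
RBD_OBJECT_SIZE = 4 * 10**6      # 4MB per RBD object (decimal units assumed)
SMALL_OBJECT_SIZE = 1000         # 1kB per test object
PETABYTE = 10**15

# Capacity the same number of objects would represent as full RBD objects.
equivalent_pb = OBJECT_COUNT * RBD_OBJECT_SIZE / PETABYTE

# Payload actually written by the test.
actual_bytes = OBJECT_COUNT * SMALL_OBJECT_SIZE

print(f"RBD-equivalent capacity: {equivalent_pb} PB")
print(f"actual payload: {actual_bytes / 10**9} GB")
```

The test therefore exercises the object count of a full petabyte of RBD images while storing only a small fraction of that in raw data.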
Other backfilling issues
• During a backfilling event (draining a whole server), we started observing repeated monitor elections
  • Caused by the mons’ LevelDBs being so active that the local SATA disks couldn’t keep up.
  • When a mon falls behind, it calls an election
  • Could be due to LevelDB compaction…
  • We moved /var/lib/ceph/mon to SSDs: no more elections during backfilling
• Avoid double backfilling when taking an OSD out of service:
  • Start with ceph osd crush rm <osd id> !!
  • If you mark the OSD out first, then crush rm it, you will compute a new CRUSH map twice, i.e. backfill twice.
Fun with CRUSH
• CRUSH is simple yet powerful, so it is tempting to play with the cluster layout
  • But once you have non-zero amounts of data, significant CRUSH changes will lead to massive data movements, which create extra disk load and may disrupt users.
  • Early CRUSH planning is crucial!
• A network switch is a failure domain, so we should configure CRUSH to replicate across switches, right?
  • But (assuming we don’t have a private cluster network) that would send all replication traffic via the switch uplinks: bottleneck!
  • Unclear tradeoff between uptime and performance.
CRUSH & Data distribution
• CRUSH may give your cluster an uneven data distribution
  • An OSD’s used space will scale with the number of PGs assigned to it
  • After you have designed your cluster, created your pools, started adding data, check the PG and volume distributions
  • reweight-by-utilization is useful to iron out an uneven PG distribution
  • The hashpspool flag is also important if you have many active pools
[Figure: Number of OSDs having N PGs (for pool = volumes); axes: nOSDs, n PGs]
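To get a feel for why PG counts per OSD end up uneven, one can simulate a purely random placement. This is only a sketch: uniform random sampling stands in for CRUSH (it is not the CRUSH algorithm), and the values 1128 OSDs, 4096 PGs and 3 replicas are assumptions borrowed from figures quoted elsewhere in this talk, not the actual settings of the volumes pool.

```python
import random
from collections import Counter


def simulate_pg_counts(n_osds, n_pgs, replicas, seed=0):
    """Place each PG on `replicas` distinct random OSDs; return PGs per OSD."""
    rng = random.Random(seed)
    per_osd = [0] * n_osds
    for _ in range(n_pgs):
        # random.sample picks distinct OSDs, like replicas on different disks.
        for osd in rng.sample(range(n_osds), replicas):
            per_osd[osd] += 1
    return per_osd


per_osd = simulate_pg_counts(n_osds=1128, n_pgs=4096, replicas=3)

# Histogram: how many OSDs hold exactly N PGs.
histogram = Counter(per_osd)
for n_pg in sorted(histogram):
    print(f"{n_pg:3d} PGs: {histogram[n_pg]} OSDs")

mean = sum(per_osd) / len(per_osd)
print(f"min={min(per_osd)} mean={mean:.1f} max={max(per_osd)}")
```

Even this idealised placement spreads OSDs over a wide range of PG counts, which is why checking the real distribution after creating pools is worthwhile.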
RBD Reliability with 3 Replicas
• RBD devices are chunked across thousands of objects:
  • A full 1TB volume is composed of 250,000 4MB objects
  • If any single object is lost, the whole RBD can be considered to be corrupted (obviously, it depends which blocks are lost!)
  • If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
• Our incorrect & irrational fears:
  • Any simultaneous triple disk failure in the cluster would lead to objects being lost, and somehow all RBDs would be corrupted.
  • As we add OSDs to the cluster, the data gets spread wider, and the chances of RBD data loss increase.
• But this is wrong!!
  • The only triple disk failures that can lead to data loss are those combinations actively used by PGs, so having e.g. 4096 PGs for RBDs means that only 4096 combinations out of the 10^9 possible combinations matter.
  • N_PGs * ~(P_diskfailure^3) / 3!
• We use 4 replicas for the RBD volumes, but this is probably overkill.
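The combinatorial point can be made concrete by counting disk triples. This sketch assumes the 1128 OSDs of the cluster described earlier, 3 replicas, and the example figure of 4096 PGs; it treats every PG as using a distinct triple, so the fraction is an upper bound.

```python
import math

N_OSDS = 1128   # 47 disk servers x 24 disks, from the cluster description
N_PGS = 4096    # example PG count quoted on the slide
REPLICAS = 3    # a PG is lost only if all of its replicas fail together

# Number of unordered sets of 3 disks that could fail simultaneously.
triples = math.comb(N_OSDS, REPLICAS)

# At most N_PGS of those triples actually hold all replicas of some PG.
fraction = N_PGS / triples

print(f"possible disk triples: {triples:,}")
print(f"fraction that would lose a PG (upper bound): {fraction:.2e}")
```

The unordered count is somewhat below the slide's round 10^9 (which is close to 1128^3), but the conclusion is the same: only a tiny fraction of triple failures can hit a PG.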
Trust your clients
• There is no server-side per-client throttling
  • A few nasty clients can overwhelm an OSD, leading to slow requests for everyone.
• When you have a high load / slow requests, it is not always trivial to identify and blacklist/firewall the misbehaving client
  • Could use some help in the monitoring: per-client perf stats?
• One of our creative users found a way to make the mons generate 5*40 MBps of outbound network traffic
  • Could saturate the mon network, lead to disruptions
• RADOS is not for end-users. A cephx keyring is for trusted persons only, not for Joe Random User.
Fat fingers
• A healthy cluster is always vulnerable to human errors
  • We’ve thus far avoided any big mistakes
• Used PG splitting to grow a pool from 8 to 2048 PGs
  • Leads to unresponsive OSDs that get marked down → degraded objects.
  • Safer & now-enforced to grow in 2x or 4x steps
• ulimits, ulimits, ulimits
  • With a large number of OSDs (say, more than 500), you will hit num file and num process limits everywhere:
  • Glance, qemu, radosgw, ceph/rados CLI, …
• If you use XFS, don’t put your OSD journal as a file on the disk
  • Use a separate partition, the first partition!
  • We still need to reinstall our whole cluster to re-partition the OSDs
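Growing a pool in 2x or 4x steps, as recommended above, amounts to computing a short schedule of intermediate PG counts. A minimal sketch follows; `pg_split_steps` is a hypothetical helper, not a Ceph command, and it only computes the numbers without touching a cluster.

```python
def pg_split_steps(current, target, factor=2):
    """Return the intermediate pg_num values when growing by `factor` per step."""
    if factor not in (2, 4):
        raise ValueError("factor must be 2 or 4")
    if current < 1 or target < current:
        raise ValueError("need 1 <= current <= target")
    steps = []
    while current < target:
        # Never overshoot the target on the final step.
        current = min(current * factor, target)
        steps.append(current)
    return steps


# The slide's example: growing a pool from 8 to 2048 PGs.
print(pg_split_steps(8, 2048, factor=2))
print(pg_split_steps(8, 2048, factor=4))
```

Waiting for the cluster to return to a healthy state between steps is the point of splitting gradually; how long to wait is left to the operator.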
Scale up and out
• Scale up: we are demonstrating the viability of a 3PB cluster with O(1000) OSDs.
  • What about 10,000 or 100,000 OSDs?
  • What about 10,000 or 100,000 clients?
  • Many Ceph instances is always an option, but not ideal
• Scale out: our growing data centre in Budapest brings many options:
  • Replicate over the WAN (though, 30ms RTT)
  • Tiering / Caching pools (new feature, need to get experience…)
  • Data locality: direct IOs to nearby replica or caching pool
Summary
• CERN IT infrastructure is undergoing a private cloud revolution, and Ceph is providing the underlying storage.
• Our CASTOR and XRootD physics data use-cases may exploit RADOS for improved performance/scalability.
• In seven months with a 3PB cluster, we’ve not had any disasters. Actually it’s working quite well.
• Presented some lessons learned, I hope they prove useful in your Ceph explorations.
Presented at Ceph Day Frankfurt.