Scaling Ceph at CERN - Ceph Day Frankfurt


Published on March 11, 2014

Author: Inktank_Ceph



Dan van der Ster, CERN

Scaling Ceph at CERN
Dan van der Ster
Data and Storage Service Group | CERN IT Department

CERN’s Mission and Tools
●  CERN studies the fundamental laws of nature
   ○  Why do particles have mass?
   ○  What is our universe made of?
   ○  Why is there no antimatter left?
   ○  What was matter like right after the “Big Bang”?
   ○  …
●  The Large Hadron Collider (LHC)
   ○  Built in a 27km long tunnel, ~200m underground
   ○  Dipole magnets operated at -271°C (1.9K)
   ○  Particles do ~11’000 turns/sec, 600 million collisions/sec
   ○  …
●  Detectors
   ○  Four main experiments, each the size of a cathedral
   ○  DAQ systems processing petabytes/sec
Scaling Ceph at CERN - D. van der Ster 3

Big Data at CERN
Physics Data on CASTOR/EOS
●  LHC experiments produce ~10GB/s, 25PB/year
User Data on OpenAFS & DFS
●  Home directories for 30k users
●  Physics analysis development
●  Project spaces for applications
Service Data on AFS/NFS
●  Databases, admin applications
Tape archival with CASTOR/TSM
●  RAW physics outputs
●  Desktop/Server backups

Service   Size     Files
OpenAFS   290TB    2.3B
CASTOR    89.0PB   325M
EOS       20.1PB   160M

IT Evolution at CERN
Cloudifying CERN’s IT infrastructure ...
●  Centrally-managed and uniform hardware
   ○  No more service-specific storage boxes
●  OpenStack VMs for most services
   ○  Building for 100k nodes (mostly for batch processing)
●  Attractive desktop storage services
   ○  Huge demand for a local Dropbox, Google Drive …
●  Remote data centre in Budapest
   ○  More rack space and power, plus disaster recovery
… brings new storage requirements
●  Block storage for OpenStack VMs
   ○  Images and volumes
●  Backend storage for existing and new services
   ○  AFS, NFS, OwnCloud, Data Preservation, ...
●  Regional storage
   ○  Use of our new data centre in Hungary
●  Failure tolerance, data checksumming, easy to operate, security, ...

Ceph at CERN

12 racks of disk server quads
(Wiebalck / van der Ster -- Building an organic block storage service at CERN with Ceph)

Our 3PB Ceph Cluster
47 disk servers / 1128 OSDs
●  Dual Intel Xeon E5-2650 (32 threads incl. HT)
●  Dual 10Gig-E NICs (only one connected)
●  24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
●  3x 2TB Hitachi system disks (triple mirror)
●  64GB RAM
5 monitors
●  Dual Intel Xeon L5640 (24 threads incl. HT)
●  Dual 1Gig-E NICs (only one connected)
●  2x 2TB Hitachi system disks (RAID-1 mirror)
●  1x 240GB OCZ Deneva 2 (/var/lib/ceph/mon)
●  48GB RAM

   # df -h /mnt/ceph
   Filesystem   Size  Used  Avail  Use%  Mounted on
   xxx:6789:/   3.1P  173T  2.9P   6%    /mnt/ceph
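
The headline numbers on this slide can be cross-checked with a little arithmetic. The sketch below assumes one OSD per 3TB data disk (consistent with the 1128 OSDs quoted) and uses the default pool size of 3 replicas from the configuration slide; it ignores filesystem overhead, so the usable figure is only a rough upper bound.

```python
# Raw capacity implied by the hardware list (decimal TB, as disk vendors count).
DISK_SERVERS = 47
DISKS_PER_SERVER = 24
DISK_TB = 3
REPLICAS = 3  # "osd pool default size = 3" on the configuration slide


def raw_capacity_tb(servers: int, disks: int, disk_tb: int) -> int:
    """Total raw capacity in TB, assuming one OSD per data disk."""
    return servers * disks * disk_tb


osds = DISK_SERVERS * DISKS_PER_SERVER
raw_tb = raw_capacity_tb(DISK_SERVERS, DISKS_PER_SERVER, DISK_TB)
usable_tb = raw_tb / REPLICAS  # upper bound; ignores filesystem overhead

print(osds)       # 1128
print(raw_tb)     # 3384
print(usable_tb)  # 1128.0
```

So the "3PB" in the title is raw capacity; with three replicas the space available to users is roughly a third of that.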

Use-Cases Being Evaluated
1.  Images and Volumes for OpenStack
2.  S3 Storage for Data Preservation / Public Dissemination
3.  Physics data storage for archival and/or analysis
#1 is moving into production. #2 and #3 are more exploratory at the moment.

OpenStack Volumes & Images
•  Glance: using RBD for ~3 months now.
   •  The only issue was the need to raise ulimit -n above 1024 (10k is good).
•  Cinder: testing with close colleagues.
   •  126 Cinder volumes attached today, 56TB used
[Figure: Growing # of volumes/images]
Usual traffic is ~50-100MB/s with current usage (~idle).
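
A client process can check its own open-file limit before talking to the cluster. This is a minimal sketch using Python's standard `resource` module (Unix only); the function name and the 10,000 threshold are illustrative, the latter taken from the "10k is good" remark above.

```python
import resource


def nofile_limit_ok(required: int = 10_000) -> bool:
    """Return True if the soft open-file limit is at least `required`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which always satisfies the requirement.
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    if not nofile_limit_ok():
        print("open-file limit is below 10k; raise ulimit -n for this service")
```

The check only reports the limit of the process that runs it; for a daemon such as Glance the limit has to be raised wherever that service is started.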

RBD for OpenStack Volumes
•  Before general availability, we need to test and enable qemu iops/bps throttling.
   •  Otherwise VMs with many IOs can disrupt other users.
•  One ongoing issue: a few clients get an (infrequent) qemu segfault during a VM reboot.
   •  Happens on VMs with many attached RBDs.
   •  Difficult to get a complete (16GB) core dump.

CASTOR & XRootD/EOS
•  Exploring a RADOS backend for these two HEP-developed file systems
   •  Gateway model, similar to S3 via RADOSGW
•  CASTOR needs raw throughput performance (to feed many tape drives at 250MBps each).
   •  Striped RWs across many OSDs are important.
•  XRootD/EOS may benefit from the highly scalable namespace to store O(billion) objects
   •  Bonus: XRootD also offers http/webdav with X509/kerberos, possibly even fuse mountable.
•  Developments are in early stages.

Operations & Lessons Learned

Configuration and Deployment
•  Dumpling 0.67.7
•  Fully Puppet-ized
   •  Automated server deployment, automated OSD replacement
•  Very few custom ceph.conf options:

      mon osd down out interval = 900

      osd pool default size = 3
      osd pool default min size = 1
      osd pool default pg num = 1024
      osd pool default pgp num = 1024
      osd pool default flag hashpspool = true

      osd max backfills = 1
      osd recovery max active = 1

•  Experimenting with the filestore wbthrottle
   •  We find that disabling it completely gives better IOps performance
   •  But don’t do this!!!
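
To make the units of these options explicit, the sketch below loads the same settings with Python's `configparser` and prints what they imply. The `[global]` section header is an assumption (the slide does not show one), and `configparser` is only a stand-in here, not Ceph's own config parser.

```python
import configparser

# The options from the slide; the [global] section header is an assumption.
CEPH_CONF = """
[global]
mon osd down out interval = 900
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd pool default flag hashpspool = true
osd max backfills = 1
osd recovery max active = 1
"""

parser = configparser.ConfigParser()
parser.read_string(CEPH_CONF)
conf = parser["global"]

# The down/out interval is given in seconds.
down_out_minutes = conf.getint("mon osd down out interval") / 60
replicas = conf.getint("osd pool default size")
min_replicas = conf.getint("osd pool default min size")
hashpspool = conf.getboolean("osd pool default flag hashpspool")

print(f"a down OSD is marked out after {down_out_minutes:.0f} minutes")
print(f"new pools keep {replicas} replicas and accept I/O with {min_replicas}")
```

The two backfill/recovery settings at the end limit concurrent recovery work per OSD to one operation each, which trades recovery speed for less impact on client I/O.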

Cluster Activity

General Comments…
•  In these ~7 months of running the cluster, there have been very few problems
   •  No outages
   •  No data losses/corruptions
   •  No unfixable performance issues
   •  Behaves well during stress tests
•  But now we’re starting to get real/varied/creative users, and this brings up many interesting issues...
   •  “No amount of stress testing can prepare you for real users” - Unknown
   •  (point being, don’t take the next slides to be too negative: I’m just trying to give helpful advice ;)

Latency & Slow Requests
•  Best latency we can achieve is 20-40ms
   •  Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster, but could in a smaller limited use-case cluster (e.g. for Cinder-only)
•  Latency can increase dramatically with heavy usage
   •  Don’t mix latency-bound and throughput-bound users on the same OSDs
•  Local processes scanning the disks can hurt performance
   •  Add /var/lib/ceph to the updatedb PRUNEPATHS
•  If you have slow disks like us, you need to understand your disk IO scheduler, e.g. deadline prefers reads over writes: writes are given a 5 second deadline vs. 500ms for reads!
•  Scrubbing!
•  Kernel tuning: vm.* sysctl, dirty page flushing, memory reclaiming…
   •  “Something is flushing the buffers, blocking the OSD processes”
•  Slow requests: monitor them, eliminate them.
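
A first step towards understanding the IO scheduler is knowing which one each disk uses. On Linux, /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets, e.g. "noop [deadline] cfq". The helper names below are illustrative, not part of any Ceph tooling.

```python
import re
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Extract the bracketed (active) scheduler from the sysfs file content."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no active scheduler marked in {sysfs_text!r}")
    return match.group(1)


def read_active_scheduler(device: str) -> str:
    """Read the active scheduler for a block device such as 'sda' (Linux only)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())
```

For example, `active_scheduler("noop [deadline] cfq")` returns `"deadline"`. With deadline active, the read and write deadlines mentioned above are exposed as `read_expire` and `write_expire` (in milliseconds) under the same device's `queue/iosched/` directory.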

Life with 250 million objects
•  Recently, a user decided to write 250 million 1kB objects
   •  Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster being full of RBD images, at least in terms of # objects
•  It worked: no big problems from holding this many objects.
   •  Tested single OSD failure: ~7 hours to backfill, including a double-backfill glitch that we’re trying to understand.
•  But now we want to clean up, and it is not trivial to remove 250M objects!
   •  rados rmpool generated quite a load when we rm’d a 3 million object pool (some OSDs were temporarily marked down).
   •  Probably due to a mistake in our wbthrottle tuning
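
The equivalence claimed above is easy to verify. The sketch below uses decimal units and compares the data the user actually wrote with the volume the same number of full 4MB RBD objects would occupy; the function name is illustrative.

```python
N_OBJECTS = 250_000_000
RBD_OBJECT_MB = 4    # RBD object size quoted on the slides
TEST_OBJECT_KB = 1   # size of the objects the user actually wrote


def equivalent_rbd_pb(n_objects: int, object_mb: int) -> float:
    """Data volume in PB if every object were a full RBD object."""
    return n_objects * object_mb / 1e9  # 1 PB = 1e9 MB (decimal units)


equivalent_pb = equivalent_rbd_pb(N_OBJECTS, RBD_OBJECT_MB)
actual_gb = N_OBJECTS * TEST_OBJECT_KB / 1e6  # 1 GB = 1e6 kB

print(equivalent_pb)  # 1.0
print(actual_gb)      # 250.0
```

In other words, the test stressed the object count of a 1PB RBD workload while storing only about 250GB of payload.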

Other backfilling issues
•  During a backfilling event (draining a whole server), we started observing repeated monitor elections
   •  Caused by the mons’ LevelDBs being so active that the local SATA disks couldn’t keep up.
   •  When a mon falls behind, it calls an election
   •  Could be due to LevelDB compaction…
•  We moved /var/lib/ceph/mon to SSDs: no more elections during backfilling
•  Avoid double backfilling when taking an OSD out of service:
   •  Start with ceph osd crush rm <osd id> !!
   •  If you mark the OSD out first and then crush rm it, the CRUSH map is recomputed twice, i.e. you backfill twice.

Fun with CRUSH
•  CRUSH is simple yet powerful, so it is tempting to play with the cluster layout
   •  But once you have non-zero amounts of data, significant CRUSH changes will lead to massive data movements, which create extra disk load and may disrupt users.
   •  Early CRUSH planning is crucial!
•  A network switch is a failure domain, so we should configure CRUSH to replicate across switches, right?
   •  But (assuming we don’t have a private cluster network) that would send all replication traffic via the switch uplinks: bottleneck!
   •  Unclear tradeoff between uptime and performance.

CRUSH & Data distribution
•  CRUSH may give your cluster an uneven data distribution
   •  An OSD’s used space will scale with the number of PGs assigned to it
   •  After you have designed your cluster, created your pools, started adding data, check the PG and volume distributions
•  reweight-by-utilization is useful to iron out an uneven PG distribution
•  The hashpspool flag is also important if you have many active pools
[Figure: Number of OSDs having N PGs (for pool = volumes); x-axis: n PGs, y-axis: n OSDs]
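
The spread in the histogram is what any pseudo-random placement produces. The sketch below is a toy model, not CRUSH: it places each PG on distinct OSDs chosen uniformly at random and counts PGs per OSD. The parameters (1128 OSDs, 4096 PGs, 4 replicas) are taken from figures quoted elsewhere in this talk and are only an assumption about the volumes pool.

```python
import random
from collections import Counter


def simulate_pgs_per_osd(n_osds: int, n_pgs: int, replicas: int, seed: int = 0) -> list:
    """Place each PG on `replicas` distinct random OSDs; return PG count per OSD."""
    rng = random.Random(seed)
    per_osd = Counter()
    for _ in range(n_pgs):
        for osd in rng.sample(range(n_osds), replicas):
            per_osd[osd] += 1
    return [per_osd[osd] for osd in range(n_osds)]


counts = simulate_pgs_per_osd(n_osds=1128, n_pgs=4096, replicas=4)
mean = sum(counts) / len(counts)
histogram = Counter(counts)  # maps "n PGs" -> "number of OSDs having n PGs"

print(f"mean {mean:.1f} PGs per OSD, min {min(counts)}, max {max(counts)}")
for n_pgs_on_osd in sorted(histogram):
    print(n_pgs_on_osd, histogram[n_pgs_on_osd])
```

Because used space scales with the number of PGs on an OSD, the gap between the least and most loaded OSD in such a histogram translates directly into uneven disk usage, which is what reweighting tries to correct.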

RBD Reliability with 3 Replicas
•  RBD devices are chunked across thousands of objects:
   •  A full 1TB volume is composed of 250,000 4MB objects
   •  If any single object is lost, the whole RBD can be considered to be corrupted (obviously, it depends which blocks are lost!)
•  If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
•  Our incorrect & irrational fears:
   •  Any simultaneous triple disk failure in the cluster would lead to objects being lost, and somehow all RBDs would be corrupted.
   •  As we add OSDs to the cluster, the data gets spread wider, and the chances of RBD data loss increase.
•  But this is wrong!!
   •  The only triple disk failures that can lead to data loss are those combinations actively used by PGs, so having e.g. 4096 PGs for RBDs means that only 4096 combinations out of the 10^9 possible combinations matter.
   •  N_PGs * ~(P_diskfailure^3) / 3!
•  We use 4 replicas for the RBD volumes, but this is probably overkill.
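
The counting argument can be made concrete. This sketch counts the possible three-disk combinations for the 1128-OSD cluster and the share of them that a 4096-PG pool can occupy, then evaluates the slide's expression exactly as written. The names are illustrative, and any value passed for the per-disk failure probability is a hypothetical input, not a measured one.

```python
from math import comb, factorial

N_OSDS = 1128
N_PGS = 4096
REPLICAS = 3

# Unordered sets of 3 distinct OSDs. Counting ordered triples instead gives
# about 1.4e9, the same order of magnitude as the 10^9 quoted on the slide.
possible_triples = comb(N_OSDS, REPLICAS)

# Upper bound on the share of triples that hold a PG (PGs may share a triple).
fraction_at_risk = N_PGS / possible_triples


def slide_estimate(n_pgs: int, p_disk_failure: float) -> float:
    """The slide's expression evaluated literally: N_PGs * P^3 / 3!."""
    return n_pgs * p_disk_failure ** 3 / factorial(3)


print(possible_triples)           # 238572376
print(f"{fraction_at_risk:.2e}")  # 1.72e-05
```

Only about 2 in 100,000 triple failures can hit a combination that actually stores a PG, and adding OSDs makes that share smaller rather than larger as long as the PG count stays fixed.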

Trust your clients
•  There is no server-side per-client throttling
   •  A few nasty clients can overwhelm an OSD, leading to slow requests for everyone.
•  When you have a high load / slow requests, it is not always trivial to identify and blacklist/firewall the misbehaving client
   •  Could use some help in the monitoring: per-client perf stats?
•  One of our creative users found a way to make the mons generate 5*40 MBps of outbound network traffic
   •  Could saturate the mon network and lead to disruptions
•  RADOS is not for end-users. A cephx keyring is for trusted persons only, not for Joe Random User.

Fat fingers
•  A healthy cluster is always vulnerable to human errors
•  We’ve thus far avoided any big mistakes
•  Used PG splitting to grow a pool from 8 to 2048 PGs
   •  Leads to unresponsive OSDs that get marked down → degraded objects.
   •  Safer, and now enforced, to grow in 2x or 4x steps
•  ulimits, ulimits, ulimits
   •  With a large number of OSDs (say, more than 500), you will hit num file and num process limits everywhere:
   •  Glance, qemu, radosgw, ceph/rados CLI, …
•  If you use XFS, don’t put your OSD journal as a file on the disk
   •  Use a separate partition, the first partition!
   •  We still need to reinstall our whole cluster to re-partition the OSDs
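
Growing a pool in 2x or 4x steps amounts to following a simple schedule. The helper below is a hypothetical sketch that only computes the intermediate PG counts; applying each step, and waiting for the cluster to settle in between, is left to the operator and is an assumption about good practice rather than something stated on the slide.

```python
def pg_split_steps(current: int, target: int, factor: int = 2) -> list:
    """Intermediate pg_num values when growing `current` to `target` by `factor`."""
    if factor not in (2, 4):
        raise ValueError("factor must be 2 or 4")
    if current < 1 or target < current:
        raise ValueError("need 1 <= current <= target")
    steps = []
    while current < target:
        # Never overshoot the target on the final step.
        current = min(current * factor, target)
        steps.append(current)
    return steps


print(pg_split_steps(8, 2048, factor=2))
# [16, 32, 64, 128, 256, 512, 1024, 2048]
print(pg_split_steps(8, 2048, factor=4))
# [32, 128, 512, 2048]
```

Going from 8 to 2048 PGs therefore takes eight doublings or four 4x steps, instead of the single 256x jump that made OSDs unresponsive.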

Scale up and out
•  Scale up: we are demonstrating the viability of a 3PB cluster with O(1000) OSDs.
   •  What about 10,000 or 100,000 OSDs?
   •  What about 10,000 or 100,000 clients?
   •  Many Ceph instances is always an option, but not ideal
•  Scale out: our growing data centre in Budapest brings many options:
   •  Replicate over the WAN (though, 30ms RTT)
   •  Tiering / Caching pools (new feature, need to get experience…)
   •  Data locality: direct IOs to nearby replica or caching pool

Summary

Summary
•  CERN IT infrastructure is undergoing a private cloud revolution, and Ceph is providing the underlying storage.
•  Our CASTOR and XRootD physics data use-cases may exploit RADOS for improved performance/scalability.
•  In seven months with a 3PB cluster, we’ve not had any disasters. Actually it’s working quite well.
•  I have presented some lessons learned; I hope they prove useful in your Ceph explorations.
