Scaling Ceph at CERN - Ceph Day Frankfurt


Published on March 11, 2014

Author: Inktank_Ceph



Dan van der Ster, CERN

Scaling Ceph at CERN
Dan van der Ster, Data and Storage Service Group, CERN IT Department

CERN's Mission and Tools
● CERN studies the fundamental laws of nature
  ○ Why do particles have mass?
  ○ What is our universe made of?
  ○ Why is there no antimatter left?
  ○ What was matter like right after the "Big Bang"?
  ○ ...
● The Large Hadron Collider (LHC)
  ○ Built in a 27km-long tunnel, ~200m underground
  ○ Dipole magnets operated at -271°C (1.9K)
  ○ Particles do ~11,000 turns/sec; 600 million collisions/sec
  ○ ...
● Detectors
  ○ Four main experiments, each the size of a cathedral
  ○ DAQ systems processing petabytes/sec

Big Data at CERN

Physics data on CASTOR/EOS
● LHC experiments produce ~10GB/s, 25PB/year

User data on OpenAFS & DFS
● Home directories for 30k users
● Physics analysis development
● Project spaces for applications

Service data on AFS/NFS
● Databases, admin applications

Tape archival with CASTOR/TSM
● RAW physics outputs
● Desktop/server backups

Service   Size     Files
OpenAFS   290TB    2.3B
CASTOR    89.0PB   325M
EOS       20.1PB   160M

IT Evolution at CERN

Cloudifying CERN's IT infrastructure ...
● Centrally-managed and uniform hardware
  ○ No more service-specific storage boxes
● OpenStack VMs for most services
  ○ Building for 100k nodes (mostly for batch processing)
● Attractive desktop storage services
  ○ Huge demand for a local Dropbox, Google Drive, ...
● Remote data centre in Budapest
  ○ More rack space and power, plus disaster recovery

... brings new storage requirements
● Block storage for OpenStack VMs
  ○ Images and volumes
● Backend storage for existing and new services
  ○ AFS, NFS, OwnCloud, Data Preservation, ...
● Regional storage
  ○ Use of our new data centre in Hungary
● Failure tolerance, data checksumming, easy to operate, security, ...

Ceph at CERN

12 racks of disk server quads
(Image from Wiebalck / van der Ster, "Building an organic block storage service at CERN with Ceph")

Our 3PB Ceph Cluster

Monitors (5):
● Dual Intel Xeon L5640, 24 threads incl. HT
● Dual 1Gig-E NICs (only one connected)
● 2x 2TB Hitachi system disks, RAID-1 mirror
● 1x 240GB OCZ Deneva 2 for /var/lib/ceph/mon
● 48GB RAM

Disk servers (47 servers, 1128 OSDs):
● Dual Intel Xeon E5-2650, 32 threads incl. HT
● Dual 10Gig-E NICs (only one connected)
● 24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
● 3x 2TB Hitachi system disks, triple mirror
● 64GB RAM

# df -h /mnt/ceph
Filesystem   Size  Used  Avail  Use%  Mounted on
xxx:6789:/   3.1P  173T   2.9P    6%  /mnt/ceph

Use-Cases Being Evaluated
1. Images and Volumes for OpenStack
2. S3 Storage for Data Preservation / Public Dissemination
3. Physics data storage for archival and/or analysis

#1 is moving into production; #2 and #3 are more exploratory at the moment.

OpenStack Volumes & Images
• Glance: using RBD for ~3 months now.
  • The only issue was having to increase ulimit -n above 1024 (10k is good; see the sketch below).
• Cinder: testing with close colleagues.
  • 126 Cinder volumes attached today; 56TB used.

Growing number of volumes/images. Usual traffic is ~50-100MB/s with current usage (~idle).
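
As a minimal sketch of the ulimit fix mentioned above, assuming a limits.conf-based setup; the user name "glance" and the value 10240 are illustrative, adjust to your deployment:

$ ulimit -n          # per-process open-file limit, often 1024 by default
1024
$ cat >> /etc/security/limits.conf <<'EOF'
glance  soft  nofile  10240
glance  hard  nofile  10240
EOF
# The service has to be restarted (in a fresh login session) to pick this up.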

RBD for OpenStack Volumes
• Before general availability, we need to test and enable qemu iops/bps throttling (see the sketch below).
  • Otherwise VMs with many IOs can disrupt other users.
• One ongoing issue is that a few clients are getting an (infrequent) segfault of qemu during a VM reboot.
  • Happens on VMs with many attached RBDs.
  • Difficult to get a complete (16GB) core dump.
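
The slides do not show how the throttling is wired in; one illustrative way to experiment by hand, before enabling it through Nova/Cinder, is libvirt's per-device throttling. The domain name, device, and limits below are assumptions, not values from the talk:

$ virsh blkdeviotune instance-0000abcd vdb \
    --total-iops-sec 500 \
    --total-bytes-sec 104857600    # ~100 MB/s
# Re-run 'virsh blkdeviotune instance-0000abcd vdb' with no limits to read back
# the currently applied values.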

CASTOR & XRootD/EOS
• Exploring a RADOS backend for these two HEP-developed file systems.
  • Gateway model, similar to S3 via RADOSGW.
• CASTOR needs raw throughput performance (to feed many tape drives at 250MB/s each).
  • Striped reads/writes across many OSDs are important (a rough sketch follows below).
• XRootD/EOS may benefit from the highly scalable namespace to store O(billion) objects.
  • Bonus: XRootD also offers HTTP/WebDAV with X509/Kerberos, and is possibly even FUSE-mountable.
• Developments are in early stages.
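
For illustration only: the striping idea is exposed through libradosstriper and, in sufficiently recent Ceph releases, a --striper option on the rados CLI. Availability depends on your Ceph version, and the pool and object names below are assumptions:

$ rados --striper -p castortest put run1234.raw run1234.raw   # object is split into stripe objects
$ rados --striper -p castortest stat run1234.raw              # size is reassembled from the stripes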

Operations & Lessons Learned

Configuration and Deployment
• Dumpling 0.67.7
• Fully Puppet-ized
  • Automated server deployment, automated OSD replacement
• Very few custom ceph.conf options (listed below; see also the sketch that follows)
• Experimenting with the filestore wbthrottle
  • We find that disabling it completely gives better IOPS performance
  • But don't do this!!!

mon osd down out interval = 900
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd pool default flag hashpspool = true
osd max backfills = 1
osd recovery max active = 1
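
A minimal sketch of checking what the running daemons actually picked up, and of changing an option at runtime; the OSD id and the option shown are illustrative:

# Show the live configuration of one OSD via its admin socket:
$ ceph daemon osd.0 config show | grep wbthrottle
# Change an option across all OSDs without a restart:
$ ceph tell osd.\* injectargs '--osd_max_backfills 1'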

Cluster Activity

General Comments...
• In these ~7 months of running the cluster, there have been very few problems:
  • No outages
  • No data losses/corruptions
  • No unfixable performance issues
  • Behaves well during stress tests
• But now we're starting to get real/varied/creative users, and this brings up many interesting issues...
  • "No amount of stress testing can prepare you for real users" - Unknown
• (Point being, don't take the next slides to be too negative; I'm just trying to give helpful advice ;)

Latency & Slow Requests
• The best latency we can achieve is 20-40ms.
  • Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster, but one could in a smaller, limited use-case cluster (e.g. Cinder-only).
• Latency can increase dramatically with heavy usage.
  • Don't mix latency-bound and throughput-bound users on the same OSDs.
• Local processes scanning the disks can hurt performance.
  • Add /var/lib/ceph to the updatedb PRUNEPATHS (see the sketch below).
• If you have slow disks like us, you need to understand your disk IO scheduler; e.g. deadline prefers reads over writes: writes are given a 5 second deadline vs. 500ms for reads!
• Scrubbing!
• Kernel tuning: vm.* sysctls, dirty page flushing, memory reclaiming...
  • "Something is flushing the buffers, blocking the OSD processes"
• Slow requests: monitor them, eliminate them.
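
A minimal sketch of two of the tweaks above; the device name is illustrative, and the deadline values shown are the stock kernel defaults:

# Keep updatedb away from the OSD filestores:
$ grep PRUNEPATHS /etc/updatedb.conf
PRUNEPATHS = "... /var/lib/ceph"
# Inspect the active IO scheduler and its read/write deadlines (in ms):
$ cat /sys/block/sdb/queue/scheduler
noop [deadline] cfq
$ cat /sys/block/sdb/queue/iosched/read_expire /sys/block/sdb/queue/iosched/write_expire
500
5000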

Life with 250 million objects
• Recently, a user decided to write 250 million 1kB objects.
  • Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster being full of RBD images, at least in terms of the number of objects.
• It worked; no big problems from holding this many objects.
  • Tested a single OSD failure: ~7 hours to backfill, including a double-backfill glitch that we're trying to understand.
• But now we want to clean up, and it is not trivial to remove 250M objects! (see the sketch below)
  • rados rmpool generated quite a load when we removed a 3-million-object pool (some OSDs were temporarily marked down).
  • Probably due to a mistake in our wbthrottle tuning.
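
The bulk removal used rados rmpool; as a rough alternative sketch, deletions can also be paced by hand if the load spike is a concern. The pool name is illustrative, and listing 250M objects is itself slow:

$ rados -p testpool ls | while read obj; do
      rados -p testpool rm "$obj"
      sleep 0.01          # crude pacing to keep the OSDs responsive
  done
# Or drop the whole pool at once (as in the talk), accepting the load spike:
$ rados rmpool testpool testpool --yes-i-really-really-mean-it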

Other backfilling issues
• During a backfilling event (draining a whole server), we started observing repeated monitor elections.
  • Caused by the mons' LevelDBs being so active that the local SATA disks couldn't keep up.
  • When a mon falls behind, it calls an election.
  • Could be due to LevelDB compaction...
  • We moved /var/lib/ceph/mon to SSDs; no more elections during backfilling.
• Avoid double backfilling when taking an OSD out of service (see the sketch below):
  • Start with ceph osd crush rm <osd id>!
  • If you mark the OSD out first and then crush rm it, you will compute a new CRUSH map twice, i.e. backfill twice.
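
A minimal sketch of that single-backfill decommission order; the OSD id is illustrative, and how you stop the daemon depends on your distro and init system:

# Remove the OSD from CRUSH first: this triggers the one and only backfill.
$ ceph osd crush rm osd.123
# Once the cluster is healthy again, retire the daemon and its identity:
$ service ceph stop osd.123       # on the host running the OSD
$ ceph auth del osd.123
$ ceph osd rm 123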

Fun with CRUSH
• CRUSH is simple yet powerful, so it is tempting to play with the cluster layout (an edit-cycle sketch follows below).
• But once you have non-zero amounts of data, significant CRUSH changes will lead to massive data movements, which create extra disk load and may disrupt users.
  • Early CRUSH planning is crucial!
• A network switch is a failure domain, so we should configure CRUSH to replicate across switches, right?
  • But (assuming we don't have a private cluster network) that would send all replication traffic via the switch uplinks: a bottleneck!
  • Unclear tradeoff between uptime and performance.
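
If you do experiment with the layout, one illustrative edit cycle is to decompile, edit, and test the CRUSH map offline before injecting it; the file names below are assumptions:

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
# ... edit crushmap.txt, e.g. add 'switch' buckets and a rule with
#     'step chooseleaf firstn 0 type switch' ...
$ crushtool -c crushmap.txt -o crushmap.new
$ crushtool --test -i crushmap.new --show-utilization --num-rep 3   # dry-run the mapping
$ ceph osd setcrushmap -i crushmap.new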

CRUSH & Data distribution
• CRUSH may give your cluster an uneven data distribution.
  • An OSD's used space will scale with the number of PGs assigned to it.
• After you have designed your cluster, created your pools, and started adding data, check the PG and volume distributions (see the sketch below).
  • reweight-by-utilization is useful to iron out an uneven PG distribution.
  • The hashpspool flag is also important if you have many active pools.

[Histogram: number of OSDs having N PGs, for pool = volumes]
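
A rough sketch of checking and correcting the spread; exact command availability varies by release ('ceph osd df' is newer than Dumpling, where 'ceph pg dump osds' gives similar data), and the pool name is illustrative:

# Per-OSD usage and PG counts:
$ ceph osd df            # or: ceph pg dump osds
# Nudge overloaded OSDs down (uses a utilization threshold, 120% by default):
$ ceph osd reweight-by-utilization
# Decouple object placement between pools (on releases that allow setting it post-creation):
$ ceph osd pool set volumes hashpspool true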

RBD Reliability with 3 Replicas
• RBD devices are chunked across thousands of objects:
  • A full 1TB volume is composed of 250,000 4MB objects.
  • If any single object is lost, the whole RBD can be considered corrupted (obviously, it depends which blocks are lost!).
  • If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
• Our incorrect & irrational fears:
  • Any simultaneous triple disk failure in the cluster would lead to objects being lost, and somehow all RBDs would be corrupted.
  • As we add OSDs to the cluster, the data gets spread wider, and the chances of RBD data loss increase.
• But this is wrong!!
  • The only triple disk failures that can lead to data loss are those combinations actively used by PGs, so having e.g. 4096 PGs for RBDs means that only 4096 combinations out of the ~10^9 possible combinations matter.
  • Roughly, P(loss) ≈ N_PGs * (P_diskfailure)^3 / 3!  (a worked instance follows below)
• We use 4 replicas for the RBD volumes, but this is probably overkill.
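
A back-of-envelope instance of that formula; the per-disk failure probability is an assumed illustrative number, not one from the talk:

# P = probability that a given disk dies within the recovery window (assumed 1e-4 here);
# 4096 PGs in the RBD pool, 3 replicas, so divide by 3! = 6 orderings.
$ python -c "n_pgs=4096.0; p=1e-4; print(n_pgs * p**3 / 6)"
# ~6.8e-10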

Trust your clients
• There is no server-side per-client throttling.
  • A few nasty clients can overwhelm an OSD, leading to slow requests for everyone.
• When you have high load / slow requests, it is not always trivial to identify and blacklist/firewall the misbehaving client (see the sketch below).
  • Could use some help in the monitoring: per-client perf stats?
• One of our creative users found a way to make the mons generate 5*40 MB/s of outbound network traffic.
  • Could saturate the mon network and lead to disruptions.
• RADOS is not for end-users. A cephx keyring is for trusted persons only, not for Joe Random User.
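
Two hedged mitigations once a bad actor has been identified; the client address and key names are illustrative:

# Block a misbehaving client at the RADOS level:
$ ceph osd blacklist add 10.0.0.5:0/3141592
$ ceph osd blacklist ls
# And hand out narrowly scoped keys rather than broad ones:
$ ceph auth get-or-create client.cinder mon 'allow r' osd 'allow rwx pool=volumes'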

Fat fingers
• A healthy cluster is always vulnerable to human errors.
  • We've thus far avoided any big mistakes.
• Used PG splitting to grow a pool from 8 to 2048 PGs.
  • Leads to unresponsive OSDs which get marked down, and hence degraded objects.
  • Safer, and now enforced: grow in 2x or 4x steps (see the sketch below).
• ulimits, ulimits, ulimits
  • With a large number of OSDs (say, more than 500), you will hit open-file and process-count limits everywhere:
  • Glance, qemu, radosgw, the ceph/rados CLI, ...
• If you use XFS, don't put your OSD journal as a file on the disk.
  • Use a separate partition, the first partition!
  • We still need to reinstall our whole cluster to re-partition the OSDs.
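
A minimal sketch of the stepped PG growth; the pool name and intermediate sizes are illustrative, and pgp_num has to follow pg_num at each step:

# Grow in modest factors rather than 8 -> 2048 in one jump:
$ ceph osd pool set volumes pg_num 512
$ ceph osd pool set volumes pgp_num 512
# ... wait for HEALTH_OK, then repeat towards the final target ...
$ ceph osd pool set volumes pg_num 2048
$ ceph osd pool set volumes pgp_num 2048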

Scale up and out
• Scale up: we are demonstrating the viability of a 3PB cluster with O(1000) OSDs.
  • What about 10,000 or 100,000 OSDs?
  • What about 10,000 or 100,000 clients?
  • Many Ceph instances is always an option, but not ideal.
• Scale out: our growing data centre in Budapest brings many options:
  • Replicate over the WAN (though, 30ms RTT)
  • Tiering / caching pools (new feature, need to get experience...)
  • Data locality: direct IOs to the nearby replica or a caching pool

Summary

Summary
• CERN's IT infrastructure is undergoing a private cloud revolution, and Ceph is providing the underlying storage.
• Our CASTOR and XRootD physics data use cases may exploit RADOS for improved performance/scalability.
• In seven months with a 3PB cluster, we've not had any disasters. Actually, it's working quite well.
• I presented some lessons learned; I hope they prove useful in your Ceph explorations.
