1 sysadmin vs 250 storage clusters

Published on November 20, 2019

Author: ovhcom

Source: slideshare.net

1. 1 sysadmin vs 250 clusters
Etienne Menguy, SysadminDays, November 19, 2019

2. OVHcloud
• 1,500,000 customers
• 2,200 employees
• 380,000 bare-metal servers

3. Ceph at OVHcloud
• Public Cloud: virtual machines, additional disks
• Cloud Disk Array: as a service

4. Evolution
• 2015
  • 4 dev
  • 1 ops
  • 8 clusters
  • 4 regions
• 2019
  • 9 dev
  • 250 clusters
  • 10 regions

5. Daily work
• 1 sysadmin
  • Monitoring
  • Pushing to production
  • Support
  • Training
  • Deploying regions, servers
  • And the daily surprises
• 8 devs
  • Ceph as a service
  • Infra as code
  • Code review
  • Tests
  • R&D

6. Ceph setup
• Bare-metal server
  • 40 Gbps NIC
  • NVMe, split into partitions
  • 12 HDDs
  • For each HDD: NVMe partition, Flashcache, LXC container, data

7. Ceph as a service
• Autonomous users
  • Creating cluster
  • Managing users, pools, rights
  • Managing network
  • Cluster growth
• Backup management
  • 500 TB/day
  • Ceph -> Swift
  • Ceph -> Ceph
• Managing our infrastructure
  • Cluster upgrade
  • Deploy new Ceph versions
  • Manage tasks
  • Host management
  • Network management
  • Containers management

8. Infrastructure
Diagram components: servers, containers, VM, instances, database, Puppet, Python API, OVH API, RabbitMQ, Celery

9. Task management
• RabbitMQ
• Celery
  • https://github.com/ovh/celery-dyrygent
  • Complex workflow
  • Reliable
  • Monitoring
  • Web interface
  • Planned tasks
  • NVMe replacement
  • Self healing
  • Triggered by monitoring probe
  • Executes any operation
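The "self healing, triggered by monitoring probe" idea can be pictured as a small dispatch table from probe names to repair tasks. The sketch below is plain Python and does not use Celery or celery-dyrygent; the probe name `nvme_failing`, the `on_alert` decorator and the `handle_alert` function are hypothetical names chosen for illustration, not part of the OVH tooling.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: probe name -> repair task.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def on_alert(probe_name: str):
    """Register a function as the repair task for a given probe (illustrative)."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[probe_name] = func
        return func
    return register


@on_alert("nvme_failing")  # assumed probe name
def replace_nvme(host: str) -> str:
    # A real system would enqueue a workflow here; this only returns a description.
    return f"scheduled NVMe replacement on {host}"


def handle_alert(probe_name: str, host: str) -> Optional[str]:
    """Run the repair task registered for the probe, or return None if there is none."""
    handler = HANDLERS.get(probe_name)
    if handler is None:
        return None
    return handler(host)


if __name__ == "__main__":
    print(handle_alert("nvme_failing", "host-1"))
```

Alerts without a registered handler fall through and return `None`, which leaves them for a human to handle.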

10. Example
start -> Check operation safety -> Lower disk weight -> Wait cluster_health_ok -> Weight equals 0?
• Yes: Remove disk from cluster
• No: repeat
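A minimal sketch of the disk-removal workflow on this slide, written as a plain Python loop. All cluster interactions are passed in as callables, so nothing here is a real Ceph or Celery API; the function name, the parameter names and the `step` value are assumptions, and the "No" branch is assumed to loop back to the safety check.

```python
from typing import Callable


def drain_and_remove_disk(
    get_weight: Callable[[], float],
    set_weight: Callable[[float], None],
    is_safe: Callable[[], bool],
    wait_health_ok: Callable[[], None],
    remove_disk: Callable[[], None],
    step: float = 0.25,  # assumed decrement, not taken from the slides
) -> int:
    """Lower a disk's weight step by step, then remove it.

    Returns the number of weight changes that were applied.
    """
    changes = 0
    while True:
        # Check operation safety
        if not is_safe():
            raise RuntimeError("operation is not safe, aborting")
        # Lower disk weight (never below zero)
        set_weight(max(0.0, get_weight() - step))
        changes += 1
        # Wait cluster_health_ok
        wait_health_ok()
        # Weight equals 0? Yes: leave the loop. No: repeat.
        if get_weight() <= 0:
            break
    # Remove disk from cluster
    remove_disk()
    return changes
```

Lowering the weight gradually and waiting for a healthy cluster between steps keeps each data movement small, which is presumably why the flowchart loops instead of removing the disk at once.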

11. Continuous delivery
• CDS
  • https://github.com/ovh/cds
• Each pull request
  • Lint
  • Unit test
• Daily push to production
  • All tests executed

12. Infra as code

13. Inconsistent hardware
• Hardware profile
  • 12 profiles in production
  • CPU
  • NVMe
  • HDD
• Firmwares
• Ceph versions
• Generic tools
• 1 profile = 1 cluster

14. Monitoring
• Automatic downtimes set by tasks
• Some alarms only during working hours
• Services/hosts aggregation
• 143,000 services
• 25,000 hosts
• 3 infrastructures
  • 6 masters
  • 12 satellites
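One way to picture "automatic downtimes by tasks" is a context manager that opens a downtime before a task touches a host and always closes it afterwards. The slides do not name the monitoring system or its API, so `InMemoryMonitoring`, `start_downtime` and `end_downtime` below are hypothetical stand-ins that only record calls.

```python
from contextlib import contextmanager


class InMemoryMonitoring:
    """Hypothetical stand-in for a monitoring API; it only records calls."""

    def __init__(self):
        self.active = set()
        self.history = []

    def start_downtime(self, host, reason):
        self.active.add(host)
        self.history.append(("start", host, reason))

    def end_downtime(self, host):
        self.active.discard(host)
        self.history.append(("end", host, None))


@contextmanager
def downtime(monitoring, host, reason):
    """Silence alerts for a host while a task runs, even if the task fails."""
    monitoring.start_downtime(host, reason)
    try:
        yield
    finally:
        # Always lift the downtime so a crashed task cannot hide a host forever.
        monitoring.end_downtime(host)


if __name__ == "__main__":
    monitoring = InMemoryMonitoring()
    with downtime(monitoring, "host-1", "NVMe replacement"):
        pass  # the task's work would go here
    print(monitoring.history)
```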

15. Metrics
• Clusters metrics
  • Usage
  • Latency
• Hardware
  • CPU, memory usage
  • Cache hit ratio
• Service
  • KPI
  • Usage per OpenStack region
• Metrics Data Platform
  • https://www.ovh.com/fr/data-platforms/metrics/
• 13 million series
• 13 billion points per day
• Performance
  • IO/s
  • Latency
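The two volume figures on this slide can be related with a little arithmetic. The calculation below assumes points are spread evenly across series and across the day, which the slides do not state.

```python
SERIES = 13_000_000              # from the slide
POINTS_PER_DAY = 13_000_000_000  # from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Average number of points each series receives per day.
points_per_series_per_day = POINTS_PER_DAY / SERIES

# Average interval between two points of the same series, in seconds.
seconds_between_points = SECONDS_PER_DAY / points_per_series_per_day

# Average ingestion rate across the whole platform.
points_per_second = POINTS_PER_DAY / SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"{points_per_series_per_day:.0f} points per series per day")
    print(f"one point every {seconds_between_points:.1f} s per series")
    print(f"about {points_per_second:,.0f} points per second overall")
```

Under that even-spread assumption, each series gets 1,000 points per day, roughly one point every 86 seconds, for an overall rate of about 150,000 points per second.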

16. Logs
• Infrastructure
  • OS
  • Ceph
• Applications
  • CAAS
  • Celery / RabbitMQ
  • Unique step/task ID
• API
• Logs Data Platform
  • https://www.ovh.com/fr/data-platforms/logs/
• 15,000 logs/second
• Graylog
• Filebeat

17. Conclusion

18. Questions?
