Ceph Day San Jose - Object Storage for Big Data


Published on March 20, 2017

Author: Inktank_Ceph

Source: slideshare.net

1. OBJECT STORAGE FOR BIG DATA
CEPH DAY SAN JOSE
Mengmeng Liu, Senior Manager, Big Fast Data Technology
Kyle Bader, Senior Solution Architect


3. WHAT THEY WANT
RED HAT
● Support a diverse ecosystem of data analysis tools
● Independent scaling of compute and storage
● Rapid provisioning of data labs
● Public cloud / private cloud architectural parity
● Controlled, predictable costs

4. THE ELEPHANT IN THE ROOM
Batch analytics with MapReduce on HDFS. [Diagram: Ingest → Persistent Data → Exit to other data warehouses]

5. PLURALITY OF ANALYTICS TOOLS
Single source of truth. [Diagram: Ingest → Persistent Data → Exit to other data warehouses]

6. BIG DATA LINEAGE
Multiple data copies from each stage of data transformation, potentially in different places. The more analytical processing clusters you have, the harder it is to know which data is where.

7. DISAGGREGATION HIGHLIGHTS
● Ad-hoc provisioning of transient analytics clusters
● Scale compute up for a workload
● Terminate compute post-workload to reclaim resources
● Allow non-analytic workloads to make use of excess compute
● Break the linear relationship between compute and storage scaling

8. CEPH OBJECT STORE HIGHLIGHTS
● Striped erasure coding: 1.5x overhead vs. replication
● Strongly consistent
● Rich access control semantics
● Bucket versioning
● Active/active multi-site replication (asynchronous)
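The 1.5x overhead figure follows directly from the erasure-coding geometry. A minimal sketch of the arithmetic, assuming a 4+2 profile (Ceph lets you choose the data/coding chunk counts k and m per pool):

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw-to-usable storage ratio for a k+m erasure-coded pool."""
    return (k + m) / k

# A 4+2 profile stores 6 chunks for every 4 data chunks: 1.5x raw overhead.
print(ec_overhead(4, 2))  # 1.5
# Triple replication, by comparison: 1 data copy + 2 extra replicas = 3.0x.
print(ec_overhead(1, 2))  # 3.0
```

The same formula shows the trade-off space: wider stripes (larger k) lower the overhead but spread each object across more OSDs.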

9. ARCHITECTURE: ELASTIC DATA LAKE
Disaggregating compute resources from an object storage solution enables the most flexibility:
• Ingest from multiple sources using the S3A API
• Analytics operate directly on the object store, without expensive, time-consuming ETL or replication
• Analytics performed using batch or interactive tools
• Exploratory analysis supported by ephemeral clusters
[Diagram: ingest apps and streams feed the elastic data lake over S3/NFS; batch analytics, interactive frameworks, data laboratories, and ephemeral clusters operate in situ through the S3A-compatible API, with an interactive query engine and resource management on top of the object storage solution]

10. WHAT WE’RE TESTING
Measuring key use cases from our Center of Excellence, from query to storage platform.
Test execution:
● Use cases and workloads
● Structured / log data
● …
Parameters:
● Query engines
● Data volume
● Object storage locations
● Configurations
[Diagram: the elastic data lake architecture above, exercised by an AWS benchmark evaluation suite]

11. 11 ○ Decouple compute from storage ○ Leverage Openstack/Ceph and OneOps (lifecycle management tool open sourced by @Walmartlabs: https://github.com/oneops) ○ Predictable SLAs, flexibility of big data software versions, tenant-specific cloud deployment and operations, and a shared data lake with 1.5x replication ○ Analogous to the AWS + S3 model OUR BIG DATA JOURNEY @WALMARTLABS Started to build on-premise Openstack/Ceph clouds in early 2016

12. MAKING THE CONNECTION
● Ceph RGW supports the S3 and Swift RESTful APIs
● S3A/Swift Hadoop connectors/drivers are in the Hadoop codebase
● Interact with Ceph objects like files in HDFS (but no append)
● External tables in the Hive metastore

13. MAKING THE CONNECTION: S3A
● https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
● S3A connector in the HDFS client:
  fs.s3a.{access,secret}.key
  fs.s3a.endpoint
● Use in conjunction with Hive external tables:
  create database mydb location 's3a://bucket/mydb';
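Wired together, the properties above go into Hadoop's core-site.xml. A minimal sketch, assuming a Ceph RGW endpoint at rgw.example.com:7480 (hostname, port, and credentials are placeholders):

```xml
<!-- core-site.xml: point the S3A connector at a Ceph RGW endpoint -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://rgw.example.com:7480</value> <!-- placeholder RGW host:port -->
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>MY_ACCESS_KEY</value> <!-- placeholder credential -->
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>MY_SECRET_KEY</value> <!-- placeholder credential -->
  </property>
</configuration>
```

With this in place, any Hadoop client (Hive, Spark, MapReduce) can resolve s3a:// paths against the Ceph cluster instead of AWS.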

14. MAKING THE CONNECTION: SWIFT
● http://hadoop.apache.org/docs/current/hadoop-openstack/index.html
● Swift connector in the HDFS client:
  fs.swift.service.region_name.{username,password,tenant,region,public}
  fs.swift.service.region_name.auth.url
● Use in conjunction with Hive external tables:
  create database mydb location 'swift://container.region_name/mydb';
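The Swift connector keys above follow the same core-site.xml pattern, with the service name ("region_name" here) embedded in each key. A minimal sketch; the Keystone endpoint and credentials are placeholders:

```xml
<!-- core-site.xml: Swift connector for a service instance named "region_name" -->
<configuration>
  <property>
    <name>fs.swift.service.region_name.auth.url</name>
    <value>http://keystone.example.com:5000/v2.0/tokens</value> <!-- placeholder Keystone endpoint -->
  </property>
  <property>
    <name>fs.swift.service.region_name.username</name>
    <value>myuser</value> <!-- placeholder credential -->
  </property>
  <property>
    <name>fs.swift.service.region_name.password</name>
    <value>mypassword</value> <!-- placeholder credential -->
  </property>
  <property>
    <name>fs.swift.service.region_name.tenant</name>
    <value>mytenant</value>
  </property>
  <property>
    <name>fs.swift.service.region_name.public</name>
    <value>true</value>
  </property>
</configuration>
```

Paths then take the form swift://container.region_name/path, matching the Hive location shown above.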

15. LIMITATIONS OF THE S3A CONNECTOR
● Performant S3A connector features are in the unreleased Hadoop 2.8.0-SNAPSHOT+:
  ○ Lazy seek
  ○ Better threadpool implementation
  ○ Multi-region support
  ○ http://events.linuxfoundation.org/sites/events/files/slides/2016-11-08-Hadoop,%20Hive,%20Spark%20and%20Object%20Stores%20-final.pdf
● The aws-sdk-java library that S3A depends on keeps changing, and Ceph needs to keep testing against the latest version
● A lot of testing against S3A is going on @ Ceph
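The lazy-seek feature mentioned above can be illustrated language-neutrally. A minimal Python sketch (not the actual Java S3A code): seek() only records the target position, and the expensive reposition of the underlying object stream is deferred until a read actually happens, so repeated seeks (common when readers probe columnar file footers) collapse into one:

```python
import io

class LazySeekStream:
    """Wraps a seekable stream; seek() only records the target position.
    The (expensive) reposition happens on the next read, so back-to-back
    seeks cost nothing until data is actually needed."""

    def __init__(self, raw):
        self.raw = raw
        self.target = None    # pending seek position, if any
        self.repositions = 0  # count of real seeks, for demonstration

    def seek(self, pos):
        self.target = pos     # cheap: no I/O yet

    def read(self, n):
        if self.target is not None:
            self.raw.seek(self.target)  # the one real reposition
            self.repositions += 1
            self.target = None
        return self.raw.read(n)

s = LazySeekStream(io.BytesIO(b"0123456789"))
s.seek(2); s.seek(8); s.seek(4)  # three seeks, zero I/O so far
data = s.read(2)                 # one real reposition, then the read
print(data, s.repositions)       # b'45' 1
```

Against an object store, each real reposition can mean a fresh ranged HTTP GET, which is why deferring them pays off.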

16. LIMITATIONS OF THE SWIFT CONNECTOR
● Job failures and inferior performance when running large-scale Hive/Spark/Presto queries:
  ○ Uncontrolled number of HTTP requests (list, rename, copy, etc.)
  ○ Listing large numbers of objects (bounded by a fixed limit, e.g., 10,000 objects)
  ○ ORC range queries throw EOF exceptions
  ○ Re-auth expiration issue with deadlocks

17. INTRODUCING SWIFTA: A PERFORMANT SWIFT CONNECTOR
Tons of performance-improvement features while retaining the core functionality of the Swift API (@WalmartLabs will open source this performant SwiftA Hadoop connector this year and present it at OpenStack Summit 2017 in Boston):
1. Re-designed and implemented new thread pools for list, copy, delete, and rename
2. Re-designed and implemented pagination and iterator-based object listing to reduce the memory footprint when retrieving large numbers of objects
3. Implemented lazy seek (which improves some Presto queries by 30x-100x)
4. Implemented an LRU cache to reduce the overhead of HEAD requests
5. Fixed the frequent re-auth expiration error and its deadlocks
6. Addressed the ORC range query issue with dynamic offsets based on file lengths
7. Added metrics support for the input stream, output stream, Swift file system, etc.
8. Added support for parallel uploads during data creation (multi-part upload)
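The pagination/iterator-based listing can be sketched as follows. This is a minimal Python illustration, not SwiftA's Java code; the `list_page` callable is a hypothetical stand-in for the Swift listing API, which returns at most `limit` names per request and accepts a `marker` to continue after the last name seen:

```python
def list_all(list_page, limit=10000):
    """Iterate over every object name in a container, page by page, so the
    caller never holds more than `limit` names in memory and is never
    silently truncated at the server-side listing cap."""
    marker = None
    while True:
        page = list_page(marker=marker, limit=limit)
        if not page:
            return
        yield from page
        if len(page) < limit:  # short page: nothing left to fetch
            return
        marker = page[-1]      # continue after the last name seen

# Demo with a fake backend holding 25 objects and a cap of 10 per request.
names = [f"obj-{i:04d}" for i in range(25)]

def fake_list_page(marker=None, limit=10000):
    start = names.index(marker) + 1 if marker else 0
    return names[start:start + limit]

assert list(list_all(fake_list_page, limit=10)) == names
```

Because `list_all` is a generator, a Hadoop-style caller can stream names into per-object work without ever materializing the full container listing.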


