Published on November 27, 2013
Running Cassandra in AWS Patrick Eaton, PhD firstname.lastname@example.org @PatrickREaton Joey Imbasciano email@example.com @_joeyi
Stackdriver at a Glance Stackdriver's hosted intelligent monitoring service helps SaaS companies innovate more by reducing the burden of day-to-day operations ● Cloud-native and cloud-aware ● Designed for complex distributed applications ● Founded by cloud/infrastructure industry veterans (Microsoft, VMware, EMC, Endeca, Red Hat) with deep systems and DevOps expertise ● Team of ~25, based in Downtown Boston
Intelligent Monitoring Discover customer’s cloud-hosted applications ● ● ● ● Infrastructure inventory Logical units, like groups/clusters Services, hosted and self-managed Elastic resources Monitor ● ● Various data sources ● Provider metrics ● Host metrics ● Custom metrics ● Endpoints ● Events ● Health Rich visualizations Analyze ● ● ● ● ● Integrate data sources Aggregate metrics Report utilization, cost, etc. Detect policy violations Recommend actions
Lambda Architecture ● ● ● ● ● ● Typical of modern architectures for on-line applications. Formalized by Nathan Marz Composed of "batch", "speed", and "serving" layers Batch layer ○ Store of record ○ Compute arbitrary views Speed layer ○ Low latency updates ○ Streaming algorithms Serving layer ○ Combine data from batch and speed layers to answer queries Serving Speed Batch Data
Stackdriver Architecture ● ● ● ● ● Shares characteristics of lambda architecture Indexing (speed) path ○ Make "live" data available "pre-analysis" Analysis (batch) path ○ Compute aggregations ○ Create recommendations Query (serving) layer ○ Combine "live" and analyzed data to answer queries ○ May require on-the-fly analysis Alerting (speed) path (not discussed here) ○ Stream processing to detect Query (Serving) Notification (Serving) Database Indexing (Speed) Analysis (Batch) policy-based anomalies Data Alerting (Speed)
Database Options ● We chose Cassandra! ○ True P2P architecture ○ Good support for write-heavy workloads ○ Compatible data model for time series data ■ Column per metric type, timestamps as columns ● Why not MySQL? ○ Experience with operating large, sharded deployments ○ Relational data model not a good match ● Why not HBase? ○ Operational complexity - zk, hadoop, hdfs, ... ○ Special "Master" role ● Why not Dynamo? ○ Avoid vendor lock-in and high cost
Stackdriver Architecture ++ ● Archival pipeline stores all data ● Very small surface area, battle-tested ● Critical for disaster recovery ● S3 considered durable enough ● Replicated for availability Query Cassandra Roll-ups Analysis Recs Inventory Data Series Analyze ● ● ● Archive means Cassandra is "soft state" C* consolidates analysis and indexing results Properties of data in C* ● Immutable data ● Append-only ● Read-1, write-1 consistency S3 Archive Index ● Scales out easily ● Indexers, archivers, analyzers, query servers Data
Cassandra at Stackdriver Cluster Configuration ● ● ● ● ● ● Version: Datastax Community Edition 1.2.10 Replication Factor: 3 Vnodes Murmur3Partitioner Ec2Snitch ○ Aids in request efficiency ○ Enables Cassandra to ensure replicas are in different Availability Zones phi_convict_threshold: 8 -> 12 ○ Used to determine when nodes are down ○ AWS network can be spotty
Cassandra Topology in AWS Where we started... Where we are... 1 us-east-1a us-east-1a 3 2 us-east-1c us-east-1b us-east-1c Keep it balanced! us-east-1b
Cassandra EC2 Node Configuration ● m1.xlarge ○ 4 cores ○ 15 GB RAM ○ 4 ephemeral disks available ● 4 disks RAID-0 for Data Volume and CommitLog ○ ○ ○ ○ ext4 - defaults,noatime mdadm RAID-0 Compactions Heavy Read/Write IO
Cassandra Automation and Operations ● Combination of Boto, Fabric, & Puppet ○ Boto for AWS API ○ Fabric + Puppet for Bootstrapping ○ Fabric for Operations ● One command to: ○ ○ ○ ○ ○ Launch a new cluster Upsize a cluster Replace a dead node Remove existing nodes List nodes in a cluster
Our (Internal) Slogan
Cassandra Backups using S3 ● No Cassandra Powered Backups ● Restore from S3 ● Useful for major version upgrades Data S3 Bulk Loader Map Reduce 1. Data is archived when it is received 2. Bulk loader reads from S3 3. M/R re-analyzes data 4. Cassandra is repopulated Cassandra
Disaster Recover in the Wild ● ● ● ● ● ● ● ● October 23, Stackdriver suffered a total loss of our C* cluster ● Exhausted memory due to number of open file descriptors (see graph) We did not notice the problem until it was too late ● Nodes began crashing, resulted in inconsistent view of the ring Attempted to restart the cluster unsuccessfully for ~2 hours Provisioned new 36 node cluster in ~2 hours Directed “live” data to new cluster Started bulk restore operation from archive ● Full-fidelity data and aggregations No data loss due to archival pipeline See http://www.stackdriver.com/post-mortem-october-23-stackdriver-outage/
Cluster Restoration Process S3 Map Reduce Bulk Loader Historical Data New Cluster UI UI UI UI UI API UI UI Gateway New Data Old Cluster
Thank you! Yes, we are hiring! Patrick Eaton - firstname.lastname@example.org - @PatrickREaton Joey Imbasciano - email@example.com - @_joeyi
Stackdriver relies heavily on Cassandra running in AWS to store, analyze, and query across 100+ million data points per day. As we move to the public beta ...
Amazon Web Services – Apache Cassandra on AWS January 2016 Page 4 of 52 Performing Backups 41 Building Custom AMIs 42 Migration into AWS 42
Cassandra's data model offers convenience when you need to scale and deliver high availability without comprising database performance. View the ...
Earlier this week, Stackdriver’s Joey Imbasciano and AWS’s Miles Ward teamed up on a webinar titled Best Practices for Running Apache Cassandra on ...
Best practice cassandra setup on ec2 ... The above two tips should satisfy basic availability in AWS and in case your ... Cassandra running compact ...
It is also possible to efficiently double the size of a Cassandra cluster while it is running. ... Netflix is using Cassandra on AWS as a key ...
Get in the Ring with Cassandra and ... information defines AWS EC2 ... that will cover memory and disk constraints that occur when running Cassandra.
Simple Cassandra Instance In AWS EC2. ... Here are the steps to spin up an EC2 instance and getting a single instance of Cassandra running on it. AWS setup.