Managing Cassandra at Scale by Al Tobey


Published on February 24, 2014

Author: planetcassandra

Source: slideshare.net

Description

Al Tobey presents on Managing Cassandra at Scale.

Managing Cassandra: SCALE 12x. @AlTobey, Open Source Mechanic | DataStax, Open Source Evangelist for Apache Cassandra. ©2014 DataStax. Obsessed with infrastructure my whole life; I see distributed systems everywhere.

Five years of Cassandra: from 0.1 (July 2008) through 0.3, 0.6, 0.7, 1.0, 1.2, DSE, and 2.0 over years zero through five. [version timeline]

Why Cassandra? Or any non-relational database?

Al Tobey (アルトビー). My boys, Leo and Rufus. I did not have permission from my wife to use her image ;)

Users are less tolerant of downtime. For the record: can you spell HTTP? More and more services are online, and they must be available.

Why choose Cassandra? 100,000,000 internet users in Japan, and many more across the world … B2B, video conferencing, shopping, entertainment.

Traditional solutions… …may not be a good fit.

Client-server: "All your wildest dreams will come true." The client-server database architecture is obsolete: the original "funnel-shaped" architecture.

3-tier: client - application server - database architecture. Still suitable for small applications, e.g. LAMP, RoR.

3-tier + caching: add caches and master/slave replicas, and the system gets more complex. Cache coherency is a hard problem, and cascading failures are common. Next: out with the funnel, in with the ring.

Webscale: outer ring: clients (cell phones, etc.); middle ring: application servers; inner ring: Cassandra servers. Serving millions of clients with mere hundreds or thousands of nodes requires a different approach to applications!

When it rains: scale out … but at a cost.

A new plan

Dynamo paper (2007) • How do we build a data store that is: reliable, performant, "always on"? • Nothing new and shiny: evolutionary, real computer science • Also the basis for Riak and Voldemort

BigTable (2006) • Richer data model • 1 key, lots of values • Fast sequential access • 38 papers cited

Cassandra (2008) • Distributed features of Dynamo • Data model and storage from BigTable • On February 17, 2010 it graduated to a top-level Apache project

Cassandra - More than one server • All nodes participate in a cluster • Shared nothing • Add or remove as needed • More capacity? Add a server

VLDB benchmark (RWS): throughput (ops/sec) chart comparing Cassandra, HBase, Redis, and MySQL.

Cassandra - Fully Replicated • Client writes locally • Data syncs across the WAN • Replication per data center
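Per-data-center replication is configured on the keyspace. A minimal CQL sketch, assuming hypothetical data center names us_east and us_west:

CREATE KEYSPACE example
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_east': 3,  -- three replicas in this data center
    'us_west': 3
  };

Writes land locally in either data center and sync across the WAN to the other.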

Read-Modify-Write

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null
After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

This might be what it looks like from SQL / CQL, but …

Read-Modify-Write

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null
After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

RDBMS: TNSTAAFL, "there's no such thing as a free lunch". If you're lucky, the cell is in cache; otherwise, it's one disk access to read and another to write.

Eventual Consistency

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

Replica 1: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null
Replica 2: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

The distributed read-modify-write goes through a coordinator node and is more complicated; we'll talk about how it's abstracted in CQL later.

Eventual Consistency

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

Replica 1: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null
Replica 2: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

The coordinator handles the read and write paths against replicas in memory; replication happens on write, depending on RF, usually RF=3. Reads AND writes remain available through partitions, with hinted handoff covering temporarily unreachable replicas.
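In CQL this distributed machinery is abstracted behind a per-request consistency level. A minimal cqlsh sketch reusing the Employees update from the slides (the choice of LOCAL_QUORUM is illustrative):

CONSISTENCY LOCAL_QUORUM;  -- cqlsh command: applies to subsequent statements
UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

With RF=3, LOCAL_QUORUM means the coordinator waits for 2 of the 3 replicas in its data center to acknowledge before reporting success.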

Avoiding read-modify-write: the cassandra13_drinks column family

memtable:
  Albert:    Tuesday 6,  Wednesday 0
  Evan:      Tuesday 0,  Wednesday 0
  Frank:     Tuesday 3,  Wednesday 3
  Kelvin:    Tuesday 0,  Wednesday 0
  Krzysztof: Tuesday 0,  Wednesday 0
  Phillip:   Tuesday 12, Wednesday 0

⁍ A CF to track how much I expect my team at Ooyala to drink
⁍ Row keys are names
⁍ Column keys are days
⁍ Values are a count of drinks

Avoiding read-modify-write: the cassandra13_drinks column family

memtable:
  Albert:  Tuesday 2, Wednesday 0
  Phillip: Tuesday 0, Wednesday 1

sstable:
  Albert:    Tuesday 6,  Wednesday 0
  Evan:      Tuesday 0,  Wednesday 0
  Frank:     Tuesday 3,  Wednesday 3
  Kelvin:    Tuesday 0,  Wednesday 0
  Krzysztof: Tuesday 0,  Wednesday 0
  Phillip:   Tuesday 12, Wednesday 0

⁍ Next day, after a flush
⁍ I'm speaking, so I decided to drink less
⁍ Phillip informs me that he has quit drinking

Avoiding read-modify-write: the cassandra13_drinks column family

memtable:
  Albert: Tuesday 22, Wednesday 0

sstable (newer):
  Albert:  Tuesday 2, Wednesday 0
  Phillip: Tuesday 0, Wednesday 1

sstable (older):
  Albert:    Tuesday 6,  Wednesday 0
  Evan:      Tuesday 0,  Wednesday 0
  Frank:     Tuesday 3,  Wednesday 3
  Kelvin:    Tuesday 0,  Wednesday 0
  Krzysztof: Tuesday 0,  Wednesday 0
  Phillip:   Tuesday 12, Wednesday 0

⁍ I'm drinking with all you people, so I decide to add 20
⁍ Read 2, add 20, write 22

Avoiding read-modify-write: the cassandra13_drinks column family

sstable (after compaction):
  Albert:    Tuesday 22, Wednesday 0
  Evan:      Tuesday 0,  Wednesday 0
  Frank:     Tuesday 3,  Wednesday 3
  Kelvin:    Tuesday 0,  Wednesday 0
  Krzysztof: Tuesday 0,  Wednesday 0
  Phillip:   Tuesday 0,  Wednesday 1

⁍ After compaction & conflict resolution
⁍ Overwriting the same value is just fine! Works really well for some patterns, such as time-series data
⁍ Separate read/write streams are handy for debugging, but not a big deal
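A CQL sketch of the column family above (column names and types are inferred from the slides, not taken from the deck's actual schema):

CREATE TABLE cassandra13_drinks (
  name varchar,   -- row key: team member
  day varchar,    -- column key: day of the week
  drinks int,     -- value: count of drinks
  PRIMARY KEY (name, day)
);

-- A blind overwrite: no read required; the newest timestamp
-- wins when sstables are compacted together.
INSERT INTO cassandra13_drinks (name, day, drinks) VALUES ('Albert', 'Tuesday', 22);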

Overwriting

CREATE TABLE host_lookup (
  name varchar,
  id uuid,
  PRIMARY KEY(name)
);

INSERT INTO host_lookup (name, id) VALUES ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);
INSERT INTO host_lookup (name, id) VALUES ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);
INSERT INTO host_lookup (name, id) VALUES ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

SELECT id FROM host_lookup WHERE name='tobert.org';

Beware of expensive compaction. Best for: small indexes, lookup tables. Compaction handles RMW at the storage level in the background. Under heavy writes, clock synchronization is very important to avoid timestamp collisions. In practice this isn't a problem very often, and even when it goes wrong, not much harm is done.

Key/Value

CREATE TABLE keyval (
  key varchar,
  value blob,
  PRIMARY KEY(key)
);

INSERT INTO keyval (key, value) VALUES (?, ?);

SELECT value FROM keyval WHERE key=?;

e.g. memcached. Don't do this. But it works when you really need it.
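The ? placeholders above are driver-side prepared-statement parameters. Bound with concrete values in cqlsh instead, it might look like this sketch (the key and payload are illustrative; textAsBlob converts the text literal for the blob column):

INSERT INTO keyval (key, value) VALUES ('session:1234', textAsBlob('{"user": 42}'));

SELECT value FROM keyval WHERE key = 'session:1234';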

Journaling / Logging / Time-series

CREATE TABLE tsdb (
  time_bucket timestamp,
  time timestamp,
  value blob,
  PRIMARY KEY(time_bucket, time)
);

INSERT INTO tsdb (time_bucket, time, value) VALUES (
  '2014-10-24',                     -- 1-day bucket (UTC)
  '2014-10-24T12:12:12Z',           -- ALWAYS USE UTC
  textAsBlob('{"foo": "bar"}')      -- convert the text literal for the blob column
);

Oversimplified; use normalization over blobs whenever possible. ALWAYS USE UTC :)

Journaling / Logging / Time-series

On disk, each bucket is one wide row keyed by day, with one column per timestamp:

2014-01-24: 2014-01-24T12:12:12Z = {"key": "value"}, 2014-01-24T21:21:21Z = {"key": "value"}
2014-01-25: 2014-01-25T13:13:13Z = {"key": "value"}

{ "2014-01-24" => {
    "2014-01-24T12:12:12Z" => '{"foo": "bar"}'
} }

Oversimplified; use normalization over blobs whenever possible. ALWAYS USE UTC :)
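Reading a bucket back is a single-partition query. A minimal sketch against the tsdb table above (the time range is illustrative):

SELECT time, value
  FROM tsdb
 WHERE time_bucket = '2014-10-24'
   AND time >= '2014-10-24T00:00:00Z'
   AND time <  '2014-10-24T06:00:00Z';

Because time is the clustering column, rows come back in timestamp order with no client-side sorting.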

Cassandra Collections

CREATE TABLE posts (
  id uuid,
  body varchar,
  created timestamp,
  authors set<varchar>,
  tags set<varchar>,
  PRIMARY KEY(id)
);

INSERT INTO posts (id, body, created, authors, tags) VALUES (
  ea4aba7d-9344-4d08-8ca5-873aa1214068,
  'アルトビーの犬はばかね',
  dateof(now()),
  {'アルトビー', 'ィオートビー'},
  {'dog', 'silly', '犬', 'ばか'}
);

Quick story about 犬ばかね ("silly dog"). Sets & maps are CRDTs, safe to modify.

Cassandra Collections

CREATE TABLE metrics (
  bucket timestamp,
  time timestamp,
  value blob,
  labels map<varchar,varchar>,
  PRIMARY KEY(bucket, time)
);

Sets & maps are CRDTs, safe to modify.
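Because collection elements map to individual storage cells, they can be mutated without reading first. A minimal sketch against the two tables above (key values are illustrative):

-- add a tag to the posts row; no read-modify-write
UPDATE posts SET tags = tags + {'cassandra'}
 WHERE id = ea4aba7d-9344-4d08-8ca5-873aa1214068;

-- set one entry in the labels map
UPDATE metrics SET labels['host'] = 'node1'
 WHERE bucket = '2014-01-24' AND time = '2014-01-24T12:12:12Z';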

Lightweight Transactions • Cassandra 2.0 and later support LWT based on Paxos • Paxos is a distributed consensus protocol • Given a constraint, Cassandra ensures correct ordering

Lightweight Transactions

UPDATE users SET username='tobert'
 WHERE id=68021e8a-9eb0-436c-8cdd-aac629788383
    IF username='renice';

INSERT INTO users (id, username)
VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
IF NOT EXISTS;

Client error on conflict.
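A conditional write reports whether it actually took effect. A sketch of what cqlsh shows when the INSERT above loses the race (exact output format varies by version; drivers expose the same [applied] column so clients can detect the conflict and retry):

INSERT INTO users (id, username)
VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
IF NOT EXISTS;

 [applied]
-----------
     False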

Brendan Gregg’s Tools Chart

dstat -lrvn 10   # load, disk IO requests, vmstat-style counters, network; 10-second intervals

iostat -x 1   # extended per-device utilization stats, refreshed every second

htop

DataStax OpsCenter

Config Changes: Apache 1.0 ➜ DSE 3.0

⁍ Schema: compaction_strategy = LCS
⁍ Schema: bloom_filter_fp_chance = 0.1
⁍ Schema: sstable_size_in_mb = 256
⁍ Schema: compression_options = Snappy
⁍ YAML: compaction_throughput_mb_per_sec: 0

⁍ LCS is a huge improvement in operational life (no more major compactions)
⁍ Bloom filters were tipping over a 24GiB heap
⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger: > 100,000 open files slows everything down, especially startup, and 256 MB vs. 5 MB is a 50x reduction in file count (the default has been fixed as of C* 2.0)
⁍ Compaction can't keep up: even huge rates don't work, so throttling must be disabled; try to adjust heap, etc. so you're flushing nearly full memtables to reduce compaction needs
⁍ Backreference RMW? Might be fixed in >= 1.2
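Expressed as CQL DDL, the schema side of those changes looks roughly like this sketch (my_cf is a hypothetical table name; syntax is the CQL3 map form used by Cassandra 1.2/2.0):

ALTER TABLE my_cf
  WITH compaction = { 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 256 }
   AND compression = { 'sstable_compression': 'SnappyCompressor' }
   AND bloom_filter_fp_chance = 0.1;

The YAML side is one line in cassandra.yaml: compaction_throughput_mb_per_sec: 0 disables compaction throttling.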

nodetool ring

10.10.10.10  Analytics  rack1  Up  Normal  47.73 MB   1.72%   101204669472175663702469172037896580098
10.10.10.10  Analytics  rack1  Up  Normal  63.94 MB   0.86%   102671403812352122596707855690619718940
10.10.10.10  Analytics  rack1  Up  Normal  85.73 MB   0.86%   104138138152528581490946539343342857782
10.10.10.10  Analytics  rack1  Up  Normal  47.87 MB   0.86%   105604872492705040385185222996065996624
10.10.10.10  Analytics  rack1  Up  Normal  39.73 MB   0.86%   107071606832881499279423906648789135466
10.10.10.10  Analytics  rack1  Up  Normal  40.74 MB   1.75%   110042394566257506011458285920000334950
10.10.10.10  Analytics  rack1  Up  Normal  40.08 MB   2.20%   113781420866907675791616368030579466301
10.10.10.10  Analytics  rack1  Up  Normal  56.19 MB   3.45%   119650151395618797017962053073524524487
10.10.10.10  Analytics  rack1  Up  Normal  214.88 MB  11.62%  139424886777089715561324792149872061049
10.10.10.10  Analytics  rack1  Up  Normal  214.29 MB  2.45%   143588210871399618110700028431440799305
10.10.10.10  Analytics  rack1  Up  Normal  158.49 MB  1.76%   146577368624928021690175250344904436129
10.10.10.10  Analytics  rack1  Up  Normal  40.3 MB    0.92%   148140168357822348318107048925037023042

⁍ Hotspots: watch for nodes owning outsized ranges, like the 11.62% entry above

nodetool cfstats

Keyspace: gostress
  Read Count: 0
  Read Latency: NaN ms.
  Write Count: 0
  Write Latency: NaN ms.
  Pending Tasks: 0
    Column Family: stressful
    SSTable count: 1                          (controllable by sstable_size_in_mb)
    Space used (live): 32981239
    Space used (total): 32981239
    Number of Keys (estimate): 128
    Memtable Columns Count: 0
    Memtable Data Size: 0
    Memtable Switch Count: 0
    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 0
    Write Latency: NaN ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 0
    Bloom Filter False Ratio: 0.00000
    Bloom Filter Space Used: 336              (could be using a lot of heap)
    Compacted row minimum size: 7007507
    Compacted row maximum size: 8409007
    Compacted row mean size: 8409007

⁍ Watch bloom filters and sstable_size_in_mb

nodetool proxyhistograms

Offset  Read Latency  Write Latency  Range Latency
35      0             20             0
42      0             61             0
50      0             82             0
60      0             440            0
72      0             3416           0
86      0             17910          0
103     0             48675          0
124     1             97423          0
149     0             153109         0
179     2             186205         0
215     5             139022         0
258     134           44058          0
310     2656          60660          0
372     34698         742684         0
446     469515        7359351        0
535     3920391       31030588       0
642     9852708       33070248       0
770     4487796       9719615        0
924     651959        984889         0

⁍ Units are microseconds
⁍ Can give you a good idea of how much latency coordinator hops are costing you

nodetool compactionstats

al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type  keyspace  column family    bytes compacted  bytes total  progress
Compaction       hastur    gauge_archive    9819749801       16922291634  58.03%
Compaction       hastur    counter_archive  12141850720      16147440484  75.19%
Compaction       hastur    mark_archive     647389841        1475432590   43.88%
Active compaction remaining time : n/a

al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type  keyspace  column family    bytes compacted  bytes total  progress
Compaction       hastur    gauge_archive    10239806890      16922291634  60.51%
Compaction       hastur    counter_archive  12544404397      16147440484  77.69%
Compaction       hastur    mark_archive     1107897093       1475432590   75.09%
Active compaction remaining time : n/a

Stress Testing Tools

⁍ cassandra-stress
⁍ YCSB
⁍ Production
⁍ Terasort (DSE)
⁍ Homegrown

⁍ We mostly focus on cassandra-stress for burn-in of new clusters
⁍ It can quickly figure out the right setting for -Xmn
⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it's fast!)
⁍ Consider writing custom benchmarks for your application patterns; sometimes it's faster to write one than to figure out how to make a generic tool do what you want

/etc/sysctl.conf

kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.swappiness = 1

⁍ pid_max doesn't fix anything; I just like it and have never had a problem with it
⁍ These are my starting-point settings for nearly every system/application; generally safe for production
⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware: you're more likely to see FS/file corruption on power loss
⁍ You will get latency spikes if you hit dirty_ratio (a percentage of RAM), so don't tune too low

/etc/rc.local

ra=$((2**14))                           # 16 KiB readahead, in bytes
ss=$(blockdev --getss /dev/sda)         # logical sector size, usually 512
blockdev --setra $((ra / ss)) /dev/sda  # --setra takes a count of sectors

echo 128 > /sys/block/sda/queue/nr_requests
echo deadline > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size

⁍ Lower readahead is better for latency on seeky workloads
⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
⁍ nr_requests = number of IO structs the kernel will keep in flight; don't go crazy
⁍ Deadline is best for raw throughput
⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
⁍ The default stripe cache is 128; the increase seems to help MD RAID5 a lot
⁍ Don't forget to set readahead separately for MD RAID devices

Ending discussion notes
• 2-socket servers with ECC memory
• 16GiB RAM minimum; prefer 32-64GiB; over 128GiB and Linux will need serious tuning
• SSD where possible; the Samsung 840 Pro is a good choice, and any Intel is fine
• NO SAN/NAS, 20ms latency tops; if you MUST (and please, don't), dedicate spindles to C* nodes and use a separate network
• Avoid disk configurations targeted at Hadoop; those disks are too slow
• http://www.datastax.com/documentation/cassandra/2.0/pdf/cassandra20.pdf: read the sections on Repair, Tombstones & Snapshots
