SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools


Published on March 19, 2014

Author: SaltStack

Source: slideshare.net

Description

As infrastructure scales, simple tasks become increasingly difficult. For large infrastructures to be manageable, we use automation. But automation, like any power tool, comes with its own set of risks and challenges. Automation should be handled like production code, and great care should be exercised with power tools. This talk will cover how SaltStack is used at LinkedIn and offer tips and tricks for automating management with SaltStack at massive scale including a look at LinkedIn-inspired Salt features such as blacklist and prereq states. It will also cover Salt master and minion instrumentation and a compilation of how not to use Salt.

©2013 LinkedIn Corporation. All Rights Reserved. Safety with power tools

Who’s this guy?

What is SRE?
• Hybrid of operations and engineering
• Heavily involved in architecture and design
• Application support ninjas
• Masters of automation

So, what do I do with Salt?
• Heavy user
• Active developer
• Administrator (less so)

What’s LinkedIn?
• Professional social network
• You probably all have an account
• You probably all get email from us too

Salt @ LinkedIn
• When LinkedIn started
  – Aug 2011: Salt 0.8.9
  – ~5k minions
• When I got involved
  – May 2012: Salt 0.9.9
  – ~10k minions
• Today
  – Now: 2014.01
  – ~30k minions

How should you manage a service?

That’s not much of an answer…
• Depends on use
  – Home
  – School
  – Hack
  – Work
• How you manage the service changes over time
  – Make it work: very manual, and it takes a long time to get working (more of a work of art…)
  – Reproducibly make it work
  – Script it out
  – And more?

Apache Traffic Server

ATS: Apache Traffic Server
• Fast, scalable and extensible HTTP/1.1 compliant caching proxy server
• Non-blocking IO
• Plugin architecture
• This is the real logo

Example: ATS deployment @ LinkedIn
• When I started, deployment was less than ideal:
  – Check into SVN
  – SCP files to hosts
  – Manually remove host from rotation
  – Replace files and install RPMs
  – Restart trafficserver
  – Check some logs to see if it’s broken
  – Put it in rotation and hope you didn’t miss anything

Example: ATS deployment @ LinkedIn
• So many steps!
  – Manual config management
  – Manual rpm deployment
  – Manual * (<- seriously, you name it!)
• Works for a while, but doesn’t scale
• Very VERY error prone

Solution? Automation with Salt!
• Pillars, runners, and modules, Oh My!
• States make this dead simple

Obligatory SLS formulas

    ats:
      pkg:
        - installed
        - pkgs:
          - trafficserver: x.x.x-xx
          - trafficserver-plugin-header-rewrite: x.x.x-x
          # ... (there are lots)
      service:
        - name: trafficserver
        - running

    /etc/trafficserver/records.config:
      file.managed:
        - makedirs: True
        - user: nobody
        - group: nobody
        - mode: 600
        - source: http://repo/ats/records.config
        - source_hash: md5=20d90b82bb3a4f95d7f17d1be6257246

Great, SLS, like I wasn’t going to see those @ SaltConf
• Had to, sorry!

What is Salt?

What is Salt @ LinkedIn?
• Remote execution
  – salt '*' cmd.run 'date -s "`date`"' (leap-pocalypse anyone?)
• “Catchall” deployment system
  – ATS
  – Couchbase
  – Etc.
• Automation platform
  – Remote execution behind LinkedIn’s new standardized deployment
  – Cache copy + torrent-style file distribution (in migration to Salt!)

So what’s this about power tools?
• Growing up, my dad and I did a lot of cabinetry work
• In the old days you did all this by hand
• There are actually quite a few similarities

Learning to be a carpenter
• When learning anything, you start with the basics and move up
  – Calculator-less math classes anyone?
• Carpentry 101: learn the basic tools
  – Hand saws
  – Sandpaper
  – Hammer

Learning to be a carpenter
• As a kid I always thought it was ridiculous to use these since I could *see* the power tools my dad was using
• With more experience you can use more tools, once you know how to use the ones you have
  – Tools need to be respected and used properly
  – Some tools aren’t worth learning the hard way (chainsaws!)

So, SaltConf is about carpentry??
• Well, not so much
• Computers have lots of different tools
  – ssh
  – scp
  – Package managers
  – Etc.
• As we scale it’s no longer practical to use all these manual tools, so we use power tools (automation)

How should you use Salt?
• Understand the problem
• Learn the tool
• Test the solution
• Watch for the result

How should you use Salt: Understand the problem
• “If you can't explain it simply, you don't understand it well enough.” – Albert Einstein
• What are you trying to automate?
  – Is this full stack? Or just the application?
  – What is already automated?
  – Should it be automated?
• Learn how to do it without the tooling
  – Knowing how to do the deploy manually will help you when you need to debug

How should you use Salt: Learn the tool
• “99% of the time you don’t have to write modules to use salt”
  – *Most* things you want to do can be done with existing code
  – If you find something that you think needs new code, reach out to the community: someone else probably wants it too!
• Learn what it can and can’t do
• Keep up with new features, both those just released and those coming up
• Continually train yourself and your users
• Little things can add up:
  – In your __virtual__ function, check your dependencies (~5 lines x ~30K minions)

How should you use Salt: Test the Solution
• Don’t be that guy

How should you use Salt: Test the Solution
• Fact: “AUTOMATION IS CODE!”
• It is common to set up extensive tests for code, but less so for automation
• In many ways automation testing is just as important, if not more so!
  – This applies to SLS formulas, modules, runners, AND salt itself.
  – Staging is production for infrastructure!

How should you use Salt: Test the Solution
• How do we do this @ LinkedIn?
  – Code reviews
  – VM environment: a pre-staging environment for testing
  – Stress tests: pathological test cases
  – Canary process: careful code rollouts

How should you use Salt: Watch for the result
• Once we’ve tested our automation, we need to verify that it does what we expect.
  – Code can sometimes have unintended consequences

Innocent enough right?

    @_withJMXConnection
    def domains(connection):
        '''
        returns a list of domains available
        '''
        domains = list(connection.getDomains())
        domains.sort()
        return domains

Wait, what’s that decorator?

See the problem?

    class _withJMXConnection(object):
        connection = None

        def __init__(self, fn, url):
            self.fn = fn
            if not _withJMXConnection.connection:
                # set up a jmx connection ...
                jpype.startJVM("libjvm.so",
                               "-Dcom.sun.management.jmxremote.authenticate=false",
                               "-Xms20m", "-Xmx20m")
                jmxurl = jpype.javax.management.remote.JMXServiceURL(url)
                jmxsoc = jpype.javax.management.remote.JMXConnectorFactory.connect(jmxurl)
                _withJMXConnection.connection = jmxsoc.getMBeanServerConnection()
            self.connection = _withJMXConnection.connection

Spins up a JVM!

How should you use Salt: Watch it
• Once we’ve tested our automation, we need to verify that it does what we expect.
  – Code can sometimes have unintended consequences
• What metrics do we watch?
  – CPU (load and utilization)
  – Memory (real AND virtual)
  – TCP sessions (and overflows!)
  – Event bus (MasterEvent and MinionEvent)
  – Etc.

Now everything is AWESOME!!!

NOPE! Still can have problems

Problems @ scale
• Timeouts that didn’t work
  – (#3431) original implementation relied on the zmq poller timeout, which you never hit if the event bus was relatively busy
• salt-master memory leaks (all gone now)
  – Zeromq3
  – Reaping master child processes which crash
• Performance problems on master (we’ve dropped CPU usage by ~80%)
  – Change max open files check to not run per minion request
  – Don't load minion modules every pillar call
• Slow yumpkg5 module
  – Went from 20s -> 60s! Now down to ~9s (for 55 packages)
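
The timeout problem above is a general one: a per-poll timeout restarts every time any event arrives, so on a busy bus it may never fire. A common remedy is to track one overall deadline. The sketch below illustrates that idea only and is not the change made in Salt; it uses a `queue.Queue` as a stand-in for the event bus, and the function name and event shape are assumptions.

```python
import queue
import time


def wait_for_event(events, wanted_tag, timeout, poll_interval=0.05):
    """
    Return the first event whose tag equals wanted_tag, or None once
    `timeout` seconds have passed overall, however busy the bus is.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        try:
            # Never block longer than the time left before the deadline.
            event = events.get(timeout=min(poll_interval, remaining))
        except queue.Empty:
            continue
        if event["tag"] == wanted_tag:
            return event
```

Unrelated events no longer extend the wait, because the deadline is fixed before the loop starts.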

Other features we’ve added
• yumpkg
  – support for specific versions (back in the day)
  – major performance enhancements to the yumpkg module
• Compound matchers (range & minion data)
• Prereq state
• client_acl_blacklist
• Check and set (cas) to the data module
• depends decorator
• Iterative file hashing in fileclient
• Hash cache for fileserver + hash cache reaping
• Limit memory consumption on module load in *nix
• Kwarg passing with types
• Profiler within master process
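
As an illustration of one item in the list above, iterative file hashing means feeding a file to the hash in fixed-size chunks instead of reading it into memory in one go. The sketch below shows the general technique with the standard library; it is not Salt's fileclient code, and the function name, default algorithm, and chunk size are assumptions.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=64 * 1024):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use is then bounded by `chunk_size` rather than by the size of the file being served.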

client_acl_blacklist (new in 0.13.0)
• Salt had support for whitelisting, and per-user access control
• Wanted to blacklist certain modules/users
  – No root (require sudo)
  – No cmd module (protect against fat-fingering)

    client_acl_blacklist:
      users:
        - root
        - '^(?!sudo_).*$'  # all non sudo users
      modules:
        - cmd
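
The negative lookahead in the `users` pattern above is easy to misread, so here is a small sketch of how such a blacklist could be evaluated. It approximates the idea and is not Salt's implementation: the helper `is_blacklisted` is hypothetical, and matching the module patterns against the part of the function name before the first dot is an assumption.

```python
import re

BLACKLISTED_USERS = ["root", r"^(?!sudo_).*$"]  # patterns from the slide
BLACKLISTED_MODULES = ["cmd"]


def is_blacklisted(user, fun):
    """Return True if the user or the function's module is blacklisted."""
    module = fun.split(".", 1)[0]
    user_hit = any(re.match(p, user) for p in BLACKLISTED_USERS)
    module_hit = any(re.match(p, module) for p in BLACKLISTED_MODULES)
    return user_hit or module_hit
```

With these patterns, `is_blacklisted("alice", "test.ping")` is true because `alice` does not start with `sudo_`, while `is_blacklisted("sudo_alice", "test.ping")` is false and `is_blacklisted("sudo_alice", "cmd.run")` is true.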

Prereq state (new in 0.16.0)
• Came up as we started migrating our deployments to salt states
• Motivation was to take hosts out of rotation before deployment
• This feature lets us remove our own custom wrappers!

    graceful-down:
      cmd.run:
        - name: service apache graceful
        - prereq:
          - file: site-code

    site-code:
      file.recurse:
        - name: /opt/site_code
        - source: salt://site/code
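
The ordering that prereq gives can be pictured with a toy model: the pre-required state (`site-code`) is first evaluated as a dry run, and the pre-requiring state (`graceful-down`) runs only if that dry run reports pending changes. The sketch below models that flow in plain Python under those assumptions; it does not use Salt's state compiler, and every name in it is hypothetical.

```python
def run_with_prereq(pre_requiring, pre_required):
    """
    Each argument is a callable taking test=True/False and returning True
    if it made (or, in test mode, would make) changes.
    Returns the names of the steps that were actually executed.
    """
    executed = []
    # Dry run: would the pre-required state change anything?
    if pre_required(test=True):
        pre_requiring(test=False)
        executed.append("pre_requiring")
    pre_required(test=False)
    executed.append("pre_required")
    return executed


def make_site_code(changed):
    """Build a fake 'site-code' state that does or does not have changes."""
    def site_code(test=False):
        return changed
    return site_code


def graceful_down(test=False):
    """Fake 'graceful-down' state; always reports a change when run."""
    return True
```

When new site code is pending, the result is `["pre_requiring", "pre_required"]`; when nothing would change, only `["pre_required"]` runs and the host is never taken out of rotation.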

Kwarg passing with types
• Found while trying to pass a pillar as a kwarg to a module (p.s. don’t)
• Kwargs were cast to strings and passed as args
  – Fine if the __str__ representation == yaml
  – Problem if the __str__ representation != yaml
• Fix: put all kwargs in a single dict (marked as the kwarg dict) to maintain type
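
To make the type problem concrete: `str(None)` is `"None"`, which a YAML parser reads back as a string rather than as null. The sketch below shows the shape of the fix described above, packing all kwargs into one marked dict so that values keep their types. It is an illustration, not Salt's code: the marker key name, the `pack` and `unpack` helpers, and the use of JSON as a stand-in wire format are all assumptions.

```python
import json

KWARG_MARKER = "__kwarg__"  # marker key name assumed for this sketch


def pack(args, kwargs):
    """Pack kwargs into one marked dict appended to the positional args."""
    packed = list(args)
    if kwargs:
        marked = dict(kwargs)
        marked[KWARG_MARKER] = True
        packed.append(marked)
    return json.dumps(packed)  # json stands in for the wire format


def unpack(payload):
    """Split a packed payload back into (args, kwargs), types preserved."""
    args, kwargs = [], {}
    for item in json.loads(payload):
        if isinstance(item, dict) and item.get(KWARG_MARKER) is True:
            kwargs.update({k: v for k, v in item.items() if k != KWARG_MARKER})
        else:
            args.append(item)
    return args, kwargs
```

Round-tripping `pack(["web"], {"ports": [80, 443], "enabled": None})` through `unpack` returns the list and the `None` unchanged instead of as strings.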

Takeaways
• Respect the tool!
  – Understand the problem
  – Learn the tool
  – Test the solution
  – Watch for the result
• Be active in the community
• Don’t just consume, contribute!
• Have FUN!

Got more questions about Salt @ LinkedIn?
• Interested in how we manage Salt @ Scale?
  – Breakout session with Craig Sebenik @ 11:15 am in Sundance
• Got questions?
  – Drop by our SaltConf booth!
  – Connect with me on LinkedIn: www.linkedin.com/in/jacksontj
  – Jacksontj on #salt on freenode
