2012 - A Release Odyssey

Technology

Published on March 12, 2014

Author: mxyzplk

Source: slideshare.net

Description

Lightning talk for DevOpsDays Austin 2013 on taking releases from a 10 week to 1 week cadence. Sorry about the format, had to go from Keynote to PDF and since it was a lightning talk all the actual content's in the notes.

DevOpsDays Austin 2013 | @ernestmueller | @bazaarvoice

2012: A Release Odyssey

Hi, I’m Ernest Mueller from Bazaarvoice here in Austin. We’re the biggest SaaS company you’ve never heard of; our primary application is for the collection and display of user generated content – for example, ratings and reviews – and a lot of the biggest Internet retailers use our solution on their sites for that purpose. We pushed out more than 1bn reviews last Cyber Monday. I’m going to tell you how we went from releasing our code once every ten weeks to once a week in a pretty short time.

The Monolith! Bazaarvoice Conversations, aka PRR, has 15,000 files and 4.9M lines of code, the oldest from Feb 2006, and that’s not counting UI versions, customer config, or operations code repos (all of which get released along with it). It was written by generations of coders, including outsourcer partners. It runs across 1200 hosts in 4 datacenters: Rackspace, and AWS East, West, and Ireland. So by any measure this was a large legacy system.

BV had gone agile and said “Let’s release more quickly too! All the cool kids are doing it! We’re doing two week sprints, so let’s release biweekly - go!” They tried it two weeks after a big ten-week release, and PRR v5.1 launched on January 19th, 2012. Whoops, it’s not that easy - 44 client tickets logged, mass hysteria. “Let’s not do that again!”

Enter yours truly on January 30th. “You’re hired! We want biweekly releases in a month. With zero user facing downtime. Failure is not an option! Go!” It wasn’t just an irrational need for speed, the product organization wanted to get faster A/B testing, more piloting, etc. and the engineering team wanted the benefits of a more continuous flow as well.

Careful analysis of the situation was warranted. Luckily a SWAT team had been analyzing the problem already. The two major impediments, which are frequently encountered factors in legacy implementations:

• Lack of automation in testing - testing was a huge burden and couldn’t be done sufficiently in the time allotted
• Poor SCM code discipline - checkins continuing up to the release

Path One - Testing! We hired up QA automation people and set them to work. We set the expectation, backed up strongly by the product team, that the development teams had to stop and do three testing sprints. We have a standard four-environment setup - dev, QA, staging, production.

JUnit testing and CIT testing in TeamCity were ramped up. A Selenium-based “Testmaster” system was used to improve the level of regression automation to safe levels. More importantly perhaps, we adopted a new discipline of not running all the tests all the time - feature/story in dev, regression in QA, smoke testing in staging and production.
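To make the "not all the tests all the time" rule concrete, here is a minimal sketch of how the suite-per-environment split might be written down. The environment names come from the four-environment setup described above; the suite labels and the `suites_for` helper are assumptions for illustration, not the actual Testmaster configuration.

```python
# Hypothetical mapping of test suites to environments, following the rule
# "feature/story in dev, regression in QA, smoke in staging and production".
SUITES_BY_ENV = {
    "dev": ["feature"],
    "qa": ["regression"],
    "staging": ["smoke"],
    "production": ["smoke"],
}


def suites_for(env):
    """Return the list of suites to run in the given environment.

    Lookup is case-insensitive; an unknown environment raises ValueError
    so a typo cannot silently skip testing.
    """
    try:
        return SUITES_BY_ENV[env.lower()]
    except KeyError:
        raise ValueError("unknown environment: %r" % env) from None


if __name__ == "__main__":
    for name in ("dev", "QA", "staging", "production"):
        print(name, "->", suites_for(name))
```

The point of keeping this as data is that each environment gets only the suite that answers its question, rather than every environment paying for the full run.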

Branching - changed over to a trunk/release branch model, splits off every 2 weeks, no commits to branch without going through a code freeze break process. Process enforcement via wiki! Trunk goes to dev twice daily, branch goes to QA, when labeled “verified” it goes to staging and then to production.
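The promotion rules in that paragraph can be sketched as a small gate function. This is only an illustration of the flow as described (trunk to dev, branch to QA, "verified" branch builds on to staging and production); the `Build` type and `allowed_targets` name are made up here, and the real enforcement was process plus wiki rather than code.

```python
from dataclasses import dataclass


@dataclass
class Build:
    """A hypothetical build record; not the real TeamCity artifact model."""
    source: str             # "trunk" or "branch"
    verified: bool = False  # True once QA has labeled the build "verified"


def allowed_targets(build):
    """Return the environments a build may be deployed to.

    Trunk builds only ever go to dev. Release-branch builds go to QA, and
    may continue to staging and production once they are verified.
    """
    if build.source == "trunk":
        return ["dev"]
    if build.source == "branch":
        if build.verified:
            return ["qa", "staging", "production"]
        return ["qa"]
    raise ValueError("unknown build source: %r" % build.source)


if __name__ == "__main__":
    print(allowed_targets(Build("trunk")))
    print(allowed_targets(Build("branch")))
    print(allowed_targets(Build("branch", verified=True)))
```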

We also had a team write a feature flagging system, like the cool kids use, so we could launch features dark and then enable them later. We made the rule that all new features must be launched dark.
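For anyone who has not seen one, a feature flag system that enforces "launch dark" can be very small at its core. The sketch below is a generic in-memory version with invented names (`FeatureFlags`, `register`, `enable`); it is not the system the team built, which would also need persistence and per-client control.

```python
class FeatureFlags:
    """A minimal in-memory feature flag registry (illustrative only)."""

    def __init__(self):
        self._enabled = {}

    def register(self, name):
        """Register a new feature. New features always start dark (off)."""
        self._enabled.setdefault(name, False)

    def enable(self, name):
        """Turn on a feature that has already shipped dark."""
        if name not in self._enabled:
            raise KeyError("feature not registered: %r" % name)
        self._enabled[name] = True

    def is_enabled(self, name):
        """Unknown or unregistered features are treated as off."""
        return self._enabled.get(name, False)


if __name__ == "__main__":
    flags = FeatureFlags()
    flags.register("new_review_widget")   # hypothetical feature name
    print(flags.is_enabled("new_review_widget"))  # dark at release time
    flags.enable("new_review_widget")     # switched on later, no redeploy
    print(flags.is_enabled("new_review_widget"))
```

Defaulting to off is the design choice that matters: the code release and the feature launch become two separate decisions.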

We couldn’t fix a couple things in time. Our Solr indexes are 20 GB, and reindexing and distributing them while doing a zero downtime deployment and keeping replication lag down needed more engineering. And our build and deploy system was pretty bad. It’s buzzword compliant - svn, TeamCity, maven, yum, puppet, rundeck, noah - but it’s actually a bit of a spaghetti mess in a big crufty bash framework; builds take more than an hour and deploys take 3+ hours.

We got a delay of game due to our IPO and then were “no go” March 1. We were under a lot of management pressure to ship, but tests weren’t passing and at the new go/no-go meeting the dev managers sucked it up and declared “no go.”

First biweekly release - PRR 5.2 went out on March 6, 5 days late. 5 issues were reported by customers. 5.3 went out March 22, 1 issue reported. 5.4 went out April 5, zero issues reported. I kept in depth release metrics - number of checkins, number of process faults, number of support tickets - and they showed consistent improvement.

It took a lot of collaboration and good old fashioned project management. Product, QA, DevOps, various engineering teams, Support, and other stakeholders had to all get on the same page. We didn’t really change tooling besides adding the feature flagging - still Confluence, JIRA, and all our other tools - just using them more effectively. http://www.flickr.com/photos/senorwences/2366892425/

And the release train kept spinning. We had one major disaster on May 17, when a major architectural change to our product feeds went out in a release and generated 28 client reported issues (up from a nice rolling average of .5). We enhanced our process to link each svn checkin to a ticket, put together a page requiring per-ticket signoff for the release, and started tracking more quality metrics. This got us consistently smooth releases through the summer of 2012.
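Linking each checkin to a ticket is easy to check mechanically. As a rough sketch, assuming JIRA-style keys such as `PRR-1234` appear in the commit message (the project key and message format are assumptions, not the team's actual convention), a release page could flag unlinked checkins like this:

```python
import re

# JIRA-style issue key: an uppercase project key, a dash, and a number.
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def ticket_ids(message):
    """Return every ticket key mentioned in a commit message."""
    return TICKET_RE.findall(message)


def unlinked_checkins(messages):
    """Return the commit messages that reference no ticket at all."""
    return [m for m in messages if not ticket_ids(m)]


if __name__ == "__main__":
    log = [
        "PRR-1234: fix product feed export",  # hypothetical messages
        "quick fix, will file a ticket later",
    ]
    for message in unlinked_checkins(log):
        print("no ticket:", message)
```

The same extraction gives the per-ticket list needed for a signoff page: collect the keys from every checkin on the release branch and require a signoff for each.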

But we weren’t done there. We wanted to totally pwn the old way, and the next step was weekly releases. There were still some parts of the process that were manual and painful, and we were still having some “misses” causing production issues. “If it’s painful, do it more often” is a message that some folks still balk at when confronted with, but it is absolutely true.

This was a lot easier - the QA team worked in the background to get the test coverage numbers up and then we said to the teams, “We’re going weekly in two weeks... Same process otherwise.” Version 6.7 launched on September 27, a week after 6.6. Client reported issues stemming from a code release have averaged around zero since that time. Solr index distribution was automated; the indexes get regenerated beforehand, shipped out to the data centers, brought up to date, and then swapped in during releases. Solr reindexing automation went live October 18, 2012. Then we trained the developers to take over the release process. We skipped some releases during Black Friday, but are shipping PRR 9.0 this week (mostly in our absence!).

As I mentioned, our build and deployment is already automated (somewhat sketchily) with TeamCity, puppet, Rundeck, and noah. Our next step in killing off the old way is in progress: renovating our build system - moving to git with gerrit for code reviewing, and upgrading our TeamCity installation so it can be API controlled - and fixing the crappy CIT tests that have been languishing there. We currently have trouble with failing CIT because the failures are intermittent, so we don’t block people on it. We’ll get build and CIT running fast (currently a 1 hour build and a 40 minute CIT).

After that we will get rid of the bash-spaghetti deployment system we have and make deploys faster and better (currently 3 hours). We’re removing the separate staging roll (staging = production because it’s client facing) and going to continuous deployment off trunk to our QA system. Some of this is technology-faster and some is process-faster - having to promote up four environments, when it takes 4 hours per, and when staging and production have to happen in maintenance windows, is slow.

And eventually... Continuous deployment. The cloud kids get to start there, but it takes some heavy lifting to get a large, established system there. But that’s the sequel, 2013: A Release Odyssey.

And that’s my story! Hit me up at theagileadmin.com. And thanks to 2001: A Space Odyssey for all the screen caps I used as part of this presentation.
