The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2 by Thomas A. Limoncelli (2014-09-13)

there is a scheduled “change freeze” for holidays, quarterly financial reporting, or days when too many key people are on vacation. A simple list of dates to pause deployments covers these situations.
• Oncall Schedule: To avoid dragging the current oncall engineer out of bed, pushes may be paused during typical sleeping hours. If there are multiple oncall teams, each covering different times of the day, each may have a different idea of waking hours. Even a three-shift, follow-the-sun
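
The two pause conditions described above (a list of freeze dates and the oncall engineer's sleeping hours) could be checked before each push with something like the following minimal sketch. The freeze dates, the 22:00 to 07:00 sleeping window, and the function names are all hypothetical values chosen for illustration; they do not come from the book.

```python
from datetime import date, datetime, time

# Hypothetical change-freeze dates (assumption, for illustration only).
FREEZE_DATES = {date(2014, 12, 25), date(2015, 1, 1)}

# Hypothetical sleeping hours for the current oncall engineer, in that
# engineer's local time. The window wraps past midnight.
SLEEP_START = time(22, 0)
SLEEP_END = time(7, 0)


def in_sleeping_hours(t: time) -> bool:
    """Return True if t falls inside the sleeping window."""
    if SLEEP_START <= SLEEP_END:
        # Window does not cross midnight.
        return SLEEP_START <= t < SLEEP_END
    # Window crosses midnight, e.g. 22:00 to 07:00.
    return t >= SLEEP_START or t < SLEEP_END


def push_allowed(now: datetime) -> bool:
    """Return True if a deployment may start at the given local time."""
    if now.date() in FREEZE_DATES:
        return False  # Scheduled change freeze.
    if in_sleeping_hours(now.time()):
        return False  # Avoid waking the oncall engineer.
    return True
```

With several oncall teams, one possible extension is to keep a separate sleeping window per team and look up the window for whichever team is oncall at the time of the push.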

and what were the lessons learned. Demonstrate that you are using the experience to improve in the future. If possible, include human elements such as heroic efforts, unfortunate coincidences, and effective teamwork. You may also include what others can learn from this experience. It is important that such communication be authentic, admit failure, and sound like a human, not a press agent. Figure 14.1 is an example of good external communication. Notice that it is written in the first person,

larger an organization grows, the more likely that these dependencies will be found only through active testing.

15.4.2 Increasing Scope

Over time the tests can grow to include more teams. One can raise the bar for testing objectives, including riskier tests, live tests, and the removal of low-value tests. Today, Google’s DiRT process is possibly the largest such exercise in the world. By 2012 the number of teams involved had multiplied by 20, covering all SRE teams and nearly all services.

how it has improved resilience and maximized availability in “The Antifragile Organization” (Tseitlin 2013). Later, Steven Levy was allowed to observe Google’s annual DiRT process first-hand for an article he wrote for Wired magazine titled “Google Throws Open Doors to Its Top-Secret Data Center” (Levy 2012). After the 2012 U.S. presidential election, an article in The Atlantic magazine, “When the Nerds Go Marching In,” described the Game Day exercises conducted by the Obama for America campaign


