Learning from First Responders: When Your Systems Have to Work by Dylan Richard

Author: Dylan Richard
Language: eng
Format: mobi, epub, pdf
Tags: COMPUTERS / Web / General
Publisher: O’Reilly Media
Published: 2013-03-04T05:00:00+00:00


Chapter 5. Real Breaking

This time, it didn’t take four minutes to notice. Almost immediately, all of the applications failed. Game day had gone off-script and the engineers suddenly realized it. The API team had spent the previous two weeks ensuring that the software could handle a master dying by seamlessly failing into a read-only state. And yet, as soon as the master failed, our entire infrastructure went with it.

The master database failure was handled exactly as it should have been by the core API, but the identity service used the database directly (oh technical debt, you take so little time to be costly). Instead of the planned switchover to read-only, an immediate failure spread across all of our applications. This was about the point where the line between reality and FAILARP-ing (live action role playing) really started to blur. Systems failing in unexpected ways bring out the same visceral mixture of fear, anger, and futility that a real incident does.
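The read-only degradation the API team had built can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's actual code: the names `Node`, `ReadOnlyFallbackStore`, `DatabaseDown`, and `ReadOnlyModeError` are hypothetical, and in-memory dictionaries stand in for the master and a replicant.

```python
class DatabaseDown(Exception):
    """Raised by a node that cannot be reached (hypothetical)."""


class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the master is unavailable."""


class Node:
    """Hypothetical in-memory stand-in for a database node."""

    def __init__(self, data=None):
        self.data = dict(data or {})
        self.up = True

    def get(self, key):
        if not self.up:
            raise DatabaseDown("node unreachable")
        return self.data.get(key)

    def put(self, key, value):
        if not self.up:
            raise DatabaseDown("node unreachable")
        self.data[key] = value


class ReadOnlyFallbackStore:
    """Serve reads from the replicant and reject writes when the master is down."""

    def __init__(self, master, replicant):
        self.master = master
        self.replicant = replicant

    def read(self, key):
        try:
            return self.master.get(key)
        except DatabaseDown:
            # Master is gone: keep serving (possibly stale) reads.
            return self.replicant.get(key)

    def write(self, key, value):
        try:
            self.master.put(key, value)
        except DatabaseDown as exc:
            # Fail the write with a clear error instead of taking the app down.
            raise ReadOnlyModeError("master unavailable; serving reads only") from exc
```

Any service that talks to the master directly, as the identity service did, bypasses a wrapper like this and sees the raw failure instead.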

In the backchannel, we decided to bring the master back up. Identity was such a core service that there was no point in testing anything beyond what we had tested with the master down. We noted that certain clients would need to handle a downed identity server in a better way, and that our identity server would need to grow far more fault tolerant in the immediate future. The feeling of preparedness was gone.

Rather than just fix the permission group on the master, we promoted a replicant, both to see in a controlled environment what that looked like and to better simulate reality: sometimes databases that die stay dead. As it turned out, if we hadn’t tested it that way, we wouldn’t have known that promoting a replicant on RDS breaks replication immediately and leaves you with just a single master, which you then need to stress by taking a backup to create a new replicant. That information turns out to be incredibly important to have: if you know this occurs, you can disable endpoints to shed load and make sure that you don’t immediately kill your shiny new master.

We went on to simulate several other scenarios that mimicked real-life failures we had seen before: replicants flapping between available and unavailable (to simulate this we would just revoke access to them, then reinstate it), a broken caching layer, full disks (simulated by revoking write privileges), and some human error. The easiest human error simulation is to just not do a piece of a process but say that you did: for example, starting replication.
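The flapping-replicant scenario, revoking access and then reinstating it, can be mimicked in a small sketch like the one below. Everything here is an assumption made for illustration: `Replicant`, `revoked`, `flap`, and `probe` are hypothetical names, and a boolean flag stands in for real database permissions.

```python
from contextlib import contextmanager


class AccessRevoked(Exception):
    """Raised when a client queries a replicant it can no longer reach."""


class Replicant:
    """Hypothetical stand-in for a read replica guarded by an access flag."""

    def __init__(self, name):
        self.name = name
        self.access_granted = True

    def query(self, sql):
        if not self.access_granted:
            raise AccessRevoked(f"{self.name}: access revoked")
        return f"{self.name}: ok"


@contextmanager
def revoked(replicant):
    """Revoke access for the duration of the block, then always reinstate it."""
    replicant.access_granted = False
    try:
        yield replicant
    finally:
        replicant.access_granted = True


def probe(replicant):
    """Report what a client would observe right now."""
    try:
        replicant.query("SELECT 1")
        return "up"
    except AccessRevoked:
        return "down"


def flap(replicant, cycles):
    """Revoke and reinstate access `cycles` times, recording each observation."""
    observations = []
    for _ in range(cycles):
        with revoked(replicant):
            observations.append(probe(replicant))
        observations.append(probe(replicant))
    return observations
```

Putting the reinstatement in a `finally` clause means the simulated outage ends even if the probe itself raises, which matters when the exercise runs against shared infrastructure.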

After a four-to-five-hour torrent of breaking things, we had worked through our list of things to test, and everyone was exhausted. In the game day incident response channel, we told the team that their nightmare was over, thanked everyone, and asked them to gather their notes and compile them per workstream. At the same time, in the backchannel, we decided to test the fallbacks. Almost all of our fallbacks for database failure relied on writing to an SQS queue for later processing.
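A queue-backed fallback of the kind described here might look roughly like this. It is a sketch under stated assumptions rather than the team's implementation: Python's in-process `queue.Queue` stands in for SQS so the example runs without AWS, and `FallbackWriter`, `DatabaseDown`, and the `db_write` callable are hypothetical names.

```python
import queue


class DatabaseDown(Exception):
    """Raised by the (hypothetical) database write function when it fails."""


class FallbackWriter:
    """Write to the database, or queue the record for later if the write fails."""

    def __init__(self, db_write, fallback_queue):
        self.db_write = db_write              # callable that persists one record
        self.fallback_queue = fallback_queue  # stand-in for an SQS queue

    def write(self, record):
        try:
            self.db_write(record)
            return "written"
        except DatabaseDown:
            self.fallback_queue.put(record)
            return "queued"

    def drain(self):
        """Replay queued records; stop at the first failure and requeue it."""
        replayed = 0
        while True:
            try:
                record = self.fallback_queue.get_nowait()
            except queue.Empty:
                return replayed
            try:
                self.db_write(record)
            except DatabaseDown:
                # Requeued at the back, so replay order is not guaranteed.
                self.fallback_queue.put(record)
                return replayed
            replayed += 1
```

The fallback is only as reliable as the queue and the replay step, which is why it needs its own test rather than being assumed to work.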


