As we discussed in Clouds and DRPs, the best-laid plans are often disrupted in completely unpredictable ways by humans and nature. If you doubt it, consider the incidents discussed below:
2007 - The Rackspace outage: A large SUV was driving down the street near Rackspace’s Grapevine, Texas data center when the driver, a diabetic, passed out at the wheel. The car continued on, jumped a side curb, shot up a grassy mound that launched it like a ramp over several parked cars - like one of those daredevil car tricks - and crashed into a voltage step-down transformer serving Rackspace’s data center. Who could have possibly predicted this? But wait - there’s more. Rackspace’s business continuity systems, which were programmed to adjust for power outages, kicked in. When the transformer was lost, the uninterruptible power supplies took over, keeping the servers running until the emergency diesel generators could come online. Meanwhile, the building’s chillers, which provided cold air to cool the servers, shut down temporarily until the generators became operational. But at least everything was still running. That is, until the firefighters on the scene ordered the electric company to shut down all power in the grid in order to extricate the driver from the vehicle, which was still enmeshed in the transformer. To keep the servers running under emergency power, Rackspace engineers had to stop the chillers from restarting and shut down overheating servers one at a time. Service was slow, but a complete shutdown was avoided. [Information derived from Rackspace CTO John Engates as related in an excellent article by Charles Babcock in Information Week (10/15/12).]

So here we have an unexpected medical emergency compounded by a runaway, driverless car, met by in-place emergency procedures and then derailed by the on-scene decisions of firefighters and the power company. How could any of this have been predicted, even by planners of redundancy and emergency recovery? No way.
Most plans target single points of failure and cannot reasonably consider long-shot occurrences compounded by human failure. But they’re trying - Netflix has developed a program called “Chaos Monkey,” a kind of white-hat saboteur that randomly disables production instances in order to surface unexpected failure modes before real disasters do. Heuristics may narrow the gap, but Mother Nature will always throw a curve at the best-laid plans.
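The core idea behind Chaos Monkey - deliberately killing a random instance and checking that the service survives - can be sketched in a few lines. This is a hypothetical illustration of the fault-injection concept, not Netflix’s actual tool; the fleet names and the single-victim policy are assumptions for the example:

```python
import random

def chaos_monkey_step(instances: list, rng: random.Random) -> list:
    """Terminate one randomly chosen instance (illustrative sketch of
    the fault-injection idea). Returns the surviving fleet."""
    if not instances:
        return instances
    victim = rng.choice(instances)
    return [i for i in instances if i != victim]

# Hypothetical three-node fleet; after one chaos step, two nodes remain.
fleet = ["web-1", "web-2", "web-3"]
survivors = chaos_monkey_step(fleet, random.Random(0))
# If the service cannot answer requests with one node gone, the
# redundancy plan just failed a cheap, controlled test - far better
# than failing it during a real outage.
```

The point of running such a step continuously in production, rather than in a lab, is that it exercises the real failover paths under real load.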
2011 - Amazon Easter Weekend Failure: Even built-in redundancies can fail. The standard cloud practice of triple redundancy means that, if hardware fails, there are not one but two additional copies of the data from which a full system recovery can quickly be made. In theory, that is. On Easter weekend 2011, Amazon’s data center experienced an outage at the connection of a trunk line to a backup network. The backup recovery protocol went into effect. Nevertheless, service to thousands of customers was dropped, causing inconvenience and even financial damage to customers. How could this happen? Because the redundancy system was intended as a spot recovery mechanism, protecting one system at a time. It could not cope with a situation where hundreds or thousands of systems lost data at the same time, and every one of them started restoring from backups, all at once. The resulting cascade - a “remirroring storm” - set off the larger failure.
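The arithmetic behind a remirroring storm is simple and sobering. Here is a back-of-envelope model - the volume sizes and link capacity are made-up illustrative numbers, not Amazon’s actual figures:

```python
def recovery_time_hours(nodes_remirroring: int, gb_per_node: float,
                        shared_bandwidth_gbps: float) -> float:
    """Total data to re-copy divided by the shared network capacity.
    One node finishes in minutes; thousands at once saturate the link."""
    total_gigabits = nodes_remirroring * gb_per_node * 8  # GB -> Gb
    return total_gigabits / (shared_bandwidth_gbps * 3600)

# One volume re-mirroring 100 GB over a shared 10 Gbps link: ~1.3 minutes.
one_node = recovery_time_hours(1, 100, 10)

# Ten thousand volumes re-mirroring simultaneously over the same link:
# hundreds of hours, during which in-progress mirrors look "stuck" and
# may themselves be declared failed - compounding the storm.
mass_failure = recovery_time_hours(10_000, 100, 10)
```

The mechanism that makes a single failure recoverable in minutes becomes, when triggered everywhere at once, the very thing that prolongs the outage.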
2012 - Azure Leap Day Failure: Virtual machines at the Microsoft data center require certificates in order to start at the server farm. Because the software failed to account for the leap day - February 29, 2012 - the certificates were not issued. The server’s host agent, however, interpreted this not as a software problem but as a hardware problem, and reported it to the cloud’s cluster controller, which then ordered a move to other VM hardware. That hardware, of course, also couldn’t start the VMs and triggered the same incorrect order on still more hardware, and so on and so forth, until a full-fledged crash was in progress.
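This class of bug typically arises from a naive shortcut: computing a one-year certificate expiry by simply bumping the year, which produces the nonexistent date February 29, 2013. The sketch below illustrates the failure mode and one defensive fix; it is an assumption-laden illustration, not Microsoft’s actual code:

```python
from datetime import date

def naive_valid_to(issued: date) -> date:
    """Compute a one-year expiry by bumping the year.
    Blows up with ValueError when `issued` is February 29."""
    return issued.replace(year=issued.year + 1)

def safe_valid_to(issued: date) -> date:
    """Defensive variant: when the bumped date doesn't exist
    (Feb 29 -> non-leap year), fall back to February 28."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)

# naive_valid_to(date(2012, 2, 29))  # raises ValueError
# safe_valid_to(date(2012, 2, 29))   # returns a valid date in Feb 2013
```

The deeper lesson from the incident is not the date math itself but the misdiagnosis: a deterministic software failure, reported as a hardware fault, will faithfully reproduce itself on every “replacement” machine.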
Other Unexpected Failures: To summarize the discussion on the DRP page of this site: even if you’ve considered and expected every imaginable disaster (weather, fire, theft, power, software and hardware failures, intrusions, viruses, terrorist acts, EMPs, for example), a localized and unpredictable event could still neutralize or cripple your system (just like the Rackspace outage described above). The 1998 scaffolding collapse during construction of the 40-story Conde Nast building on West 43rd Street in New York turned the immediate area into a disaster zone and paralyzed the entire Times Square area for more than a week. In the wake of Hurricane Katrina, some secondary or remote storage systems were damaged along with the primary systems they were meant to back up. Down here in Florida, new construction constantly severs cable, telephone, and electrical lines, and conduits continuously deteriorate due to water and condensation in the pipes.
Just be aware that no matter what you have planned for, humans or Mother Nature will throw a new curve at you. The good news is that, just as heuristics evolve in anti-virus software, they also evolve in failure identification and recovery, and will become better and better over time.