Amazon Web Services Outage

AWS outage: Our bad, admits Amazon, albeit vaguely

It turns out the widespread December 7 AWS outage was caused by Amazon's own software, and its response was hampered by … its own software. What does Amazon's post-mortem actually tell us?

The December 7 AWS outage that hobbled Amazon's own operations and took a wide range of its clients offline now has an official, if vague, explanation: It was our fault. 

More specifically, AWS' own internal software caused the snafu: an automated scaling error on AWS' primary network triggered "unexpected behaviour" from a large number of clients on its internal network, which AWS uses to run foundational services like monitoring, internal DNS and authorization. 

"Because of the importance of these services in this internal network, we connect this network with multiple geographically isolated networking devices and scale the capacity of this network significantly to ensure high availability of this network connection," AWS said. Unfortunately, one of those scaling services, which AWS said had been in production for many years without issue, caused a massive surge in connection activity that overwhelmed the devices managing communication between AWS' internal and external networks at 7:30 a.m. PST. 
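The post-mortem doesn't describe AWS' client-side fix, but the failure mode it sketches — a large fleet of clients reconnecting at once and overwhelming shared network devices — is the classic "thundering herd" problem, commonly mitigated with jittered exponential backoff. A minimal sketch (function name and parameters are illustrative, not from AWS):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff.

    Each retry waits a random interval in [0, min(cap, base * 2**attempt)],
    so a fleet of clients that all lost their connections at the same
    moment spreads its reconnection attempts over time instead of
    hammering the network in lockstep.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Delays grow (on average) with each failed attempt, but are
# randomized so no two clients retry on the same schedule.
delays = [backoff_delay(a) for a in range(6)]
```

The jitter is the important part: plain exponential backoff still synchronizes clients that failed together, while randomizing within the window breaks them apart.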

To make matters worse, the surge in traffic caused a massive latency spike that affected AWS' internal monitoring dashboards, making it impossible to use the very systems designed to find the source of the congestion. Instead, AWS engineers had to turn to log files, which showed an elevation in internal DNS errors. Their solution was to move DNS traffic away from congested network paths, which resolved the DNS errors and restored some, but not all, availability.

Additional remediation strategies — further isolating troubled portions of the network, bringing new capacity online and the like — also progressed slowly, AWS said. Latency in its monitoring software made tracking changes difficult, and its internal deployment systems were affected too, making it harder to push fixes. To complicate matters, not all AWS customers were taken down by the outage, so the team moved "extremely deliberately while making changes to avoid impacting functioning workloads," AWS said. It took time, but by 2:22 p.m. PST, AWS said all of its network devices had fully recovered. 

AWS has disabled the scaling activities that caused the event and said it will not bring them back online until all remediations have been deployed, which it expects to happen over the next two weeks. 

What to take away from AWS' statement on its outage
As is often the case with these sorts of statements, there's a lot of unpacking to do, particularly when AWS has been so vague, said Forrester senior analyst Brent Ellis. "The issue I see is that the description is not specific enough to give customers the ability to plan around this particular failure. Not everyone hosted on AWS failed, it would be useful to understand what those businesses were doing differently so others could follow suit. Right now, customers have to trust AWS to rectify the situation," Ellis said. 

Ellis also said that Amazon's statement itself gives cause for alarm for reasons other than just how the outage happened: It indicates that the interaction between AWS' external and internal networks may be problematic if it can cause such widespread issues. 

That doesn't mean the cloud is a bad bet, Ellis said: He still maintains that it's a "very good place to move business technology." That said, Ellis returns to a refrain that has resurfaced with every recent cloud outage: risk.

"Generally speaking [cloud providers] are still more redundant, secure and reliable than most enterprises' internal infrastructure, but it is not without risk," Ellis said. His personal advice to anyone worried about the cloud is to diversify, mitigate and inquire. "If you can scale a service so it runs across more than one cloud, or cloud + on-prem, then do it. If you can't, negotiate shared business risk, inquire on [cloud provider] practices and negotiate to make those practices align with your internal resilience needs," Ellis said. 
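In practice, Ellis' "diversify" advice boils down to an ordered failover across providers: route to the preferred platform when it's healthy, and fall through to the next when it isn't. A minimal sketch, with hypothetical endpoint names and a stubbed health check:

```python
def first_healthy(endpoints, is_healthy):
    """Return the first endpoint whose health check passes.

    Walking an ordered preference list (primary cloud, secondary
    cloud, on-prem) is the simplest form of the diversification
    described above: if the preferred provider is down, traffic
    steers to the next one instead of failing outright.
    """
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    raise RuntimeError("no healthy endpoint available")

# Hypothetical deployment: the primary cloud is down, so the
# selection falls through to the secondary.
endpoints = ["aws.example.internal", "gcp.example.internal", "onprem.example.internal"]
down = {"aws.example.internal"}
chosen = first_healthy(endpoints, lambda ep: ep not in down)
# chosen == "gcp.example.internal"
```

Real deployments would put this logic behind DNS failover or a global load balancer rather than in application code, but the decision being made is the same.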

Ellis describes planning for cloud resiliency as similar to how businesses design a secondary data center outside a disaster's radius to ensure continuity. The cloud takes care of all of that hassle for you, Ellis said, but in turn a single human or automation error is magnified across much larger swathes of that company's infrastructure. 

If the cloud is going to stay successful, Ellis said that cloud providers need to standardize in some way to make data easier to move, workloads easier to duplicate, and redundancy simpler. The goal, he said, would be for a situation much like that when traveling internationally: You need an adapter to fit a different sort of socket, but the underlying operating principles are shared, so all you'll need is a virtual adapter to move from Cloud A to Cloud B. 

Sid Nag, Gartner VP of cloud services and technologies, agrees with the interoperability ideal, especially in a world where, he said, hyperscale providers are becoming "too big to fail." 

"More and more of our day-to-day lives are dependent on the cloud industry; cloud providers should work out an arrangement where they back each other up," Nag said. Like Ellis' recommendation, the ultimate goal seems to be a cloud market that recognizes its essential utility to modern society and works on becoming less competitive and less prone to failure. 

"That is what cloud utility computing will have to become. Once it does, building the services to move a workload when there is an issue at one cloud [provider] will become easier," Ellis said. 

Source (Brandon Vigliarolo – 13 December 2021 – TechRepublic)