AWS outage: Who’s really to blame?


Amazon’s S3 service outage in the US-East-1 Region yesterday shed light on just how much of the internet depends on that single service in a single region. Dozens of sites that stored assets there failed; sites that stored snapshots of volumes they used to launch instances failed; sites that depended on AWS S3 services in that region failed. Organizations across the internet suddenly learned they had a single point of failure they didn’t previously realize existed.

As the dust clears, users are looking for ways to mitigate that vulnerability. Unfortunately, much of the chatter on Twitter and other online venues indicates that many people are missing the point: They’re treating this as a case of having chosen the wrong service or service provider, rather than understanding that the underlying problem was that they have a single point of failure.

The reaction is similar to the one that followed the storms that took down Amazon Web Services’ Sydney Availability Zone last summer. Rather than say, “Perhaps I should have designed for multi-zone, multi-region, or multi-cloud,” it’s simpler to shake a fist in the air and blame a single, unreliable server or service provider. Acknowledging that you played a part in your own failure, because you deployed an application in a single zone in a single region of a single cloud provider, is a tough pill to swallow.

Amazon has a phrase it uses to describe the responsibility for security on AWS – the Shared Responsibility Model. Under this model, Amazon is responsible for securing the underlying infrastructure, but you’re responsible for securing what you run on that platform. The same division applies to responsibility for durability and resilience.

All of the major public cloud platforms offer tools to build resilient, durable application infrastructure – usually much more resilient than you can build in your own data center. They’re responsible for giving you those components and making sure they adhere to their SLAs. They also provide guidance on how to use the tools to ensure your applications can survive service failures, and tell you that you should plan your systems accordingly.

Things happen: Systems fail, superstorms hit an area, badly configured software updates crash a service. If your systems are integral to your business, you need to balance the risks and potential impacts of downtime against the costs of protecting your business from them. If you’re just starting out and have only a small number of users, you may be willing to assume some risk. If you’re a major news website with tens of thousands of readers an hour, you may not want to leave yourself vulnerable to a failure of a single service in a single region.

The cloud increases, rather than eliminates, the need for good system architecture.

AWS S3, Google Cloud Storage, and Azure Blob Storage all offer ways to build redundancy against the failure of storage in a region, or against the loss of your ability to reach that storage. There are alternatives that allow you to extend that redundancy to span clouds as well. In most cases, if you’re hosting your own application on a public cloud, there are ways to make it more durable and resilient.
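As a rough illustration of what removing a single point of failure looks like at the application level, here is a minimal Python sketch of a read path that falls back to a second copy of the data when the first location is unavailable. Everything in it is hypothetical: the `StorageUnavailable` exception, the `fetch_with_failover` helper, and the in-memory fake backends stand in for whatever storage clients you actually use, and the sketch assumes the same objects have already been replicated to each location.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class StorageUnavailable(Exception):
    """Hypothetical error a backend raises when it cannot serve a request."""


# A fetcher is any callable that takes an object key and returns its bytes.
Fetcher = Callable[[str], bytes]


def fetch_with_failover(
    key: str, backends: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, bytes]:
    """Try each backend in order and return (backend_name, data) from the first success."""
    errors: List[str] = []
    for name, fetch in backends:
        try:
            return name, fetch(key)
        except StorageUnavailable as exc:
            # Remember why this location failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise StorageUnavailable("all backends failed: " + "; ".join(errors))


def make_fake_backend(objects: Dict[str, bytes], healthy: bool = True) -> Fetcher:
    """Build an in-memory stand-in for a real storage client (illustration only)."""

    def fetch(key: str) -> bytes:
        if not healthy:
            raise StorageUnavailable("region unreachable")
        return objects[key]

    return fetch


if __name__ == "__main__":
    assets = {"logo.png": b"PNG"}
    primary = make_fake_backend(assets, healthy=False)  # simulated regional outage
    secondary = make_fake_backend(assets)               # replica somewhere else
    served_by, data = fetch_with_failover(
        "logo.png", [("primary", primary), ("secondary", secondary)]
    )
    print(served_by, len(data))
```

The point of the sketch is the shape, not the details: the application knows about more than one location for its data, and an outage in one of them degrades into a slower read rather than a failed site. Keeping the copies in sync is a separate problem that the providers’ replication features are meant to address.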

Of course, those options all cost more money, so you’ll need to decide what the appropriate balance is for your application and business. The major cloud providers will generally give you as much redundancy as you’re willing to pay for. It’s up to you to decide how much you need, and work with a good architect to ensure you’re getting your money’s worth.
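To make that trade-off a little more concrete, here is a back-of-the-envelope Python sketch of how availability changes as you add redundant copies. It rests on a strong and optimistic assumption, namely that each copy fails independently of the others, and the function names and the 99% figure are made up for illustration rather than taken from any provider’s SLA.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_copy: float, copies: int) -> float:
    """Availability of a service that is up whenever at least one copy is up.

    Assumes copies fail independently, which real regions and clouds only approximate.
    """
    if not 0.0 <= per_copy <= 1.0:
        raise ValueError("per_copy must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - per_copy) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        availability = combined_availability(0.99, copies)  # hypothetical 99% per copy
        hours = expected_downtime_hours(availability)
        print(f"{copies} copies: {availability:.6f} available, {hours:.2f} h/year down")
```

Comparing the downtime you avoid with each extra copy against what that copy costs to run is one simple way to frame how much redundancy is worth paying for.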

None of this diminishes the importance of the S3 outage we saw on Tuesday. But if your wake-up call is to just move your single point of failure from S3 to Google Cloud Storage or Azure Blob Storage, you’ve learned the wrong lesson.

If you need help reviewing your AWS application architecture or understanding how much redundancy you have, reach out to us. We have engineers, architects, and services that can help you get the durability you need without paying more than you need to get it.

Update: Amazon has provided an explanation for the outage on Tuesday. Read it here.
