Why Amazon Went Down, and Why It Matters

Alistair Croll | Friday, June 6, 2008 | 2:32 PM PT | 46 comments

Amazon.com’s U.S. retail site became unavailable around 10:25 AM PST, and now appears to be back up. Amazon’s not naming names — all that director of strategic communications Craig Berman would say was that: “Amazon’s systems are very complex and on rare occasions, despite our best efforts, they may experience problems.”

Berman did confirm, however, that neither Amazon Web Services nor international sites were affected.

So what happened? Let’s look at the facts.

  • Traffic to https://www.amazon.com was getting there. So DNS was configured properly to send traffic to Amazon’s data centers. Global server load balancing (GSLB) is the first line of defense when a data center goes off the air. Either GSLB didn’t detect that the main data center was down, or there was no spare to which it could send visitors.
  • When traffic hit the data center, the load balancer wasn’t redirecting it. This is the second line of defense, designed to catch visitors who weren’t sent elsewhere by GSLB.
  • If some of the servers died, the load balancer should have taken them out of rotation. Either it didn’t detect the error, or all the servers were out. This is the third line of defense.
  • Most companies have an “apology page” that the load balancer serves when all servers are down. This is the fourth line of defense, and it didn’t work either.
  • The HTTP 1.1 message users saw shows something that “speaks” HTTP was on the other end. So this probably wasn’t a router or firewall.
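
As a rough illustration of how those lines of defense stack up, here is a toy Python model. It is a sketch under stated assumptions, not a description of Amazon's setup: the class and function names (`DataCenter`, `gslb_pick`, `handle_request`) are made up for this example, and the second line of defense (the load balancer redirecting traffic elsewhere) is folded into the GSLB step for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class DataCenter:
    name: str
    servers: List[Server] = field(default_factory=list)
    # None models an apology page that was never configured (hypothetical field).
    apology_page: Optional[str] = None

    def is_up(self) -> bool:
        return any(s.healthy for s in self.servers)


def gslb_pick(data_centers: List[DataCenter]) -> DataCenter:
    """First line: send visitors to a data center that still has healthy servers."""
    for dc in data_centers:
        if dc.is_up():
            return dc
    # No spare available: traffic still lands on the primary.
    return data_centers[0]


def handle_request(data_centers: List[DataCenter]) -> str:
    dc = gslb_pick(data_centers)
    # Third line: take dead servers out of rotation.
    in_rotation = [s for s in dc.servers if s.healthy]
    if in_rotation:
        return f"200 OK from {in_rotation[0].name} in {dc.name}"
    # Fourth line: serve the apology page, but only if one was configured.
    if dc.apology_page is not None:
        return f"200 OK apology page: {dc.apology_page}"
    # Every line of defense failed; the visitor sees a bare HTTP error.
    return "503 Service Unavailable"
```

With one data center, no healthy servers, and no apology page, `handle_request` falls all the way through to the bare 503, which is roughly the behavior visitors appeared to see.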

This sort of thing is usually caused by a misconfigured HTTP service on the load balancer. But a change like that would normally be made late at night, detected, and rolled back. It could also be caused by a content delivery network (CDN) failing to retrieve the home page properly.

So my money’s on an application front end (AFE) or CDN problem. But as Berman notes, Amazon’s store is a complex application and much of its infrastructure doesn’t follow “normal” data center design. So only time (and hopefully Amazon) will tell.

Site operators can learn from this: Look into GSLB, and make sure you have geographically distributed data centers (possibly through AWS Availability Zones). It’s another sign we can’t take operations for granted, even in the cloud.

8 trackbacks so far

June 6th, 2008
8:52 PM PT

[...] Pie in the sky guesses as to why Amazon went down - but since Om is into Infrastructure now - he runs these sort of speculative posts.  Its actually written by someone named Allistair Croll. [...]

June 7th, 2008
4:26 AM PT

[...] Amazon went down today . . . a rare occurrence. Gigaom explains why it may have happened and what it means to you . . . read it here. [...]

June 8th, 2008
12:50 AM PT

[...] (Alistair Croll/GigaOM) Category: Techmeme | source article link Alistair Croll / GigaOM: Why Amazon Went Down, and Why It Matters  —  Amazon.com’s U.S. retail site became unavailable around 10:25 AM PST, and [...]

June 9th, 2008
12:45 PM PT

[...] have what??? Posted by: The ADC in Events, Opinion Alistair Croll over at Gigaom had an interesting dissection of Amazon’s recent outage and an interesting deduction based on the facts regarding what went [...]

June 10th, 2008
4:18 PM PT

[...] Friday, Amazon’s U.S. site went off the air, and later some of its other properties were unavailable. Lots of folks who wouldn’t let me quote [...]

June 11th, 2008
9:20 AM PT

[...] downtime incidents crossed our paths recently that we thought deserved comment. You probably have read about the first (Amazon), but maybe missed the second (Southern Company Nuclear Power [...]

June 13th, 2008
7:49 AM PT

[...] light of Amazon’s latest downtime issues, Gigaom explains why Amazon went down and why it matters. In a thorough explanation, Gigaom bets the problem to be with the CDN or AFE. The moral of their [...]

June 27th, 2008
8:48 AM PT

[...] isn’t the first time load balancers have been implicated in an outage at Amazon. At O’Reilly’s Velocity conference, conference co-chair Jesse Robbins talked about a [...]

38 comments so far

June 6th, 2008
2:46 PM PT
Satya Narayan Dash said:

Excellent and timely piece Om.

Not relevant to your blog, but wanted to put. I hate your search. It is really bad. When I search for “GigaOM Show”, it does not even come in the first 10 links.

One more which I would like to point out - can not go back using the back button (I use Opera) and I do not think you are using AJAX.

Please do something on it, at least on search.

Satya

June 6th, 2008
2:48 PM PT
Satya said:

Sorry. It is written by Alistair and thanks Alistair for the timely piece.

Other comments: I really want the search to be better.

Satya

June 6th, 2008
3:23 PM PT
Dan said:

I humbly suggest the more accurate title “I Have No Idea Why Amazon Went Down, Either.”

June 6th, 2008
3:25 PM PT
Liz said:

Oh, my, I hit my limit. Translation, please?

June 6th, 2008
3:26 PM PT

Frankly, I wish I’d thought to grab the HTTP headers at the time, which would have told me a lot more.
As a friend of mine points out, the thing that’s answering this “speaks” HTTP. Which means it’s likely a proxy of some sort — it’s conversant, but the thing behind it isn’t. And there were multiple DNS entries in the DNS response, so whatever happened blanketed several sites.
This could be the caching layer, the security layer, or anything else designed to sit between the Internet and the application servers themselves.

June 6th, 2008
3:47 PM PT
Tbone said:

More than likely a human error, internal misconfig… as you can balance traffic prior to it hitting the network…

(link)

June 6th, 2008
3:49 PM PT

Oh, and @Liz: Some kind of proxy service died that likely lives in the data center, in front of the app server. That could be an HTTP-aware firewall or something else, likely a custom Amazon thing.

The bigger point here is that there’s always a point of failure, and web systems are a complex mixture of technologies.

June 6th, 2008
4:08 PM PT
Liz said:

Thanks, Alistair. I got lost somewhere there in the middle of Om’s analysis.

June 6th, 2008
4:45 PM PT

Time for all web services to be 100% transparent on uptime issues. Trust.salesforce.com led the way, and, interestingly, the new Acrobat.com also has a health.acrobat.com service …

June 6th, 2008
5:47 PM PT
ryan said:

Thanks for this uninformed piece. I forwarded it to my friends for a laugh.

To anyone the least bit knowledgeable, it’s pretty obvious that whatever happened was a huge deal. It wasn’t as simple as “some front ends went down”. It was truly an epic fail.

June 6th, 2008
7:12 PM PT
Satya said:

I do not agree with the “highly complex” claim Amazon is making. Replication or load balancing is not rocket science or something that deserves a Nobel prize. Tomcat gives it to you for free, though the transaction volume will be low without code modification.

Security layer is not the case, you would have got HTTPS headers. Neither is caching layer.

What is most surprising is that the apology page did not work, either! It seems not to be a software failure, but rather a complete blackout of hardware, which is stunning.

June 6th, 2008
8:10 PM PT
john said:

I think people expect too much from web services. It makes sense to report it as news, but Google News has some 200+ redundant stories about the site going down and still no one has an answer. I’m not going to worry too much about someone not being able to buy a Kindle for a whopping 3 hours!

June 6th, 2008
10:03 PM PT
lodestar said:

This speculation is ridiculous; admitting Amazon’s system is complex and then continuing to guess at causes of the outage is a fool’s game. Without information, which this ‘article’ has none of, you have no idea what did or did not happen.

June 6th, 2008
11:01 PM PT
mel said:

it went down because of this:
(link)

June 6th, 2008
11:11 PM PT
peter b. said:

Wow. Are you sure you want to stick to being a journalist? You seem to have a calling (and perhaps aspire?) to be an operations guy. It would’ve been much more interesting if you had taken the business impact angle.

By the way, have you ever run an operation of such size and complexity or do you just enjoy being a pundit? A very shallow and disappointing piece, I must say.

June 6th, 2008
11:59 PM PT
Seattle-ite said:

Yea, dude, Amazon is a mess. There is not a simple reason, it is literally 5-8 year old custom written perl and some other programming language I forgot.

Every developer who works there has to carry a pager around during a rotation…what does that tell you?

It is actually more surprising that Amazon only goes down as infrequently as it does.

Ask anyone who has worked there, under the hood, it is hardly a pillar of robust well architected software. They got it to work and they pretty much just tack shit on at this point.

June 7th, 2008
5:12 AM PT
Liz said:

Shoot! I meant Alistair’s analysis, not Om’s. I must remember to check the byline.

June 7th, 2008
5:20 AM PT
fehwalker said:

It’s not overly interesting that Amazon’s store was down from an online retailer perspective. It *is* interesting that the people selling EC2, S3, etc., can’t keep their core retail site up. More and more online startups are just thin wrappers around Amazon services; if they can’t keep one of their own sites up, why should I have confidence they can keep mine up? Yeah, they’re probably different internal groups, shoemaker’s children, blah blah. That’s their problem, not mine.

June 7th, 2008
6:45 AM PT
Om Malik said:

@ satya,

since we use the wordpress.com platform we are kind of limited in the search abilities by them. However, we are working with them to improve and streamline our search feature and soon will give you ability to search across the network with proper headlines, synopsis and date etc. hang in there buddy.

@Liz, just to clarify, Alistair wrote the story, I didn’t.

June 7th, 2008
6:52 AM PT
geo said:

word on the street: DOS Attack

June 7th, 2008
7:48 AM PT
Satya said:

Thanks Om. Appreciate your response.

You know what - for one content search wrt to GigaOM Show, I had to google and point to your website.

June 7th, 2008
8:08 AM PT
Satya said:

Missed an important one.

I am with Alistair’s article and initial analysis. Your website informed me first on this. And it will help me to influence my circle on decisions regarding the importance of robust software infrastructure. Thanks again.

@Liz, none of my comments were aimed at you or anyone in person.

It is this kind of news, comments and analysis that makes software so lovable and exciting. And maybe that is also the reason your website is loved so much.

Satya

June 7th, 2008
9:53 AM PT

@ryan: Sorry you felt that way. Glad you got a laugh, then.

@mel: I doubt it’s an outage due to a single product, given that they handle peak loads at Christmas and so on, although it’s possible. Remember Amazon got into the AWS business to use up excess capacity it had from such peaks.
Further investigation into the custom headers on the site suggests there’s a proxy layer that wasn’t working; what’s strange is that the layer went down across all IPs (which, admittedly, seem to go to the same router).

@peterb: From a business standpoint it looks like lots of people have already estimated the cost of downtime based on annual revenues. This analysis ignores two things: The peaks and valleys of purchasing patterns, and the fact that many of those buyers will simply return later. And yeah, I’ve helped run some big sites, FWIW, but most were standard three-tiered environments.

@Geo: Websense and Narus are saying it’s probably not DDOS. The fact that traffic was working (albeit with the wrong HTTP contents) supports this to a degree.

June 7th, 2008
10:53 AM PT
john adams said:

No load balancer that I know of has a configurable “Apology page” when all real servers in the pool are unreachable. To the load balancer, the apology page would be seen as another active server and added to the rotation along with the normal httpd servers. Most of the time maintenance pages are manually brought up and down by the administrators.

You’ve also left out an important, but interesting failure mode in your analysis. There’s a good chance that users were directed (through GSLB and/or standard load balancing) to a set of nodes which had functioning HTTP but a non-functional back end. This is a very difficult situation for the load balancers to detect, as a server is answering the phone but then doing nothing once the “call” has been completed.

Too bad for Amazon, really. People need to take Internet operations far more seriously and understand that Amazon probably shares many of the same problems that Twitter is having right now: Rapid growth from increased demand, scaling, and redundancy.

June 7th, 2008
2:18 PM PT
zulubanshee said:

Wasn’t amazon hit by the massive global botnet attack yesterday? My boss was talking about it. Youtube went down yesterday as well. Ditto imdb.

June 7th, 2008
3:12 PM PT

@john: You’re spot on about the detection, though: Administrators who do a simple HTTP up/down check on their load balancers, rather than looking for known strings in the page, wind up having “valid” but broken pages served to the outside world.
BTW I know lots of sites that have a policy-based apology rule (using something like F5’s iControl) but as you point out, that only works when the load balancer knows something’s awry.
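
To make the up/down versus known-string distinction concrete, here is a minimal Python sketch. The function names and the marker string are illustrative assumptions, not settings from any particular load balancer; real products express this as a monitor or health-check configuration rather than code.

```python
import urllib.error
import urllib.request


def simple_check(status: int) -> bool:
    """Naive up/down check: any 200 response counts as healthy."""
    return status == 200


def content_check(status: int, body: str, marker: str) -> bool:
    """Stricter check: the page must also contain a string we expect to see."""
    return status == 200 and marker in body


def probe(url: str, marker: str, timeout: float = 5.0) -> bool:
    """Fetch a URL and apply the content check (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return content_check(resp.status, body, marker)
    except (urllib.error.URLError, OSError):
        # Connection failures and HTTP error statuses both count as unhealthy.
        return False
```

A front end that answers 200 with an empty or broken page passes `simple_check` but fails `content_check`, which is the "answering the phone but doing nothing" case John describes.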

June 7th, 2008
9:46 PM PT
Games Lord said:

we use (link) they provide us with both global dns and dns load balancing, as well as content cache and site redirection. there are other companies like internap and even ultra dns that offer similar products… strange amazon should invest

June 8th, 2008
4:55 AM PT
Rooby Roo said:

@john adams - check out haproxy, keepalived, ldirector, mod_throttle (?)
they all have automated sorry server capabilities. (link) is very sweet in my opinion, lots of features and flexibility and speed.

The web servers or some other monitoring system (monit maybe) should have been able to detect the dead backend servers and remove those from the mix.

I am off to get more salty snacks for the boss…

June 8th, 2008
2:28 PM PT
robotthink said:

Alistair and Geo,

It most certainly was a DOS attack, I assure you.

And Seattle-ite is right about everything he says in his post, and that was in part why the DOS was possible.

June 9th, 2008
10:21 AM PT
Avneesh Balyan said:

Seems like, The site is down again….

I was thinking of Amazon as Google in Shopping domain (item search, reviews etc..).
Time to re-think???

June 9th, 2008
12:01 PM PT

@John Adams

I think you haven’t been looking around very much. Modern load balancers (application delivery controllers) a la F5 BIG-IP, have configurable “apology” pages when all nodes are down. This technology has been around for quite a while, it’s not something new or unknown in the industry at all.

Lori

June 9th, 2008
12:05 PM PT
John Franks said:

Check out David Scott’s interview at the Business Forum: (link) . It seems Amazon (and a lot of other organizations suffering data thefts, outages, bad projects, etc.) needs his book!

June 9th, 2008
12:31 PM PT
BloggerBen said:

This may be a dumb question, but did the Amazon cloud go down too?

June 10th, 2008
5:41 AM PT

@john adams: Netscaler load balancers (as Amazon is rumored to use) can re-direct to a ‘Sorry Page’ when all backend services are down. It’s a simple config, and we require it on all our load balanced services.

That, and the fact that the HTTP error page that was presented looks just like the one generated by Netscalers when operating in proxy mode, indicates to me that the load balancer layer was up & functional, but there was nothing behind it to which to send traffic, and that there was no redirect enabled.

Having said that, re-directing a high volume site to a sorry page is a challenge itself. We maintain a load balanced pool of servers on a separate pair of load balancers, just to handle the sorry page from the load balanced applications.

–Mike

June 10th, 2008
2:02 PM PT

@Mel: I think that rumor is hilarious too. It wouldn’t surprise me that a bunch of gamers writing scripts to auto-buy available items could crash Amazon. That would be awesome.

June 11th, 2008
7:37 AM PT
Vito said:

I still have the headers in my scroll back buffer:

$ wget -S (link) -O /dev/null
--14:22:58-- (link)
=> `/dev/null'
Resolving (link) ... 72.21.210.11
Connecting to (link) |72.21.210.11|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 503 Service Unavailable
Server: NS_6.1
Content-Length:62
Connection: close
14:22:58 ERROR 503: Service Unavailable.

It appears that they are indeed running Citrix Netscalers (Server: NS_6.1) which is what returned the 503 error you see above.

“Sorry” pages only work if configured; they are not a default. Maybe Amazon hasn’t gotten around to that. ;)
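
For anyone who wants to pull the same clues out of a captured response, here is a small Python sketch that parses a status line and headers like the ones above. The `parse_response` and `looks_like_lb_error` helpers are hypothetical names for this example, and treating a `Server` value starting with `NS_` as a Netscaler is an assumption based on the header seen here, not a documented rule.

```python
from typing import Dict, Tuple


def parse_response(raw: str) -> Tuple[int, str, Dict[str, str]]:
    """Split a raw HTTP response head into status code, reason, and headers."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    _version, code, reason = lines[0].split(" ", 2)
    headers: Dict[str, str] = {}
    for ln in lines[1:]:
        # partition tolerates headers with no space after the colon.
        name, _, value = ln.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers


def looks_like_lb_error(status: int, headers: Dict[str, str]) -> bool:
    """Heuristic (assumption): a 503 whose Server header starts with 'NS_'."""
    return status == 503 and headers.get("server", "").startswith("NS_")


CAPTURED = """HTTP/1.1 503 Service Unavailable
Server: NS_6.1
Content-Length:62
Connection: close"""

status, reason, headers = parse_response(CAPTURED)
print(status, reason, headers.get("server"))
```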

June 13th, 2008
11:14 AM PT

So how do you manage and diagnose such complex systems? You are probably talking about thousands of devices, any of which could be the root cause of the issue. How do you isolate the root cause? Classically, you’d have some type of monitor with rules to detect when certain issues occurred. This MIGHT point you to the right location.

However, with IT being the main differentiator to the end customer, new changes are constantly being rolled out. The system is always growing and changing, and customer usage patterns shift over time. So the rules originally written tend to get loosened so that you don’t have alert storms. With the rules loosened, you might not detect the failure and, even if you do, the events will not be as helpful in finding the root cause. You could throw more and more smart people at the problem and constantly update and maintain your set of rules. However, as the system gets more and more complex, this human cost will grow at an enormous rate.

You need something different: a tool that automatically detects issues and adjusts to the changing system and usage patterns. You need a tool that uses statistical analytics to weed through the noise of the system and determine the relationship between the business information and the IT information, so that you can quickly get to the root cause of issues like this. If you got a single alert that told you that the load balancers were getting a higher than average reconnect rate, your number of sales was dropping below normal, your average load on all web servers was way below normal, the number of search transactions was way below normal, the number of connected users was way below normal, etc., you would have a quick idea of where to start. No need to experience all your lines of defense being down again.
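
As a very rough sketch of that idea, the Python below compares current readings against a learned baseline instead of fixed rules and reports every metric that has drifted far from normal in a single result. The metric names, sample numbers, and the 3-standard-deviation threshold are assumptions for illustration; a real tool would use far more robust statistics and handle seasonality.

```python
from statistics import mean, stdev
from typing import Dict, List


def zscore(history: List[float], current: float) -> float:
    """How many standard deviations the current value sits from its baseline."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0
    return (current - mu) / sigma


def anomalies(
    baselines: Dict[str, List[float]],
    current: Dict[str, float],
    threshold: float = 3.0,
) -> Dict[str, float]:
    """Return one combined alert: every metric whose z-score exceeds the threshold."""
    flagged: Dict[str, float] = {}
    for metric, history in baselines.items():
        z = zscore(history, current[metric])
        if abs(z) >= threshold:
            flagged[metric] = z
    return flagged


# Hypothetical business and IT metrics sampled during normal operation.
baselines = {
    "sales_per_min": [100, 102, 98, 101, 99],
    "lb_reconnects_per_min": [5, 6, 4, 5, 5],
}
current = {"sales_per_min": 10, "lb_reconnects_per_min": 5}
print(anomalies(baselines, current))
```

Because the baseline is recomputed from recent history, the thresholds move with the system rather than needing hand-tuned rules.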

June 19th, 2008
12:03 PM PT
Milly Dawson said:

It’s too bad about Amazon going down for a spell but maybe it inspired someone or several someones to check out their local library. I hope so.
