Friday, April 22, 2011

Oh no, what have I done! Or: My cloud evangelism got cloudy. Or: The dog ate my network.

At the recent MySQL User Conference, I gave a talk on how we at Recorded Future use Amazon EC2 to keep our servers humming (the slides for the talk are available here). And of course, Amazon EC2 turned on me (and all of us at RF) about a week later. I will not go into details, but somehow, we still don't know exactly why ("The cleaning lady unplugged THE SERVER to plug in the vacuum-cleaner", "The dog ate my network"?).

The thing has been down for 24+ hours now, and there is no end in sight, as far as I can tell. As I said in my talk, we are considering a move to Amazon RDS instead of running our MySQL servers ourselves, and one of my first reactions to this trouble was that we really should have done that already. That was until I realized that the Amazon RDS service was affected as well. Which all goes to show: the more things you put in one SPOF, the more things will fail when that SPOF fails. And we are not alone; Reddit, Quora and many more in the Amazon us-east-1 region are in a similar situation. I wonder how the other database HA solutions for Amazon survived (Xeround et al.)? Did they do OK or not? If they did, this would be a selling argument for them. And if I were Rackspace (which I am not), I would launch a campaign right now...

Today we have been trying to set up our services in another Availability Zone. Our EBS disks are no good, but the snapshots are, so we should have something up and running real soon.
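For those curious, the recovery path is essentially to create fresh EBS volumes from the snapshots in a healthy Availability Zone and attach them to replacement instances there. Here is a minimal sketch using the boto Python library; the snapshot id, instance id, zone and volume size below are made-up placeholders, not our actual setup:

import time
import boto

# Connect to EC2 using credentials from the environment / boto config.
conn = boto.connect_ec2()

snapshot_id = 'snap-12345678'   # hypothetical snapshot of the MySQL data volume
target_zone = 'us-east-1d'      # an Availability Zone not hit by the outage

# Create a new EBS volume from the snapshot, but in the healthy zone.
volume = conn.create_volume(size=100, zone=target_zone, snapshot=snapshot_id)

# Wait for the volume to become available, then attach it to the replacement instance.
while volume.update() != 'available':
    time.sleep(5)

conn.attach_volume(volume.id, 'i-87654321', '/dev/sdf')  # hypothetical instance id

After that it is just a matter of mounting the filesystem and pointing MySQL at the data directory as usual.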

Cheers
/Karlsson
Who will stay with EC2, but will look at managing more things myself and at preparing a solid backup plan that does not depend on Amazon intervention (Amazon's statement that the Availability Zones are isolated wasn't really true, it seems). And who will not translate HA into "I let Amazon handle that"...

5 comments:

Unknown said...

Thanks for your question. The Xeround service remained fully available even when the outages occurred, thanks to our high availability strategy and distributed technology.

Karlsson said...

That's great! Thanx for the input!

/Karlsson

rpbouman said...

Hi Anders!

"I wonder how the other database HA Solutions for Amazon survived"

Apparently, one of Amazon's datacenters was affected. There is a very insightful article right here:
http://blogs.gartner.com/lydia_leong/2011/04/21/amazon-outage-and-the-auto-immune-vulnerabilities-of-resiliency/ which suggests Amazon users could actually have shielded themselves from going down. Not sure how feasible that would be, but the article sure is worth a read.

Karlsson said...

Thanx Roland.

As far as the datacenters are concerned, I'm not so sure it is completely true that just one center was affected. From what I understand, it originally happened in one center, but the effects actually migrated and affected some other centers.

I'll sure read the article; it seems interesting now that we are about to have a look at our options to fix this in the future.

/Karlsson

Karlsson said...

Let me correct myself before someone else points it out. The Amazon issues did, it seems, spread across AZs, but not across datacenters; this was me commenting a bit too early in the morning. So yes, only one datacenter was affected, but multiple AZs were, and these are not as isolated from each other as one might think. The datacenters do seem to be, though.

Cheers
/Karlsson