Wednesday, March 18, 2009

Azure Services Outage 3/13/2009 – A Brief History

This post describes an Azure Services outage of about 22 hours duration that started at approximately 3/13/2009 10:30 PM PDT.

The problem was resolved at about 3/14/2009 8:24 PM PDT. Updates below are in reverse chronological order. The outage affected many, but not all, users.

Update 3/18/2009 8:30 AM PDT: The Azure Team posted The Windows Azure Malfunction This Weekend on 3/17/2009 at 7:34 PM PDT. The post contains an explanation of the outage’s cause and describes how the team intends to ameliorate the effects of similar events in the future.

Steve Marx posted from MIX 09 a mea culpa about his lack of communication during and after the outage: What I Learned From the Windows Azure Malfunction (3/17/2009 8:05 PM PDT).

According to the team post, a networking problem during a routine software update interfered with hosted-project deployment and caused a large number of servers to go offline simultaneously. The Fabric Controller was intentionally slow to respond to the resulting bulk request to reprovision instances for the downed projects, so the team performed a parallel update instead. The team now recommends that testers run two instances of their hosted services; the second instance won’t be included in quota calculations.

I changed the ServiceConfiguration.cscfg file’s <Instances count="1" /> element of my two demo Web Role projects:

  1. Table Test Harness  AppID = 000000004C002F22
  2. Blob Test Harness AppID = 000000004C006446

to <Instances count="2" /> in the Azure Services Developer Portal’s configuration mode.
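For reference, here’s a minimal sketch of what the edited ServiceConfiguration.cscfg looks like with the instance count raised to 2. The serviceName, role name, and setting shown are illustrative placeholders for the CTP-era schema, not copied from my projects:

```xml
<?xml version="1.0"?>
<!-- Minimal CTP-era service configuration; names are illustrative -->
<ServiceConfiguration serviceName="TableTestHarness"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebRole">
    <!-- Raising count from 1 to 2 runs a second instance; per the
         team post, the extra instance won't count against CTP quotas -->
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="AccountName" value="youraccount" />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```

Saving the edited configuration in the portal triggers the Updating status described below.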

Both services’ status changed to Updating for a minute or two and then showed:

  1. Initializing 2 (indicating initializing 2 instances)
  2. Started 1 Initializing 1 (one instance each)

During this time, both services behaved normally. Project #1 then changed to Started 1 Initializing 1 after about 30 minutes. Both services remained in this state thereafter.

Analytics’ Hourly Virtual Machine Usage (Hours) display shows that both instances of the two services are running.

Both projects remained in the above state for four hours (so far).

This post was transferred from Windows Azure and Cloud Computing Posts for 3/9/2009+ on 3/15/2009 at 9:00 AM PDT. I’ll update the post when the Azure team publishes the cause of the outage.

Update 3/14/2009 8:24 PM PDT: Steve Marx:

UPDATE [8:24PM PDT]: This issue is resolved.  Windows Azure is operating normally again.

When the whole team's back in the office on Monday, I'm sure we'll do a root cause analysis to understand exactly what went wrong and what we need to do to ensure it doesn't happen again.

Once we have that sorted out, I'll put together a summary.  (Probably won't get that out until after the MIX conference next week.)

Update 3/14/2009 3:36 PM PDT: Steve Marx:

We've identified and verified a recovery process that we're just now applying throughout the cloud.  ETA is five hours to complete recovery.  I'll post again when everything's back to normal.

My two Azure demo apps (blobs and tables) were back up and running at 4:30 PM PDT.

According to Steve Marx’s Windows Azure Outage message of 3/14/2009 10:30 AM PDT:

Windows Azure is currently experiencing an outage.  We are investigating but do not yet have an ETA for a resolution.  A large number of deployments are currently offline, and are slow to restart.

What is affected: Applications may be unreachable or in "stopped" or "initializing" states for long periods of time.

When the outage began: About 10:30pm PST last night.

Who is affected: Potentially anyone currently running an application in Windows Azure.

We will post updates to this thread throughout the day as we investigate and resolve the outage.  There is currently no ETA for a fix.

The OakLeaf Table Services and Blob Services demos are offline in the Initializing (1) state. Both demos were operating normally yesterday afternoon. Analytics shows they went down at about 3/14/2009 6:00 PM UTC, assuming Analytics has switched from PDT to UTC. (Steve’s Botomatic app is operational – see the “Live Windows Azure Apps, Tools and Test Harnesses” section.)


Matt Rogers @ MSFT said...

We recently posted an update on the Windows Azure team blog discussing the cause of the malfunction. In the future we are encouraging users of the CTP service to deploy their application with multiple instances of each role, and we will not count the additional instances against CTP quotas. Recently we have been finalizing our processes for notifying users of important events which we planned to roll out soon--several months ahead of commercial availability. That work is now accelerated.

TC said...

Reading between the lines, I suspect the sophistication of the Azure Fabric Controller was the root cause behind such a long outage.

I have blogged on this subject at: