Saturday, February 23, 2013

Windows Azure Armageddon

At 12:54 PM on 2/22/2013, I received the following DOWN alert from Pingdom for the OakLeaf Systems Azure Table Services Sample Project, the demo from my book Cloud Computing with the Windows Azure Platform:

image
When I attempted to run the sample Web Role project at http://oakleaf.cloudapp.net, I received an error message for an unhandled exception stating that an expired HTTPS certificate caused the problem. Here’s the Windows Azure team’s explanation:

image

Here’s Pingdom’s UP alert at 8:28 PM last night:

image

Prior to this incident, the sample project had run in the South Central (San Antonio, TX) data center for nine consecutive months within the 99.9% availability SLA, as reported in my monthly Uptime Reports. Here’s the monthly uptime data, latest month first, going back to June 2011:

Month           Uptime    Downtime  Outages  Response Time
January 2013    100.00%   00:00:00  0        628 ms
December 2012   100.00%   00:00:00  0        806 ms
November 2012   100.00%   00:00:00  0        745 ms
October 2012    100.00%   00:00:00  0        686 ms
September 2012  100.00%   00:00:00  0        748 ms
August 2012     99.92%    00:35:00  2        684 ms
July 2012       100.00%   00:00:00  0        706 ms
June 2012       100.00%   00:00:00  0        712 ms
May 2012        100.00%   00:00:00  0        775 ms
April 2012      99.28%    05:10:08  12       795 ms
March 2012      99.96%    00:20:00  1        767 ms
February 2012   99.92%    00:35:00  2        729 ms
January 2012    100.00%   00:00:00  0        773 ms
December 2011   100.00%   00:00:00  0        765 ms
November 2011   99.99%    00:05:00  1        708 ms
October 2011    99.99%    00:04:59  1        720 ms
September 2011  99.99%    00:05:00  1        743 ms
August 2011     99.98%    00:09:57  2        687 ms
July 2011       100.00%   00:00:00  0        643 ms
June 2011       100.00%   00:00:00  0        696 ms
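
For readers who want to check the uptime figures against the 99.9% SLA, here is a minimal Python sketch. It assumes uptime is simply one minus downtime divided by the length of the calendar month, which reproduces the table's numbers but may not match Pingdom's exact method. The function names are mine, and the February 2013 figure is only an estimate that treats the time between the DOWN alert (12:54 PM) and the UP alert (8:28 PM) as the whole outage.

```python
import calendar
from datetime import timedelta

SLA_PERCENT = 99.9  # availability threshold cited in this post


def uptime_percent(year, month, downtime):
    """Percentage of the calendar month that was not downtime."""
    days_in_month = calendar.monthrange(year, month)[1]
    return 100.0 * (1 - downtime / timedelta(days=days_in_month))


def allowed_downtime(year, month, sla_percent=SLA_PERCENT):
    """Downtime budget for the month at the given SLA percentage."""
    days_in_month = calendar.monthrange(year, month)[1]
    return timedelta(days=days_in_month) * (1 - sla_percent / 100.0)


# April 2012 row from the table: 05:10:08 of downtime.
april_2012 = uptime_percent(2012, 4, timedelta(hours=5, minutes=10, seconds=8))

# Rough estimate for this incident: 12:54 PM to 8:28 PM is 7 h 34 min.
feb_2013_estimate = uptime_percent(2013, 2, timedelta(hours=7, minutes=34))

for label, value in [("April 2012", april_2012),
                     ("February 2013 (estimate)", feb_2013_estimate)]:
    print(f"{label}: {value:.2f}% (meets SLA: {value >= SLA_PERCENT})")

print("Downtime budget for a 28-day month:", allowed_downtime(2013, 2))
```

Under these assumptions a single outage of this length uses up many times the monthly downtime budget, so February 2013 cannot meet the SLA no matter how the rest of the month goes.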

Following is the historical report for those services affected by the expired certificate:

image


image
image
image
image

image

It’s obvious that some minor functionary in the Windows Azure bureaucracy missed an item on his or her to-do list yesterday. It’s equally obvious that this is a helluva way to run a cloud service (with apologies to Peter Arno and John Luther (Casey) Jones).

Adrian Cockcroft (@adrianco) noted that “Azure had a cert outage a year ago” in a 2/23/2013 Tweet:

image

On 3/9/2012, Microsoft’s Bill Laing posted a Summary of Windows Azure Service Disruption on Feb 29th, 2012. That disruption was caused by a leap-day bug in calculating the expiration date of a “transfer certificate”:

… So that the application secrets, like certificates, are always encrypted when transmitted over the physical or logical networks, the GA [guest agent] creates a “transfer certificate” when it initializes. The first step the GA takes during the setup of its connection with the HA [host agent] is to pass the HA the public key version of the transfer certificate. The HA can then encrypt secrets and because only the GA has the private key, only the GA in the target VM can decrypt those secrets.

When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UST [UTC] of the current day as the valid-from date and one year from that date as the valid-to date. The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail. …
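
The leap-day arithmetic Microsoft describes is easy to reproduce. The Python sketch below is my own illustration, not the guest agent's actual code: `naive_valid_to` and `safe_valid_to` are hypothetical names, and falling back to February 28 is just one reasonable choice (rolling forward to March 1 is another).

```python
from datetime import date


def naive_valid_to(valid_from):
    """Mirror the described bug: add one to the year, keep month and day."""
    return valid_from.replace(year=valid_from.year + 1)


def safe_valid_to(valid_from):
    """Add one year, falling back to Feb 28 when Feb 29 does not exist."""
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        # Only Feb 29 can fail here, because the target year is not a leap year.
        return valid_from.replace(year=valid_from.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_valid_to(leap_day)
except ValueError as err:
    # February 29, 2013 is not a real date, so constructing it fails.
    print("naive calculation failed:", err)

print("safe calculation:", safe_valid_to(leap_day))
```

On any other day of the year the two functions agree, which is why a bug like this can sit unnoticed until a leap day arrives.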
