Windows Azure Armageddon
At 12:54 PM on 2/22/2013, I received the following DOWN Alert message from Pingdom for my OakLeaf Systems Azure Table Services Sample Project demo from my Cloud Computing with the Windows Azure Platform book:
When I attempted to run the sample Web Role project at http://oakleaf.cloudapp.net, I received an error message for an unhandled exception stating that an expired HTTPS certificate caused the problem. Here’s the Windows Azure team’s explanation:
Here’s Pingdom’s UP alert at 8:28 PM last night:
Prior to this incident, the sample project had run in the South Central (San Antonio, TX) data center for nine months within the 99.9% availability SLA, as reported in my monthly Uptime Reports. Here are the monthly uptime figures from June 2011 through January 2013, most recent first:
Month | Year | Uptime | Downtime | Outages | Response Time |
January | 2013 | 100.00% | 00:00:00 | 0 | 628 ms |
December | 2012 | 100.00% | 00:00:00 | 0 | 806 ms |
November | 2012 | 100.00% | 00:00:00 | 0 | 745 ms |
October | 2012 | 100.00% | 00:00:00 | 0 | 686 ms |
September | 2012 | 100.00% | 00:00:00 | 0 | 748 ms |
August | 2012 | 99.92% | 00:35:00 | 2 | 684 ms |
July | 2012 | 100.00% | 00:00:00 | 0 | 706 ms |
June | 2012 | 100.00% | 00:00:00 | 0 | 712 ms |
May | 2012 | 100.00% | 00:00:00 | 0 | 775 ms |
April | 2012 | 99.28% | 05:10:08 | 12 | 795 ms |
March | 2012 | 99.96% | 00:20:00 | 1 | 767 ms |
February | 2012 | 99.92% | 00:35:00 | 2 | 729 ms |
January | 2012 | 100.00% | 00:00:00 | 0 | 773 ms |
December | 2011 | 100.00% | 00:00:00 | 0 | 765 ms |
November | 2011 | 99.99% | 00:05:00 | 1 | 708 ms |
October | 2011 | 99.99% | 00:04:59 | 1 | 720 ms |
September | 2011 | 99.99% | 00:05:00 | 1 | 743 ms |
August | 2011 | 99.98% | 00:09:57 | 2 | 687 ms |
July | 2011 | 100.00% | 00:00:00 | 0 | 643 ms |
June | 2011 | 100.00% | 00:00:00 | 0 | 696 ms |
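For readers who want to check the Uptime column against the Downtime column, here's a minimal Python sketch of the arithmetic. It assumes uptime is simply one minus downtime divided by the length of the calendar month; that's my assumption, not Pingdom's documented method, and the function names are hypothetical.

```python
import calendar
from datetime import timedelta


def uptime_percent(year: int, month: int, downtime: timedelta) -> float:
    """Uptime as a percentage of the calendar month (assumed formula)."""
    days_in_month = calendar.monthrange(year, month)[1]
    month_length = timedelta(days=days_in_month)
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1.0 - downtime / month_length)


def sla_downtime_budget(year: int, month: int, sla_percent: float = 99.9) -> timedelta:
    """Downtime a month can absorb before dropping below the given SLA."""
    days_in_month = calendar.monthrange(year, month)[1]
    month_length = timedelta(days=days_in_month)
    return month_length * ((100.0 - sla_percent) / 100.0)


if __name__ == "__main__":
    # April 2012 row from the table: 05:10:08 of downtime in a 30-day month.
    april = uptime_percent(2012, 4, timedelta(hours=5, minutes=10, seconds=8))
    print(f"April 2012 uptime: {april:.2f}%")
    print(f"99.9% budget for a 30-day month: {sla_downtime_budget(2012, 4)}")
```

Under that assumption, a 99.9% SLA leaves roughly 43 minutes of downtime in a 30-day month, which is why April 2012 is the only month in the table that falls outside it.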
Following is the historical report for those services affected by the expired certificate:
It’s obvious that some minor functionary in the Windows Azure bureaucracy missed an item on his or her to-do list yesterday. It’s equally obvious that this is a helluva way to run a cloud service (with apologies to Peter Arno and John Luther “Casey” Jones).
Adrian Cockcroft (@adrianco) noted that “Azure had a cert outage a year ago” in a 2/23/2013 Tweet:
Microsoft’s Bill Laing posted a Summary of Windows Azure Service Disruption on Feb 29th, 2012, which was caused by a failure to create a “transfer certificate” on leap day, on 3/9/2012:
… So that the application secrets, like certificates, are always encrypted when transmitted over the physical or logical networks, the GA creates a “transfer certificate” when it initializes. The first step the GA takes during the setup of its connection with the HA is to pass the HA the public key version of the transfer certificate. The HA can then encrypt secrets and because only the GA has the private key, only the GA in the target VM can decrypt those secrets.
…
When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UST of the current day as the valid-from date and one year from that date as the valid-to date. The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail. …
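To make the leap day bug concrete, here's a minimal Python sketch of the failure mode the summary describes. It is not the GA's actual code; the function names are hypothetical, and the clamping fix shown is just one reasonable option, not necessarily what Microsoft shipped.

```python
import calendar
from datetime import date


def naive_valid_to(valid_from: date) -> date:
    """Buggy approach from the summary: take the date and add one to its year."""
    # For February 29 this asks for a day that does not exist in the next year,
    # so date() raises ValueError.
    return date(valid_from.year + 1, valid_from.month, valid_from.day)


def clamped_valid_to(valid_from: date) -> date:
    """One possible fix (an assumption): clamp to the last day of the target month."""
    year = valid_from.year + 1
    last_day = calendar.monthrange(year, valid_from.month)[1]
    return date(year, valid_from.month, min(valid_from.day, last_day))


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_valid_to(leap_day)
    except ValueError as err:
        print(f"naive calculation failed: {err}")
    print(f"clamped calculation: {clamped_valid_to(leap_day)}")
```

On any other day of the year the two functions agree, which is why a bug like this can sit unnoticed for years until February 29 arrives.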