Wednesday, October 21, 2009

Unscheduled 50-Minute Outage of My Windows Azure Apps in USA-Northwest on 10/16/2009 6:00 AM PT

I frequently see questions regarding monitoring the uptime of applications and data services running on the Azure Services Platform, so I decided to test three monitoring services that offer no-charge versions with my live Azure test harnesses.

Updated 10/21/2009 with Summary Report from Mon.itor.us for 10/12 through 10/18/2009 (see end of post.)

Updated 10/16/2009 with Julie Lerman’s Twitter commentary on her Azure problems this morning (see end of post.)

I’m running the monitoring services on the following three live Azure Data Services test harnesses, which my Cloud Computing with the Windows Azure Platform book describes:

Chapter 4: OakLeaf Systems Azure Table Services Sample Project

Chapter 4: OakLeaf Systems Azure Blob Services Test Harness

Chapter 8: OakLeaf Systems: Photo Gallery Azure Queue Test Harness

The test projects currently run in the USA-Northwest data center and have two instances specified to avoid outages during Production Fabric upgrades.

IsMyServerUp

IsMyServerUp reported problems with the Blob Services Test Harness on 10/15/2009 with Twitter direct messages sent at 5:45, 6:00 and 6:16 PM PT; IsMyServerUp tests at 15-minute intervals. Mon.itor.us didn’t report an outage, but it tests at one-hour intervals, so it might have missed the outage. Pingdom didn’t observe an outage because it tests the Table services only.
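To illustrate why the test interval matters, here is a minimal sketch that estimates the chance that a monitor testing at a fixed interval catches an outage of a given length. It assumes the outage starts at a random point relative to the test schedule and that a single failed test is enough to detect it. The function name is hypothetical, and the 31-minute figure is only the span between the first and last IsMyServerUp messages above, not a measured outage length.

```python
def detection_probability(outage_minutes: float, interval_minutes: float) -> float:
    """Chance that at least one test falls inside the outage window.

    Assumes tests run every `interval_minutes` and that the outage start is
    uniformly random relative to the test schedule (an assumption).
    """
    if outage_minutes < 0 or interval_minutes <= 0:
        raise ValueError("outage must be >= 0 and interval must be > 0")
    return min(1.0, outage_minutes / interval_minutes)


# Hypothetical 31-minute outage (5:45 PM to 6:16 PM) under each test interval.
for name, interval in [("15-minute interval", 15), ("one-hour interval", 60)]:
    p = detection_probability(31, interval)
    print(f"{name}: {p:.0%} chance of at least one failed test")
```

Under these assumptions, a 15-minute interval always catches an outage of that length, while a one-hour interval catches it only about half the time.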

IsMyServerUp reported problems with the Blob Services Test Harness on 10/16/2009 at 5:58 AM and 6:00 AM and a problem with the Azure Table Services Sample Project at 5:59 AM. Here’s a screen capture of IsMyServerUp’s report as of 10/16/2009 at 10:00 AM PT:

Here’s a capture of the last seven direct messages sent by IsMyServerUp as of 11:50 AM PT:

 

Mon.itor.us

Mon.itor.us reported a problem with the Blob Services Test Harness on 10/16/2009 at 5:58 AM PT and recovery at 6:21 AM PT. It also reported the Photo Gallery Azure Queue Test Harness down at 7:01 AM and the Azure Table Services Sample Project down at 7:05 AM PT. Following is a capture of Mon.itor.us’ summary and individual service reports for the early morning of 10/16/2009:

Mon.itor.us appears to have missed the Queue Test Harness outage reported by IsMyServerUp about an hour after the Blob Test Harness outage.

Pingdom.com

Pingdom reported at 6:19 AM PT that the Table Services Sample Project had been down since 5:58:21 AM, and at 6:48 AM reported “Azure Tables (oakleaf.cloudapp.net) is UP again at 10/16/2009 06:48:21AM, after 50m of downtime”. Here’s Pingdom’s report for the early morning of 10/16/2009:
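As a quick check of the 50-minute figure, the following sketch computes the interval between the two Pingdom timestamps. The down time is written here in the same format as Pingdom's "UP again" message, which is an assumption about how Pingdom would render it.

```python
from datetime import datetime

# Format matching Pingdom's "10/16/2009 06:48:21AM" message.
FMT = "%m/%d/%Y %I:%M:%S%p"

# Down time (5:58:21 AM) rewritten in the same format; an assumption.
down_at = datetime.strptime("10/16/2009 05:58:21AM", FMT)
up_at = datetime.strptime("10/16/2009 06:48:21AM", FMT)

downtime = up_at - down_at
minutes = downtime.total_seconds() / 60
print(f"Downtime: {minutes:.0f} minutes")
```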

Commentary

I have not yet seen specific reports by others of the outages reported here, other than Julie Lerman’s problems this morning as noted in this composed Tweet collection:

If Julie had been running a monitoring service, she might have saved the “Wasted hour.”

There are minor inconsistencies between the three reports, but it appears that all three monitors would be useful for initial testing of the Azure Services Platform Web and Data Services availability. IsMyServerUp is my favorite, because it reports the last error message details. However, if I were to use a monitoring service to back up claims for Service Level Agreement (SLA) breaches, I would pick Pingdom because of its maturity and widespread acceptance by the site hosting industry.

Update 10/21/2009: Summary Report from Mon.itor.us for 10/12 through 10/18/2009

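| Test                               | Type | Tag        | Uptime  | Resp Time | OK   | NOK |
|------------------------------------|------|------------|---------|-----------|------|-----|
| oakleaf.cloudapp.net               | http | AzureTable | 99.70%  | 616.42    | 668  | 2   |
| oakleaf2.cloudapp.net/Default.aspx | http | AzureBlob  | 99.25%  | 1047.44   | 664  | 5   |
| oakleaf5.cloudapp.net              | http | AzureQueue | 100.00% | 399.74    | 671  | 0   |
| www.google.com/                    | http | Benchmark  | 100.00% | 93.44     | 662  | 0   |
| Total Average                      |      |            | 99.74%  | 539.34    | 2665 | 7   |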
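The Uptime figures in the summary report appear to be the share of successful tests, OK / (OK + NOK). That formula is an inference from the numbers, not something Mon.itor.us documents here. A minimal sketch that recomputes the column from the reported counts:

```python
# (OK, NOK) test counts copied from the Mon.itor.us summary report.
counts = {
    "AzureTable": (668, 2),
    "AzureBlob": (664, 5),
    "AzureQueue": (671, 0),
    "Benchmark": (662, 0),
}


def uptime_percent(ok: int, nok: int) -> float:
    """Share of successful tests as a percentage (assumed formula)."""
    total = ok + nok
    if total == 0:
        raise ValueError("no tests recorded")
    return 100.0 * ok / total


for tag, (ok, nok) in counts.items():
    print(f"{tag}: {uptime_percent(ok, nok):.2f}%")

# Overall uptime across all four tests.
total_ok = sum(ok for ok, _ in counts.values())
total_nok = sum(nok for _, nok in counts.values())
print(f"Total: {uptime_percent(total_ok, total_nok):.2f}%")
```

Rounded to two decimals, the recomputed values agree with the reported Uptime column, including the 99.74% total.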

 
