Friday, December 05, 2008

SimpleDB Drops Dead at ~1:45 PM PST on 12/4/2008

While I was implementing the Update Customers feature of my latest test harness, which substitutes Amazon SimpleDB as the data source for local and Amazon EC2 deployment, SimpleDB started returning 500 Internal Error messages instead of the expected responses. I noticed the problem between ~1:45 and ~2:00 PM PST on 12/4/2008.

Update 12/5/2008 7:00 AM PST: Service returned to normal at ~5:00 PM PST, but my test harness and domain were afflicted with SimpleDB Scrambled Attribute Disease (SAD).

Update 12/5/2008 1:00 PM PST: Added Amazon’s report about the cause of the problem (see end of post) and a mention of the post with a workaround for SAD.

Here’s a snapshot of the AWS Status page at about 2:45 PM (click images for full-scale captures):

You can read about other users’ experiences in the Receiving InternalErrors from SimpleDB thread of the Amazon SimpleDB (Beta) forum. I’m surprised that there weren’t more complaints in the thread.

The test harness I’m building is a clone of the Azure Table Services test harness described in Azure Storage Services Test Harness: Table Services 1 – Introduction and Overview and subsequent episodes. You can test most features of the Azure version by clicking here. The intent of the three test harnesses is to compare the performance of:

  1. Azure Table Services in Developer Table and Developer Fabric mode
  2. Azure Table Services in Azure Table and Developer Fabric mode
  3. Azure Table Services in Azure Table and Azure Fabric mode
  4. ASP.NET application deployed locally with local SQL Server Express instance (baseline)
  5. ASP.NET application deployed to Amazon EC2 with EC2 SQL Server Express instance
  6. ASP.NET application deployed locally with SimpleDB data source
  7. ASP.NET application deployed to Amazon EC2 with SimpleDB data source

Here’s the EC2/SimpleDB test harness’s UI:

[Notice the substitution of the City attribute value for ContactName, Phone for City, and PostalCode for Phone for the ALFKI item (entity) in the above screen capture. This was the first occurrence of what I call SimpleDB Scrambled Attribute Disease (SAD). See the end of this post for a more egregious example.]

Update 12/4/2008 3:30 PM PST: I found no evidence that “Error rates are beginning to decline.” In fact, at 3:30 PM PST, the status indicator reverted from a bogus “Service is operating normally” (green) state to an equally erroneous “Performance issues” (yellow) state as shown below:

I’m encountering 100% errors on ListDomainRequest() C# method calls with three retries. That spells Service disruption (red) to me.
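For readers who want to reproduce the "three retries" behavior, here is a minimal sketch in Python rather than the harness's C#. The names `ServiceError` and `call_with_retries` are hypothetical, and the exponential backoff between attempts is an assumption; the post does not say how the harness spaces its retries.

```python
import time


class ServiceError(Exception):
    """Hypothetical stand-in for a 500 Internal Error response."""


def call_with_retries(operation, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ServiceError, retry up to `retries` more times.

    Returns (result, attempts). Re-raises the last error if every attempt fails.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation(), attempts
        except ServiceError:
            if attempts > retries:
                raise  # initial call plus all retries failed
            # Assumed backoff schedule: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempts - 1))
```

With a 100% error rate, as during the outage, the wrapper makes four calls in total (one initial call plus three retries) and then surfaces the error to the caller.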

Update 12/4/2008 4:30 PM PST:

My app starts with a ListDomainRequest() call and creates the “Customers” domain if it isn’t present. I need both the ListDomainRequest() and CreateDomainRequest() methods to be operational because the service went down while I was executing a Delete/Recreate Domain operation: the domain was deleted but not recreated.
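The startup check described above can be sketched as follows. This is Python rather than the harness's C#, and `ensure_domain`, `list_domains` and `create_domain` are hypothetical stand-ins for the list and create calls, not the SimpleDB library's API.

```python
def ensure_domain(list_domains, create_domain, name="Customers"):
    """Create `name` only if list_domains() does not already report it.

    list_domains: hypothetical callable returning an iterable of domain names.
    create_domain: hypothetical callable that creates the named domain.
    Returns True if the domain had to be created, False if it already existed.
    """
    if name in list_domains():
        return False
    create_domain(name)
    return True


# Usage with an in-memory stand-in for the service:
domains = []  # state after a delete that was never followed by a recreate
created = ensure_domain(lambda: domains, domains.append)
```

Both calls must succeed for the app to start from a deleted domain, which is why an outage that hits the list and create operations blocks the harness entirely.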

Update 12/4/2008 5:30 PM PST: SimpleDB came back up at ~5:30 PM but somehow scrambled the sequence of entities. Here’s what the test harness looked like after executing the ListDomainRequest() and CreateDomainRequest() methods, followed by a QueryWithAttributesRequest() for the first 20 Customers items:

Notice that ContactTitle values have moved to the CustomerID column, CustomerID to CompanyName, Address to ContactTitle, and PostalCode to Fax. This happened for all entities, not just the first as in the earlier screen capture, which shows a different transposition. This isn’t the first time I’ve encountered the problem, but it’s the first time the problem affected more than one entity.
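One defensive pattern against transpositions like these is to place each value by its attribute name rather than by its position in the response. The sketch below is in Python rather than the harness's C#. It assumes that attributes arrive as (name, value) pairs in no guaranteed order and that each attribute has a single value. The `COLUMNS` list uses only the attribute names mentioned in this post, and `row_from_attributes` is a hypothetical helper, not the workaround from the follow-up post.

```python
# Display order for the grid; only the attribute names cited in this post.
COLUMNS = [
    "CustomerID", "CompanyName", "ContactName", "ContactTitle",
    "Address", "City", "PostalCode", "Phone", "Fax",
]


def row_from_attributes(attributes, columns=COLUMNS):
    """Build one display row from (name, value) pairs given in any order.

    Missing attributes become empty strings. If a name repeats, the last
    value wins (a simplification for single-valued attributes).
    """
    by_name = dict(attributes)
    return [by_name.get(column, "") for column in columns]


# Pairs deliberately out of display order (sample values for the ALFKI item):
shuffled = [
    ("Phone", "030-0074321"),
    ("CustomerID", "ALFKI"),
    ("City", "Berlin"),
    ("ContactName", "Maria Anders"),
]
row = row_from_attributes(shuffled)
```

Because each value is looked up by name, a change in the order of the returned pairs cannot move City into the ContactName column.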

I’m in the process of writing a separate post with more details about SimpleDB Scrambled Attribute Disease.

The Attempts to Cure SimpleDB’s Scrambled Attribute Disease post of 12/5/2008 details the problems with attribute sequence as a moving target.

Stay tuned for more updates if Amazon chooses to say anything about the outage.

Update 12/5/2008 1:00 PM PST: Matt@AWS noted in the Receiving InternalErrors from SimpleDB thread:

6:21 PM PST  To close the loop, at 1:47 PM PST, while performing an administrative operation on our domain mapping system, we experienced several connectivity issues and this combination of events caused a portion of the domain directory to become inaccessible. This affected all CreateDomain, ListDomain, DeleteDomain API requests and a high percentage of PutAttributes, DeleteAttributes, and GetAttributes API requests for domains in the inaccessible portion of the directory. Later this evening, we will deploy a fix to our directory connectivity software that will prevent reoccurrence of this issue.

The fact that only part of the domain directory became inaccessible probably explains why there were so few trouble reports on the Internet.