Friday, December 14, 2007

Amazon Announces Beta of SimpleDB Web Services in the Cloud

Amazon announced in their Amazon SimpleDB™ - Limited Beta post of December 14, 2007 that Amazon Web Services, LLC will start selling pay-per-GB storage in, and traffic to, a non-relational, attribute-based database service that's integrated with Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2). The database service supports REST and SOAP request/response operations.

Updated 12/15/2007, 12/16/2007, 12/18/2007, and 12/21/2007

A SimpleDB domain is a schemaless container (table or catalog) for items, which in turn contain values. A domain has a maximum size of 10 GB. Items can be added to or removed from the domain without affecting other items.

Items are hash tables of attribute-value pairs; the attributes correspond to columns. Item names and attribute values can range from 1 to 1,024 characters in length. An item can contain a maximum of 256 attribute-value pairs.

Attribute values (cells) are strings that the service indexes automatically as you add them; UTF-8 string is the sole data type. This means you must left-pad integers with zeros, offset negative integers, and use ISO 8601 (yyyy-mm-dd) or similar formats for dates to maintain lexicographical order. A domain is limited to 250 million cells.
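The padding, offset and date rules above can be sketched in a few lines of Python. This is only an illustration: the WIDTH and OFFSET constants and the function names are assumptions chosen for the example, not SimpleDB settings or API calls.

```python
from datetime import date

# Assumed parameters for illustration only; they are not SimpleDB settings.
WIDTH = 10           # number of digits every encoded integer is padded to
OFFSET = 1_000_000   # added to each integer so negative values become non-negative


def encode_int(n: int, width: int = WIDTH, offset: int = OFFSET) -> str:
    """Offset and zero-pad an integer so that string order matches numeric order."""
    shifted = n + offset
    if shifted < 0 or len(str(shifted)) > width:
        raise ValueError(f"{n} is outside the encodable range")
    return str(shifted).zfill(width)


def decode_int(text: str, offset: int = OFFSET) -> int:
    """Reverse encode_int."""
    return int(text) - offset


def encode_date(d: date) -> str:
    """ISO 8601 (yyyy-mm-dd) dates already sort correctly as strings."""
    return d.isoformat()


values = [15, -3, 200, 7]
print(sorted(str(v) for v in values))          # plain strings: not in numeric order
print(sorted(encode_int(v) for v in values))   # encoded strings: numeric order
print(encode_date(date(2007, 12, 14)))
```

The offset must be at least as large as the most negative value you expect to store, and the width must accommodate the largest shifted value; both have to be fixed before any data is written, because changing them later would require re-encoding every stored value.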

Attributes can have multiple values, which gives SimpleDB a directory-like (e.g. LDAP) flavor but no hierarchy. However, it might be suitable as a central host for an identity system like OpenID. (This assumes partitioning could overcome the 250-million cell limit.)

Here's a sample five-item domain from the Getting Started Guide's "Putting Data into a Domain" topic:

ID      | Category  | Subcat.   | Name            | Color              | Size                 | Make | Model | Year
Item_01 | Clothes   | Sweater   | Cathair Sweater | Siamese            | Small, Medium, Large |      |       |
Item_02 | Clothes   | Pants     | Designer Jeans  | Paisley, Acid Wash | 30x32, 32x32, 32x34  |      |       |
Item_03 | Clothes   | Pants     | Sweat-pants     | Blue, Yellow, Pink | Large                |      |       | 2006, 2007
Item_04 | Car Parts | Engine    | Turbos          |                    |                      | Audi | S4    | 2000, 2001, 2002
Item_05 | Car Parts | Emissions | 02 Sensor       |                    |                      | Audi | S4    | 2000, 2001, 2002
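To make the shape of the sample domain concrete, here is a minimal in-memory sketch in Python. It models a domain as a dictionary of items and each item as a mapping from attribute name to a set of values. The put_attributes function is a hypothetical local helper, not the SimpleDB API; the only limit it checks is the 1,024-character one quoted above.

```python
from collections import defaultdict

MAX_LENGTH = 1024  # character limit for item names and attribute values quoted in the post


def put_attributes(domain: dict, item_name: str, pairs: list[tuple[str, str]]) -> None:
    """Add attribute-value pairs to an item, creating the item if needed.

    Hypothetical local helper for illustration; it does not call SimpleDB.
    """
    for text in [item_name] + [value for _, value in pairs]:
        if not 1 <= len(text) <= MAX_LENGTH:
            raise ValueError(f"length {len(text)} is outside 1..{MAX_LENGTH}")
    item = domain.setdefault(item_name, defaultdict(set))
    for name, value in pairs:
        item[name].add(value)  # an attribute can hold several values


domain: dict = {}
put_attributes(domain, "Item_01", [
    ("Category", "Clothes"), ("Subcat.", "Sweater"), ("Name", "Cathair Sweater"),
    ("Color", "Siamese"), ("Size", "Small"), ("Size", "Medium"), ("Size", "Large"),
])
put_attributes(domain, "Item_04", [
    ("Category", "Car Parts"), ("Subcat.", "Engine"), ("Name", "Turbos"),
    ("Make", "Audi"), ("Model", "S4"),
    ("Year", "2000"), ("Year", "2001"), ("Year", "2002"),
])

# Items in the same domain need not share attributes; there is no schema.
print(sorted(domain["Item_01"]))
print(sorted(domain["Item_04"]["Year"]))
```

Item_01 has Color and Size but no Make, Model or Year, while Item_04 is the reverse; nothing in the domain has to be declared before either item is stored.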

Charles Ying's What You Need To Know About Amazon SimpleDB post of December 13, 2007 provides an alpha user's insights about SimpleDB. Ying reports that SimpleDB is written in Erlang, which is a general-purpose concurrent programming language and runtime system that is designed for highly parallel data processing. The Product: Amazon's SimpleDB and The Current Pros and Cons List for SimpleDB posts on the High-Scalability site provide links to, and a summary of, others' opinions of SimpleDB and its architecture.

Amazon S3 is designed for storing relatively large objects; SimpleDB is designed for fast, indexed access to small objects. SimpleDB can interact with S3 to create, for example, a SimpleDB index to large objects stored in S3.
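The index idea can be sketched as follows. Two Python dictionaries stand in for an S3 bucket and a SimpleDB domain, so nothing here calls either service; the "S3Key" attribute name, the key layout and both function names are assumptions made up for the example.

```python
# In-memory stand-ins; a real client would call the S3 and SimpleDB services.
s3_bucket: dict[str, bytes] = {}
sdb_domain: dict[str, dict[str, str]] = {}


def store_document(item_name: str, body: bytes, metadata: dict[str, str]) -> None:
    """Put the large body in 'S3' and index its small metadata in 'SimpleDB'."""
    s3_key = f"documents/{item_name}"   # hypothetical key layout
    s3_bucket[s3_key] = body
    # The item holds only short strings plus a pointer to the large object.
    sdb_domain[item_name] = {**metadata, "S3Key": s3_key}


def find_bodies(attribute: str, value: str) -> list[bytes]:
    """Find matching items by attribute, then fetch their bodies from 'S3'."""
    return [
        s3_bucket[attrs["S3Key"]]
        for attrs in sdb_domain.values()
        if attrs.get(attribute) == value
    ]


store_document("doc-1", b"A" * 100_000, {"Author": "Ying", "Year": "2007"})
store_document("doc-2", b"B" * 50_000, {"Author": "Obasanjo", "Year": "2007"})
print([len(body) for body in find_bodies("Author", "Ying")])
```

The large payload never has to fit in an attribute value; only the short key that locates it does.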

SimpleDB's REST API uses HTTP GET requests and returns Plain Old XML (POX) responses. Following is a sample GET query request from the Developer Guide's "Query" topic:

https://sdb.amazonaws.com/
?Action=Query
&AWSAccessKeyId=[valid access key id]
&DomainName=MyDomain
&MaxNumberOfItems=3
&NextToken=[valid next token]
&QueryExpression=%5B%27Color%27%3D%27Blue%27%5D
&SignatureVersion=1
&Timestamp=2007-06-25T15%3A03%3A09-07%3A00
&Version=2007-11-07
&Signature=2wVXB1x0NSWWETwLylZPVP%2FtqXQ%3D

and the POX response:

<QueryResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07">
  <QueryResult>
    <ItemName>eID001</ItemName>
    <ItemName>eID002</ItemName>
    <ItemName>eID003</ItemName>
  </QueryResult>
  <ResponseMetadata>
    <RequestId>c74ef8c8-77ff-4d5e-b60b-097c77c1c266</RequestId>
    <BoxUsage>0.0000219907</BoxUsage>
  </ResponseMetadata>
</QueryResponse>

(I still believe the Astoria team should support POX responses from URI GET queries.)
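For readers who want to experiment with the samples above, this short Python sketch decodes the URL-encoded QueryExpression parameter and extracts the item names and BoxUsage value from the sample response. It uses only the standard library and the literal sample text; parse_query_response is a hypothetical helper name, not part of any Amazon library.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, unquote

# The QueryExpression parameter is percent-encoded in the request URL.
encoded = "%5B%27Color%27%3D%27Blue%27%5D"
expression = unquote(encoded)
print(expression)                    # the bracketed predicate in readable form
print(quote(expression, safe=""))    # encoding it again for use in a URL

# The sample response copied from the post.
RESPONSE = """<QueryResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07">
  <QueryResult>
    <ItemName>eID001</ItemName>
    <ItemName>eID002</ItemName>
    <ItemName>eID003</ItemName>
  </QueryResult>
  <ResponseMetadata>
    <RequestId>c74ef8c8-77ff-4d5e-b60b-097c77c1c266</RequestId>
    <BoxUsage>0.0000219907</BoxUsage>
  </ResponseMetadata>
</QueryResponse>"""

# Every element is in the default namespace, so lookups must be qualified.
NS = {"sdb": "http://sdb.amazonaws.com/doc/2007-11-07"}


def parse_query_response(xml_text: str) -> tuple[list[str], float]:
    """Return (item names, box usage) from a QueryResponse document."""
    root = ET.fromstring(xml_text)
    names = [e.text for e in root.findall("sdb:QueryResult/sdb:ItemName", NS)]
    usage = float(root.findtext("sdb:ResponseMetadata/sdb:BoxUsage", namespaces=NS))
    return names, usage


names, usage = parse_query_response(RESPONSE)
print(names, usage)
```

Note that the query returns item names only; per the sample, attribute values are not part of the Query response.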

Query operators include =, !=, <, >, <=, >=, STARTS-WITH, AND, OR, NOT, INTERSECTION and UNION. Queries return groups of results; the size of the group is determined by the MaxNumberOfItems query parameter, which has a maximum of 250 and defaults to 100. The NextToken value determines the starting point of the group.
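The paging behavior described above can be sketched with a loop that keeps passing the returned token back until none is returned. The fake_query function below is an in-memory stand-in for the Query action, written only for this example; it uses a list offset as its token, whereas a real NextToken should be treated as an opaque string.

```python
from typing import Callable, Iterator, Optional

Page = tuple[list[str], Optional[str]]  # (item names, next token or None)


def fake_query(items: list[str]) -> Callable[[int, Optional[str]], Page]:
    """Build a stand-in for the Query action over an in-memory list of item names."""
    def query(max_items: int = 100, next_token: Optional[str] = None) -> Page:
        if not 1 <= max_items <= 250:
            raise ValueError("MaxNumberOfItems must be between 1 and 250")
        # Assumption: this stub's token is simply the offset of the next group.
        start = int(next_token) if next_token else 0
        end = start + max_items
        token = str(end) if end < len(items) else None
        return items[start:end], token
    return query


def query_all(query: Callable[[int, Optional[str]], Page],
              max_items: int = 100) -> Iterator[str]:
    """Yield every item name by following the next token from group to group."""
    token: Optional[str] = None
    while True:
        names, token = query(max_items, token)
        yield from names
        if token is None:
            break


item_names = [f"eID{n:03d}" for n in range(1, 8)]   # eID001 .. eID007
query = fake_query(item_names)
print(query(3))                    # first group and the token for the next one
print(list(query_all(query, 3)))   # all groups, stitched together
```

With seven items and a group size of three, the loop makes three requests and the last one returns no token.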

Update 12/21/2007: Dare Obasanjo weighs in with his Amazon SimpleDB: The Good, the Bad and the Ugly post of December 21, 2007. He likes "Commoditizing hosted services and getting people to think outside the relational database box," dislikes "Eventual Consistency and Data Values are Weakly Typed," and detests "Web Interfaces, that Claim to be RESTful but Aren’t." I agree with Dare on all topics except eventual consistency. GET requests for data updates are atrocious.

Replication between nodes means that newly uploaded or modified data won't be visible immediately to all datacenter connections that have domain copies. Developers must abandon Atomicity, Consistency, Isolation, Durability (ACID) transaction concepts and accept Basically Available Soft-state Eventually-consistent (BASE) behavior. (I believe eBay Technical Fellow Dan Pritchett originated the BASE acronym in his "The Challenges of Latency" InfoQ article of May 2, 2007; Michael Nygard discusses BASE in his November 9, 2007 Architecting for Latency post.)
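A toy model makes the consequence for client code easier to see. The class below is purely illustrative and says nothing about how SimpleDB actually replicates data: a write lands on one replica immediately and reaches the others only when sync() is called, which stands in for the replication delay.

```python
class EventuallyConsistentStore:
    """Toy model: writes hit replica 0 at once and the others only after sync()."""

    def __init__(self, replica_count: int = 3) -> None:
        self.replicas: list[dict[str, str]] = [{} for _ in range(replica_count)]
        self.pending: list[tuple[str, str]] = []   # writes not yet propagated

    def put(self, key: str, value: str) -> None:
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def get(self, key: str, replica: int = 0):
        """Read from one replica; the result may be stale or missing."""
        return self.replicas[replica].get(key)

    def sync(self) -> None:
        """Propagate pending writes, in order, to the remaining replicas."""
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.put("Item_01.Color", "Siamese")
print(store.get("Item_01.Color", replica=0))   # the replica that took the write
print(store.get("Item_01.Color", replica=1))   # another replica, before propagation
store.sync()
print(store.get("Item_01.Color", replica=1))   # the same replica, after propagation
```

Code that writes a value and immediately reads it back through a different connection has to tolerate the middle case, for example by retrying or by not depending on read-after-write at all.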

Amazon's Getting Started Guide includes C# and VB sample code for the REST API.

Question: ADO.NET Data Services (Project Astoria) clients have LINQ to REST in CTP1. How long will it be until some enterprising LINQ addict creates LINQ for SimpleDB?

Backstory: There are similarities between Amazon's approach to the attribute-based database structure and Google Base, but Google (not the developer) specifies the attribute names and their data formats. The GData API for Google Base uses an extended Atom Pub protocol. SimpleDB's freeform database design and simple REST URI and SOAP APIs are nice touches for developers, but Atom Pub might have been a better choice for both REST queries and data entry/update.

Following are a couple of quotes about large-scale databases in the cloud from my Very Large Databases: Bricks, BitVault and BigTable post of April 6, 2006:

Dare Obasanjo's "Greg Linden on SQL Databases and Internet-Scale Applications" post quotes Greg Linden:

"What I want is a robust, high performance virtual relational database that runs transparently over a cluster, nodes dropping in an out of service at will, read-write replication and data migration all done automatically. I want to be able to install a database on a server cloud and use it like it was all running on one machine."

Greg Linden wants a relational database, which SimpleDB isn't, but otherwise it fits most of his requirements. It's interesting to note that Greg "was at Amazon.com from 1997 to 2002 where [he] wrote the recommendation engine used by Amazon.com and later led the software team that developed Amazon's personalization systems."

Adam Bosworth, who managed the initial development of Microsoft Access, went on to found CrossGain (together with Tod Nielsen, Access marketing honcho), sold CrossGain to BEA, became BEA's chief architect and senior vice president, and is now VP Engineering at Google. He lists these three features that database users want but database vendors don't supply: dynamic schema, dynamic partitioning of data across large dynamic numbers of machines, and modern [Googlesque] indexing. Adam wants the Open Source community to "[g]ive us systems that scale linearly, are flexible and dynamically reconfigurable and load balanced and easy to use." Adam does mean give, not sell.

It's my opinion that SimpleDB would meet almost all of Adam's requirements (except free).

GigaOm and TechCrunch have columns today on SimpleDB. Neither analysis seems to me to be on the mark.

Note: There are numerous online databases and forms applications, a few of which are free. For example, blist is a recent startup with plans to provide a scalable database in the cloud that anyone can set up and use. blist appears to share SimpleDB's attribute value flexibility. It's not yet in open beta so I haven't been able to evaluate it or learn anything about blist's business plan. Apparently blist will compete with DabbleDB and other online (Web-based) database services. My Dabble DB: The New Look in Web Databases post of March 19, 2006 analyzed the initial DabbleDB release in the light of its then competitors and Ray Ozzie's widely circulated "Internet Services Disruption" memo of October 28, 2005.

Dave Winer says in his Amazon removes the database scaling wall post of December 15, 2007:

It's amazing that Microsoft and Google are sitting by and letting Amazon take all this ground in developer-land without even a hint of a response. It seems likely they have something in the works. Let's hope there's some compatibility.

Suggestion: It would be very interesting if Microsoft decided to provide developers multiple free .NET Data Services (née Astoria Project) with SQL Server back ends limited to a total of 1 GB or so, similar to the amount of free space offered by Windows Live's SkyDrive. (The current limit for a "Create Your Own Online Data Service" at astoria.mslivelabs.com is one 100-MB database, which is not generous, even for a free service.) Start charging for storage over 1 GB and traffic in 100-MB or 1-GB per month increments. Alternatively, clone Amazon's pricing model.

On the whole, I believe most developers will be more interested in a reliable, scalable relational store "in the cloud" than SimpleDB's attribute-based model. The 1,024-char maximum length of an attribute value seems to me to be a serious limitation. However, it might be overcome by a link to an S3 object, assuming no serious performance hit.

Update 2/14/2008: See Dare Obasanjo's Amazon SimpleDB: The Good, the Bad and the Ugly post of December 21, 2007 for another analysis of SimpleDB's feature set.

Justin Etheredge released LINQ to SimpleDB Alpha 1 (source code and runtime binary) to CodePlex on January 23, 2008 under the Microsoft Public License (Ms-PL). His LinqToSimpleDB Preview post of January 19, 2008 offers operating instructions. See also my LINQ and Entity Framework Posts for 2/11/2008+ post's "Justin Etheredge Offers Preview of LINQ to [Amazon] SimpleDB" topic.

1 comment:

AT said...

Hi Roger,

Since your blog is focused on data services and .NET you may be interested in Simple Savant, an open-source C# interface to SimpleDB. Simple Savant incorporates ADO.NET-style parameterized selects, easy property to attribute mapping, type formatting to support lexicographical sorts and searches, and a number of other features.

Regards,
Ashley