Tuesday, May 06, 2008

Comparing Google App Engine, Amazon SimpleDB and Microsoft SQL Server Data Services

Updates: 5/6/2008: Added parts 8.1 and 8.2 of Google App Engine Terms and Conditions and I became a member of the GAE beta. 4/22/2008: Corrected SSDS SOAP protocol security to optional SSL. 4/19/2008: Added sections on Service Level Agreements (SLAs) and on-premises installation. 4/18/2008: Added missing "Latency" sections to API descriptions, other minor updates.

Introduction

The entry of the Google App Engine into the "Data Stores in the Cloud" arena on April 7, 2008 increases the number of high-profile Storage as a Service (StaaS?) players to three. Here's are brief highlights of the three entrants' features in the order of their arrival as beta versions:

Amazon SimpleDB

SimpleDB is a non-relational data store (tuplespace) running under Erlang that offers only a string data type. SimpleDB offers a REST-style protocol for CRUD (create, retrieve, update, and delete) operations on entities with an arcane syntax that uses the HTTP GET request with an Action parameter for CUD operations, which is controversial. A SOAP API that supports the WS-Security specification is available also.

My Amazon Announces Beta of SimpleDB Web Services in the Cloud post of December 14, 2007 (last updated 2/14/2008) has more SimpleDB details.

Amazon Web Services Blog

Microsoft SSDS

SQL Server Data Services (SSDS) is a non-relational store built on SQL Server 2008 and its sparse columns feature that offers string, decimal, dateTime, base64Binary, and boolean data types. SSDS offers a REST-style API for CRUD operations with REST- and SOAP-based APIs. The REST protocol supports HTTP GET, POST, PUT, and DELETE methods. SOAP requests are limited to the basicHttp protocol (i.e., no support for the WS-* specifications.) SOAP requests use a LINQ-like query syntax. A future version will support the Microsoft Synchronization Framework.

My SQL Server Data Services to Deliver Entities from the Cloud post of March 7, 2008 describes the initial SSDS beta version announced at MIX 08.

Microsoft SQL Server Data Services Blog

Google App Engine

Google App Engine (GAE) includes a non-relational data store built on Bigtable and the Google File System that offers str(ing), unicode, bool, int, long, float, datetime, list, db.Key, db.Blog, db.Text, db.Category, and several other GData data types. GAE is unique in its implementation of Python 2.5.2 for creating custom Web applications and support for the Django templating library. GAE supports HTTP GET, POST, PUT, DELETE, OPTIONS, and HEAD methods, but has no REST or SOAP API for CRUD operations.

Update 5/6/2008: GAE currently does not support SSL for application domains and requires uses to license their content (see the "Data Security" section at the end of this post.) Thus GAE isn't likely to be used to store or deliver confidential business information.

My Rumor: Google to Compete with SimpleDB and SSDS with Bigtable as a Web Service post of April 8, 2008 (last updated 4/10/2008) covers the early days of GAE's entry into the "Datastore in the Cloud" market.

Google App Engine Blog

The three services were in limited beta as of the date of this post. I have access to SSDS and GAE as a beta participant. I also use the local GAE emulator with Python 2.5.2 and the GAE SDK.

Data Models

All three services use the Entity-Attribute-Value (EAV) model, which Wikipedia describes as:

[A] data model that is used in circumstances where the number of attributes (properties, parameters) that can be used to describe a thing (an "entity" or "object") is potentially very vast, but the number that will actually apply to a given entity is relatively modest.

and claim to be schemaless. None of the services support JOINs between entities and GAE is the only service with a Reference data type to emulate many:one relationships.

Amazon SimpleDB

Here's SimpleDB's hierarchical data model:

  • Domains correspond to relational tables or entity collections and are a container for a collection of Items; beta users are limited to a maximum of 100 Domains. The maximum size of a domain is 100 MB.
  • Items are collections of Attribute-Value pairs and correspond to rows of a sparse table or an entity instance. The first column (ItemName, an entity key), which may have any Attribute name, uniquely identifies the Item. Queries are limited to returning a maximum of 1,000 items.
  • Attributes are properties of Items and correspond to sparse columns; Attribute names must be unique within an item. Items may have as few as one (ItemName) or as many as 256 Attributes.
  • Values are property values; attributes support multiple values; attribute values are limited to 1,024 UTF-8 characters. All values are indexed as they are added.

Primitive Data Type: UTF-8 string

Comments: SimpleDB's 256-Attribute and 1,024-character value-length limit limits are surprising for an EAV database. The lack of data types other than UTF-8 strings makes dealing with numerical values very difficult; numbers must be left padded with zeros to support values greater than nine, and negative numbers require an offset which clients must translate to real values. Values can point to blobs stored on Amazon Web Service's Simple Storage Service (S3).

Note: Because attribute support multiple values, items can be considered tuples of arbitrary arity; therefore, SimpleDB is better characterized as a tuplespace than a database. Ralf Westphal has written the NSimpleDB emulator for .NET that you can use in a manner similar to the local Google App Engine emulator/SDK. The emulator uses the commercial VistaDB database as its data store.

Microsoft SSDS

Here's SSDS's three-level containment model:

  • Authorities are unique domain names (e.g., oakleaf1.data.beta.mssds.com) that correspond to a database or namespace and have a unique AuthorityID (oakleaf1 for this example.) Authorities are the unit of billing for service usage and hold security tokens. Authorities have a predefined read-only Version value.
  • Containers are collection of Entities (Entity set), have a unique ContainerID, and correspond to a SQL Server 2008 sparse table. The maximum size of a container is 2 GB; larger data sets can be partitioned into multiple containers. There's no limit to the number of containers in an Authority or number of Entities in a container; size is the only constraint. Containers have a predefined read-only Version value.
  • Entities are collections of attribute-value pairs and correspond to a row of a container. A unique Id column identifies an individual Entity, which has a maximum size of 2 MB. (2 MB is the maximum size of the request/response message size.) Id, optional Kind (entity type) and read-only Version (timestamp) columns constitute the predefined metadata attributes. Blobs are stored in base64Binary attributes. Additional data types might be supported in later beta versions.
  • Flexible Properties are arbitrary attribute-value pairs assigned to one or more Entities and stored in sparse columns. There is no limit to the number of flexible properties you can add within the 2 MB size limit of the largest Entity instance. The SQL Server Data Services FAQ states that "Microsoft expects to add "binary large object columns, full text search, and richer data types" features over time.

Primitive Data Types: string, decimal, dateTime, boolean and base64Binary

Note: Future versions will adopt the Entity Framework's Entity Data Model (EDM), which is the preferred data provider for ADO.NET Data Services (a.k.a. Project Astoria). Different teams developed Astoria and SSDS concurrently but their data models diverged. Francois Ajenstadt, director of project management for SQL Server, said in early March 2008, "The goal is to bring the APIs between SSDS and Astoria closer together closer to Release," which implies use of the EDM by the currently scheduled release date (first half of 2009.)

Update 4/18/2008: Pablo Castro, Astoria data architect, posted Astoria Online Service --> SQL Server Data Services, which describes Microsoft's unification plans for Astoria and SSDS, on April 14, 2008.

Comments: The 2 GB size limit on Containers is surprising when you consider that much of the use of SSDS will be for storing slowly-changing catalog or archive information. Compounding the problem is that SSDS doesn't support cross-container queries, which requires applications to emulate UNION operations on the client (. Users must craft their queries carefully because the LINQ query syntax doesn't support the Take() and Skip() methods. Attributes of all data types except binary are indexed on creation.

Google App Engine

The Google Datastore operates with this data model:

  • Model classes represent the metadata for each Entity Group, which is a container for Entities of the Model's kind. The Model defines attributes by subclasses of the Property class.
  • Entity Groups represent collections of Entities of a particular Kind (entity type) that are named for the model, correspond to rows of a sparse table, and are capable of being updated in a transaction.
  • Entities are the basic unit of data storage, defined by a model, and identified by an Entity Key, which is made up of a string key_name or numeric ID, Kind, and path through ancestors (parents) to the Datastore root. The constructor for the Entity will assign a numeric ID if a key_name isn't specified and can assign another Entity as the new entity's parent. (Numeric IDs usually are—but aren't guaranteed to be—assigned in numeric order.) Entity Key values are immutable.

Primitive Data Types: str(ing), unicode, bool, int, long, float, datetime, list (of supported data types)

Additional Types (supported by the google.appengine.ext.db module): db.Key, db.Blob, db.Text, db.Category, db.Link, db.Email, db.GeoPt, db.IM, db.PhoneNumber, db.PostalAddress, db.Rating, plus users.User for a user with a Google account. Most of the additional types have GData counterparts (see Types and Property Classes).

Note: db.Key is the value type for the ReferenceProperty and SelfReferenceProperty classes that emulate many:one associations with db.Key and one:many associations with list(db.Key).

Comments: GAE's Datastore API has a remarkably rich set of data types. Attributes having primitive data types (except list) are indexed when initially queried. The preview release limits each developer to three applications. Usage fees occur after an application exceeds 500MB of persistent storage or bandwidth or CPU resources corresponding to about 5 million monthly page views. See the "Limitations" topic of "APIs:Google App Engine" for details.

APIs

Amazon SimpleDB

Query and update semantics: HTTP GET query strings with Action methods CreateDomain, DeleteDomain, ListDomains, PutAttributes, DeleteAttributes, GetAttributes, and Query.   Queries support these operators: =, !=, <, > <=, >=, STARTS-WITH,  AND, OR, NOT, INTERSECTION AND UNION. Examples of simple queries are [‘Title’ = ‘The Right Stuff’], [‘Year’ > ‘1985’], and [‘Rating’ starts-with ‘****’]. See the SimpleDB WSDL file for additional details. There are strong objections by members of the REST community to the use of the HTTP GET method to modify data.

Security: SSL is optional.

Authentication: REST authentication uses HMAC-SHA1 Signatures tokens generated from strings of parameter names and values encoded with the user's Amazon Web Services (AWS) secret key. WS-Security authentication for SOAP uses an X.509 certificate provided by AWS or a certification authority (CA). SOAP 1.1 without WS-Security uses the REST-style tokens in the SOAP header.

Service-Level Agreements: Not specified to date. Amazon S3 has a 99.9% service level guarantee, with payments of 10% of amount due for billing cycle in which the service level was below 99.9% and above 99% and 25% for a service level below 99.9%. AWS suffered a major-scale outage in February 2008.

Latency: Latency results from replication of data across multiple nodes, which leads to "eventual consistency." Charles Ying says the following about SimpleDB's latency:

Eventual Consistency - Data is not immediately propagated across all nodes… the latency is usually around a second, but for high data sets or loads, you may experience more latency. On the plus side, your data isn’t lost!

Werner Vogels, Amazon's CTO, analyzes the issue in his Eventually Consistent post of December 19, 2007.

On-Premises Implementation: The NSimpleDB emulator for .NET uses VistaDB, which isn't a highly scalable database.

Limitations: Query execution time currently is limited to 5 seconds.

Microsoft SSDS

Query and update semantics: Client library translates string-based LINQ queries to the ACE (Authority-Container-Entity) model for HTTP GET, POST, PUT and DELETE methods. SSDS supports a small subset of the LINQ Standard Query Operators, but not Take() and Skip(). Updates use XML templates with POST, PUT and DELETE methods.

Security: SSL is mandatory for the REST protocol and optional for SOAP. The SOAP interface uses the basicHttp protocol, defaults to clear text, and doesn't support WS-Security or other WS-* specifications. Enabling SSL requires a minor change to the default App.config or Web.config basicHttpBinding settings.

Authentication: Basic (user name and password encrypted by SSL) enables full CRUD permissions for the Authority owner. The initial beta version doesn't support other security principles and roles or anonymous access. (Anonymous access probably isn't practical for a metered service.)

Service-Level Agreement: Not specified to date. Data backups are stored in each cluster; geo-redundant data copies provide disaster recovery.

Response format: Plain old XML (POX). JSON and Atom Publishing Protocol are scheduled for future versions.

Latency: Microsoft claims SSDS's data replication methodology has no latency and delivers immediate consistency for data within a Container, which is the unit of consistency.

On-Premises Implementation: From SQL Server Data Services FAQ:

Microsoft will also offer a library that will enable SSDS for on-premise SQL Server databases. This will enable businesses with the use of both on-premise data platform and services-based SSDS capabilities.

Limitations: Responses are limited to the lesser of a 500-record page or 2 MB. Missing Take() and Skip() operators limit paging flexibility.

Google App Engine

Query semantics: Google Query Language (GQL) is a tiny subset of ANSI SQL that offers SELECT, FROM, WHERE, ORDER BY, LIMIT, OFFSET keywords, as well as the AND (but not OR) operator. WHERE clause conditions support <, <=, =, >=, > and ANCESTOR IS (entity or key) operators. LIMIT and OFFSET enable customized data paging.

Security: Updated: According to Google's Marzia in an April 10, 2008 message to the Google App Engine group, "Please keep in mind when designing any system that Google App Engine does not currently provide SSL support for your domain." There are indications in another thread that the premium (paid) version will support SSL.

SSL is optional. The documentation's "Sandbox Security Controls" topic states, "Other computers can only connect to the application by making HTTP (or HTTPS) requests on the standard ports" appears to be incorrect. (The documents don't indicate whether applications can require SSL.) SSL is supported for external URL requests with any certificate (including self-signed certificates) by the URL Fetch API but the proxy does not authenticate the host.

Authentication: The Users API enables login with the user name and password of a Google account; OpenID can be integrated with the Datastore to store user names and nonces. Developers can implement custom authentication schemes using a pure Python crypto library.

Service-Level Agreements: Not specified to date. ZDNet blogger Phil Wainwright quoted Google App Engine product manager Pete Koomen on April 11:

“A couple of the restrictions mean we’re not suitable for the business market,” he told me, citing as examples the lack of an SLA and the ceilings on usage that result in a denial of service when exceeding the limits since there’s no charging mechanism at present. “We’re much more suitable for the consumer marketplace during the preview release.”

Bigtable uses redundancy for data security and is capable of maintaining multiple data copies in a time axis.

Latency: Google's The Bigtable: A Distributed Storage System for Structured Data paper mentions latency several times but doesn't indicate its magnitude. Gabe Wachob says in a Google App Engine: Its the Architecture Stupid! post of April 13, 2008:

With App Engine, ... you'll learn to accept a certain fixed (i.e. invariant with respect to scale) latency of accessing BigTable ... in exchange for never having to have to worry about any added latency in handling 100,000 (or a million) concurrent users.

On-Premises Implementation: The local Google App Engine SDK emulates the online version but doesn't offer a scalable Datastore component.

Limitations: Lack of SSL support nullifies Google App Engine as a commercial Web data store until Google confirms that the premium version will add SSL. Requests and responses are limited to 1 MB or 1,000 entities and the response must return within a "few seconds." (According to "The Sandbox" topic, "A web request to an application must be handled in a single process within a few seconds. Processes that take a very long time to respond are terminated to avoid overloading the web server.")

Following are specific daily limits for maintaining a free service (from Google App Engine for developers):

Quota Type Limit per Day
HTTP requests 650,000
Bandwidth in 9.77 GB
Bandwidth out 9.77 GB
CPU megacycles 200 million
E-mail messages 2,000
Datastore calls 2.5 million
External URL requests 160,000

Data Security: Google requires that you grant them the following license for content you store on their servers:

8.1. Google claims no ownership or control over any Content or Application. You retain copyright and any other rights you already hold in the Content and/or Application, and you are responsible for protecting those rights, as appropriate. By submitting, posting or displaying the Content on or through the Service you give Google a worldwide, royalty-free, and non-exclusive license to reproduce, adapt, modify, translate, publish, publicly perform, publicly display and distribute such Content for the sole purpose of enabling Google to provide you with the Service in accordance with its privacy policy. Furthermore, by creating an Application through use of the Service, you give Google a worldwide, royalty-free, and non-exclusive license to reproduce, adapt, modify, translate, publish, publicly perform, publicly display and distribute such Application for the sole purpose of enabling Google to provide you with the Service in accordance with its privacy policy.

8.2. You agree that Google, in its sole discretion, may use your trade names, trademarks, service marks, logos, domain names and other distinctive brand features in presentations, marketing materials, customer lists, financial reports and Web site listings (including links to your website) for the purpose of advertising or publicizing your use of the Service.

At present, GAE appears useful only for storing content that's shared with the general public and would not be suitable for storing or serving confidential business information. (This section added 5/6/2008.)

This blog is a work in progress. Comments, corrections, and suggestions are welcome.

 

5 comments:

Anonymous said...

Another "Data Store in the Cloud" for your list is the Talis Platform. The interface is RESTful HTTP, and two kinds of hosted stores are used - "Contentbox" (for any kind of representation) and "Metabox" (for RDF). Various services are available on top of the stores: SPARQL, augmentation etc. Let me know if you want an account to play with (danny.ayers @ the company domain).

Filip said...

Actually, for Google App Engine, SSL is not available.

See: http://groups.google.com/group/google-appengine/browse_thread/thread/ab85b3be332b2ee4

Filip said...

Google has not yet made any announcements on whether and when it will be available. It is likely to be available by the time Google App Engine is going out of beta, but other Google's applications often appear to be in eternal beta. This product may be different, but for now, all we can say is that SSL is not available. So either you send all of your data in cleartext, or you do your own encryption from scratch.

Lou Franco said...

I read "for the sole purpose of enabling Google to provide you with the Service in accordance with its privacy policy" as limiting the license you give them to your data to just what they need to publish the data in the way you specify.

I am not a lawyer, but the plain meaning of the words certainly limits the license.

Anonymous said...

Can we directly interact with Bigtable to store & retrieve data direct from javascript or ajax code.
or is there any client library available to directly interact with BigTable.
There should not be any existence of GAE. Directly 1 HTML page for sending & retrieving data....???