Amazon Web Services’ Elastic Computing Cloud (EC2) with Windows Server 2003 offers a choice of the following database management systems for cloud-based data-intensive Web applications:
- SQL Server 2005 Express Edition (default, $0.125/instance-hour, no surcharge)
- SQL Server 2005 Standard Edition (optional, US$0.60/instance-hour surcharge*)
- Amazon SimpleDB (optional, US$0.14/processor-hour) with optional S3 blob storage
* SQL Server 2005 Standard Edition requires a Standard Large instance type, which carries a surcharge of US$0.375/hour over the default Standard Small instance. (The surcharge is US$0.50/hour for Authentication Services.) Full EC2 with Windows pricing details are available here.
Most Windows developers probably will prefer use of a relational database to SimpleDB’s Entity-Attribute-Value (EAV) approach that’s limited to the string data type. SQL Server 2005 Express is limited to 1 CPU, 4 GB RAM and 4 GB database size, so it’s best suited to use with the default US$0.125/hour instance: 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, and 32-bit platform. EC2 instance type specifications are available here.
Scaling up from SQL Server 2005 Express probably would involve migrating to at least a Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, and 64-bit platform. In this configuration, users would choose Elastic Block Store for data storage at a cost of $0.10 per allocated GB per month plus $0.10 per 1 million I/O requests to the volume.
The US$0.60/instance-hour surcharge results in a average cost of 30 * 24 * $0.60 = $432/instance-month or $5,184/instance-year for SQL Server surcharges. This substantial incremental cost for SQL Server 2005 Standard Edition warrants investigation of whether SimpleDB could substitute for SQL Server and provide a cost saving. Data transfer between SimpleDB and other Amazon Web Services is free, but storage is $0.25/GB-month (down from $1.50/GB-month previously.) See Amazon SimpleDB for more details. S3 blob storage is $0.15/GB-month with no charge for data transfer to and from EC2.
Amazon’s Simple Monthly Calculator lets you analyze costs if you know the amount of data and in/out traffic with EC2.
The SQL Server Express Test Harness
Before deciding whether to consider substituting SimpleDB for SQL Server, it’s a good idea to compare its performance with SimpleDB for tasks similar to those to be used in production. The following test harness uses a LINQ to SQL data source, and a GridView bound to a LinqDataSource with built-in server-based paging:
An Elastic Block Store folder holds the NORTHWND.MDF and NORTHWND.LDF files. The Customers table is a copy of the original Northwind Customers table without the relationship to the Orders table, which would prevent deleting rows. Insert with the FormView control is enabled for these tests but disabled when the page is publicly accessible.
Following is the average time in seconds to perform five sets of a paged SELECT of 12 rows, iterative COUNT, single INSERT, as well as DELETE, INSERT and UPDATE operations on 91 Customers records five times (except single INSERT):
The multiple executions result in a substantial reduction of execution time due to caching (estimated to be about 50%.) An iterative (rather than an aggregate) count emulates the SimpleDB count method, because SimpleDB doesn’t provide aggregate values.
Both test harnesses time only the duration of server operations and don’t include HTTP transport time to and from the browser.
All services are beta versions. Therefore, the timing comparisons might not reflect results when providers release the services.
The SimpleDB Test Harness
The SimpleDB test harness uses the Amazon SimpleDB C# library which supports token-based paging with the GridView bound to an ObjectDataSource control for SELECT and INSERT operations. This harness’s data source architecture most closely resembles Azure Table Services because it uses a cloud-based Entity-Attribute-Value SimpleDB table and, optionally, S3 blob storage that’s not linked directly to entities.
Average time in seconds and SimpleDB usage cost to perform five sets of a paged SELECT of 12 rows, iterative COUNT, and single INSERT, as well as multiple DELETE, INSERT and UPDATE operations on 91 Customers records:
|SDB Box Usage||$0.00016||$0.00028||$0.00011||$0.00184||$0.00186||$0.00231|
Box usage cost is based on the current AWS price of $0.14/processor-hour. Colin Percival’s Dissecting SimpleDB BoxUsage post of 6/25/2008 provides a detailed analysis of the vicissitudes of BoxUsage calculations. Travis Reeder’s How Much Would it Cost to run Twitter on Amazon's SimpleDB? post of 7/1/2008 calculates query costs but omits Web front-end instance and bandwidth charges.
Comparison of Execution Speed of EC2 with SQL Express or SimpleDB and Azure Table Services
Following is a comparison of EC2 execution times with those of the original OakLeaf Systems Azure Table Services Sample Project that’s described at Azure Storage Services Test Harness: Table Services 1 – Introduction and Overview, et seq.
|EC2 with SQL Express||0.005||0.002||0.019||0.380||0.112||0.374|
|EC2 with SimpleDB||0.048||0.094||0.048||10.094||22.142||10.855|
|Azure Table Services||0.215||0.188||0.155||5.820||5.310||6.561|
I was surprised to find Azure Table Services’ execution to be substantially slower than EC2 with SimpleDB for single-page SELECTS, iterative counts, and single INSERT operations.
With the exception of the Insert operation, you can verify the Azure Table Services timing data because the harness is available whenever Azure Data Services is up. EC2 services will be made available on request for brief periods (contact roger_jennings[at]compuserve[dot]com).
I plan to create a similar test harness for SQL Data Services to complete the “OakLeaf Test Harness Quartet” (with apologies to Lawrence Durrell.) The initial version will allow comparison of Plain Old XML (POX) and REST wire formats (see SQL Data Services (SDS) Test Harness Updated to the Windows Azure Services Platform of 11/1/2008). When the ADO.NET Data Services AtomPub front end for SQL Data Services is available, it will be added as another test option.
Performance Issues with EAV Databases and RESTful Wire Formats
Update 12/14/2008: An anonymous commenter observes:
A major piece of work [by] OakLeaf … has confirmed my suspicions about how inappropriate all the cloud name/value entity store technologies are for serious SaaS developers.
The Google AppEngine Datastore, Amazon’s SimpleDB and Windows Azure have chronic performance problems relative to conventional database throughput. Ultimately the inherent inefficiencies of these storage options will hit hourly cloud renters in the pocket.
It’s probably a bit early to apply the term “chronic” to Azure Table Services (ATS), which has only been available in limited CTP form for less than two months, but SQL [Server] Data Services has been available to a limited number of testers since early March 2008. Although the S[S]DS team has promised to add schemas and other relational features so as to distinguish its deliverable from ATS, SDS remains an EAV database with free-form “flex[ible] properties.” Although Microsoft has produced much qualitative hype about the elasticity, scalability, reliability, and availability of SDS and ATS, quantitative data for these services is almost impossible to obtain. In particular, the SDS team has offered no information on what performance penalties developers should expect in return for the other perceived benefits of cloud-based databases.
EAV Databases for Clinical Trials
There appears to be only a small set of papers that compare create-retrieve-update-delete (CRUD) performance of EAV and relational databases with similar data. Most published comparisons are for medical databases that store clinical data, such as Yale’s ACT/DB for managing clinical trials data. Following are a few references to EAV performance:
Dynamic Tables: An Architecture for Managing Evolving, Heterogeneous Biomedical Data in Relational Database Management Systems (Journal of the American Medical Informatics Association, 2007) proposes a sparse, column-based storage, which the authors call dynamic tables. This paper is useful because it provides graphical performance comparisons of relational, EAV/CR, and dynamic table queries returning 1 to 5 attributes, with and without indexes (Figures 6, 7, 9 and 10).
Exploring Performance Issues for a Clinical Database Organized Using an Entity-Attribute-Value Representation (Journal of the American Medical Informatics Association, 2000) examines the EAV representation with classes and relationships (EAV/CR) model and concludes:
Although attribute-centered queries were less efficient in the EAV/CR model, these inefficiencies may be addressable, at least in part, by the use of more powerful hardware or more memory, or both.
Throwing more resources at a database management systems doesn’t reduce inefficiency of the data model, but it might make query execution time tolerable.
Generic Design of Web-Based Clinical Databases (Journal of Medical Internet Research, 2003) discusses the performance of EAV databases for clinical research use:
From a performance point of view, the strength of the EAV design lies in effective entity-centered queries since no joins are necessary to retrieve all facts about entities (eg, patients or medical events) as would be the case in a conventional design with facts spread over hundreds of tables. The drawback lies in inefficient attribute-centered queries, since a (self) join is necessary for each attribute that is requested.
Performance of EAV tables may not be an issue for small databases, but for large clinical repositories with hundreds of concurrent users, query time may be a critical factor. Also, the need for complex attribute-centered data retrieval differs greatly between applications. An electronic patient-record system, for example, is usually aimed at displaying patient-centered (ie, entity-centered) facts, while a research database usually must have some means of aggregating data across a large number of patients. In the latter, however, query efficiency may not be a problem, since data summaries are retrieved only intermittently and may be stored on separate hardware.
Data Extraction and Ad Hoc Query of an Entity—Attribute—Value Database (Journal of the Medical Informatics Association, 1998) describes the difficulty of creating and executing attribute-based queries:
Complex Boolean queries of a conventional table (for example, “identify all patients where sex is female AND age >=35 AND city = `New York”') can be created easily by database neophytes through graphic front ends such as GQL and Microsoft's MS Query. The EAV equivalent is significantly more complicated, because each attribute-value pair for a patient is stored as a separate row in a table. When performing a similar complex Boolean query on a single EAV table, the conceptual AND, OR, and NOT operations must be translated into the row-based operations of set union, set intersection, and set difference, respectively.
The preceding papers indicate that EAV isn’t likely to come close to matching the performance of popular RDBMS any day soon.
SimpleDB Performance Analyses
Comparative performance data for SimpleDB and RDBMSs isn’t easy to come by either. Todd Hoff writes in his The Search for the Source of Data - How SimpleDB Differs from a RDBMS post of 4/22/2008:
Alex [Tolley of MyMeemz.com] tested retrieving 10 record ids from 3 different database sizes. Using a 1K record database it took an average of 141 msecs to retrieve the 10 record ids. For a 100K record database it took 266 msecs on average. For a 1000K record database it took an average of 433 msecs to retrieve the 10 record ids. It's not fast, but it is relatively consistent. [Italic emphasis added.]
This observation might lead to paraphrases like “Relative consistency is the hobgoblin of performant queries.”
Todd’s earlier The Current Pros and Cons List for SimpleDB of 12/15/2007 makes interesting reading, although Amazon has corrected some “cons.”
If you know of any other performance data, especially comparisons, please leave a comment.
SDS Performance Hit from REST vs. POX Wire Format
I’ve bothered several members of the Astoria team, including Pablo Castro, as well as the S[S]DS folks about the performance hit I see when comparing execution speed of my SDS Test Harness (see SQL Data Services (SDS) Test Harness Updated to the Windows Azure Services Platform of 11/1/2008). Uploading Northwind data to my SDS instance takes more than twice as long with the current REST protocol than Plain Old XML (POX), but I haven’t received any reaction to my tests.
I expect moving to an AtomPub front end for SDS and an ADO.NET Data Services-style client layer will increase the REST performance differential further. Despite the major move by Microsoft to RESTful AtomPub data APIs, I’ve heard no discussion of the effect of AtomPub wire overhead on CRUD operations, especially for queries that return associated entities.