Thursday, April 09, 2009

Uploading Entities for Storage in Azure Tables

One of the primary considerations when selecting a Platform as a Service (PaaS) cloud provider is the ease, speed, accuracy and reliability of uploading typical entities to persistent storage, as well as the cost of data ingress/egress and long-term storage.

Updated 4/8/2009 3:50 PM PDT: Performance and transaction topics.

Windows Azure, Amazon Web Service’s SimpleDB, and the Google App Engine provide low-cost, highly scalable storage in Entity-Attribute-Value (EAV) tables. To accommodate the widest range of platforms and programming languages for creating, querying and retrieving data from cloud storage, most cloud storage application programming interfaces (APIs) use Representational State Transfer (REST) methods with HTTP as the data transfer protocol and Atom Syndication Format (Atom), Atom Publication Protocol (APP, AtomPub) or a variation of these two Internet standards as the data format on the wire.

The following analysis is adapted from Chapter 7, “Optimizing the Scalability and Performance of Azure Tables,” of my forthcoming Cloud Computing with the Microsoft Azure Services Platform book for WROX.

ADO.NET Data Services HTTP Request and Response Messages in AtomPub Format

Azure Data Services use the HTTP POST method to insert entities in tables. Following is an example of the AtomPub POST Request message to add an entity from the Northwind sample database’s Orders table to an Azure OrderTable:

POST /OrderTable HTTP/1.1
User-Agent: Microsoft ADO.NET Data Services
x-ms-date: Wed, 08 Apr 2009 22:49:49 GMT
Authorization: SharedKeyLite oakleaf3:f3o42-----RedactedKey-----6qT6r9AiYnkRS80=
Accept: application/atom+xml,application/xml
Accept-Charset: UTF-8
DataServiceVersion: 1.0;NetFx
MaxDataServiceVersion: 1.0;NetFx
Content-Type: application/atom+xml
Host: oakleaf3.table.core.windows.net
Content-Length: 1383
Expect: 100-continue

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<entry xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <title />
  <updated>2009-04-08T22:49:49.8086948Z</updated>
  <author>
    <name />
  </author>
  <id />
  <content type="application/xml">
    <m:properties>
      <d:CustomerID>BSBEV</d:CustomerID>
      <d:EmployeeID m:type="Edm.Int32">4</d:EmployeeID>
      <d:Freight m:type="Edm.Double">2.17</d:Freight>
      <d:OrderDate m:type="Edm.DateTime">1998-03-11T00:00:00</d:OrderDate>
      <d:OrderID m:type="Edm.Int32">10943</d:OrderID>
      <d:PartitionKey>2147472704</d:PartitionKey>
      <d:RequiredDate m:type="Edm.DateTime">1998-04-08T00:00:00</d:RequiredDate>
      <d:RowKey m:null="false" />
      <d:ShipAddress>Fauntleroy Circus</d:ShipAddress>
      <d:ShipCity>London</d:ShipCity>
      <d:ShipCountry>UK</d:ShipCountry>
      <d:ShipName>B's Beverages</d:ShipName>
      <d:ShipPostalCode>EC2 5NT</d:ShipPostalCode>
      <d:ShipRegion m:null="true" />
      <d:ShipVia m:type="Edm.Int32">2</d:ShipVia>
      <d:ShippedDate m:type="Edm.DateTime">1998-03-19T00:00:00</d:ShippedDate>
      <d:Timestamp m:type="Edm.DateTime">0001-01-01T00:00:00</d:Timestamp>
    </m:properties>
  </content>
</entry>

Azure Tables support a simplified version of the ADO.NET Data Services (formerly the “Astoria Project”) API for create, retrieve, update, and delete (CRUD) operations by means of the Windows Azure SDK’s sample ClientServices class library. The create operation with the POST method returns a confirmation of the data inserted, such as that shown below:

HTTP/1.1 201 Created
Cache-Control: no-cache
Content-Type: application/atom+xml;charset=utf-8
ETag: W/"datetime'2009-04-08T22%3A48%3A37.9913381Z'"
Location: http://oakleaf3.table.core.windows.net/OrderTable(PartitionKey='2147472704',RowKey='')
Server: Table Service Version 1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 05b91086-1747-42d6-98ed-1ad6c85537bd
Date: Wed, 08 Apr 2009 22:48:37 GMT
Content-Length: 1774

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<entry xml:base="http://oakleaf3.table.core.windows.net/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" m:etag="W/&quot;datetime'2009-04-08T22%3A48%3A37.9913381Z'&quot;" xmlns="http://www.w3.org/2005/Atom">
  <id>http://oakleaf3.table.core.windows.net/OrderTable(PartitionKey='2147472704',RowKey='')</id>
  <title type="text"></title>
  <updated>2009-04-08T22:48:38Z</updated>
  <author>
    <name />
  </author>
  <link rel="edit" title="OrderTable" href="OrderTable(PartitionKey='2147472704',RowKey='')" />
  <category term="oakleaf3.OrderTable" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
  <content type="application/xml">
    <m:properties>
      <d:PartitionKey>2147472704</d:PartitionKey>
      <d:RowKey></d:RowKey>
      <d:Timestamp m:type="Edm.DateTime">2009-04-08T22:48:37.9913381Z</d:Timestamp>
      <d:CustomerID>BSBEV</d:CustomerID>
      <d:EmployeeID m:type="Edm.Int32">4</d:EmployeeID>
      <d:Freight m:type="Edm.Double">2.17</d:Freight>
      <d:OrderDate m:type="Edm.DateTime">1998-03-11T00:00:00</d:OrderDate>
      <d:OrderID m:type="Edm.Int32">10943</d:OrderID>
      <d:RequiredDate m:type="Edm.DateTime">1998-04-08T00:00:00</d:RequiredDate>
      <d:ShipAddress>Fauntleroy Circus</d:ShipAddress>
      <d:ShipCity>London</d:ShipCity>
      <d:ShipCountry>UK</d:ShipCountry>
      <d:ShipName>B's Beverages</d:ShipName>
      <d:ShipPostalCode>EC2 5NT</d:ShipPostalCode>
      <d:ShipVia m:type="Edm.Int32">2</d:ShipVia>
      <d:ShippedDate m:type="Edm.DateTime">1998-03-19T00:00:00</d:ShippedDate>
    </m:properties>
  </content>
</entry>

Message Bloat for Order Entities with the AtomPub Wire Format

Notice that the payload for the HTTP 201 Response message is 1.3 times the size of the Request message’s payload. The combined size of the two payloads is 3,157 plus 973 header bytes for a relational table record with a UTF-8 size of about 159 bytes. The update messages are (4130 – 159) * 100/4130 = 96.2% overhead, assuming that the return payload is discarded. In other words, the Request and Response messages are about 26 times the size of uncompressed relational data. The added overhead contributes significantly to increased data ingress charges.

Message Bloat for OrderDetail Entities with the AtomPub Wire Format

Following is an HTTP Request message for a typical OrderDetail entity:

POST /OrderDetailTable HTTP/1.1
User-Agent: Microsoft ADO.NET Data Services
x-ms-date: Thu, 09 Apr 2009 21:02:25 GMT
Authorization: SharedKeyLite oakleaf3:uG/e6awcKFfuy-----RedactedKey-----pp4k+j0msRX0=
Accept: application/atom+xml,application/xml
Accept-Charset: UTF-8
DataServiceVersion: 1.0;NetFx
MaxDataServiceVersion: 1.0;NetFx
Content-Type: application/atom+xml
Host: oakleaf3.table.core.windows.net
Content-Length: 886
Expect: 100-continue

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<entry xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <title />
  <updated>2009-04-09T21:02:25.7957487Z</updated>
  <author>
    <name />
  </author>
  <id />
  <content type="application/xml">
    <m:properties>
      <d:Discount m:type="Edm.Double">0.2</d:Discount>
      <d:OrderID m:type="Edm.Int32">11077</d:OrderID>
      <d:PartitionKey>2147472570</d:PartitionKey>
      <d:ProductID m:type="Edm.Int32">2</d:ProductID>
      <d:Quantity m:type="Edm.Int32">24</d:Quantity>
      <d:RowKey>00</d:RowKey>
      <d:Timestamp m:type="Edm.DateTime">0001-01-01T00:00:00</d:Timestamp>
      <d:UnitPrice m:type="Edm.Double">19</d:UnitPrice>
    </m:properties>
  </content>
</entry>

and the corresponding 201 Response message:

HTTP/1.1 201 Created
Cache-Control: no-cache
Content-Type: application/atom+xml;charset=utf-8
ETag: W/"datetime'2009-04-09T21%3A01%3A16.700813Z'"
Location: http://oakleaf3.table.core.windows.net/OrderDetailTable(PartitionKey='2147472570',RowKey='00')
Server: Table Service Version 1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: f2debcd6-1614-4380-8972-b76f54a73cd1
Date: Thu, 09 Apr 2009 21:01:16 GMT
Content-Length: 1347

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<entry xml:base="http://oakleaf3.table.core.windows.net/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" m:etag="W/&quot;datetime'2009-04-09T21%3A01%3A16.700813Z'&quot;" xmlns="http://www.w3.org/2005/Atom">
  <id>http://oakleaf3.table.core.windows.net/OrderDetailTable(PartitionKey='2147472570',RowKey='00')</id>
  <title type="text"></title>
  <updated>2009-04-09T21:01:16Z</updated>
  <author>
    <name />
  </author>
  <link rel="edit" title="OrderDetailTable" href="OrderDetailTable(PartitionKey='2147472570',RowKey='00')" />
  <category term="oakleaf3.OrderDetailTable" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
  <content type="application/xml">
    <m:properties>
      <d:PartitionKey>2147472570</d:PartitionKey>
      <d:RowKey>00</d:RowKey>
      <d:Timestamp m:type="Edm.DateTime">2009-04-09T21:01:16.700813Z</d:Timestamp>
      <d:Discount m:type="Edm.Double">0.2</d:Discount>
      <d:OrderID m:type="Edm.Int32">11077</d:OrderID>
      <d:ProductID m:type="Edm.Int32">2</d:ProductID>
      <d:Quantity m:type="Edm.Int32">24</d:Quantity>
      <d:UnitPrice m:type="Edm.Double">19</d:UnitPrice>
    </m:properties>
  </content>
</entry>

The two payloads total 2,233 bytes and headers total 886 bytes. The numeric values occupy 47 bytes, which leads to an overhead of (3119 –47)*100/3119 = 98.5% or a message to payload ratio of 66:1. Providing an industry-standard API results in substantially increased data ingress charges.

The Affect of PartitionKey and RowKey Values on Scalability

For this example, each OrderTable entity has a unique PartitionKey value and RowKey zero-length string value to achieve maximum scalability by allowing distribution of entities over a potentially large number of Azure nodes. Default PartitionKey values for this example are 10-digit padded string representations of the maximum int value less the OrderID value. This combination produces the equivalent of a descending index on the primary key, which enables use of the LINQ Take(n) function to return the most recent n orders.

Associated OrderDetailsTable entities carry the parent order’s PartitionKey key value plus a zero-based line item number as the RowKey value to produce a unique object identifier (the equivalent of a relational table’s primary key.) Using the same PartitionKey value for Order and OrderDetail entities assures that a parent and its child entities are stored on the same nodes. (Currently, the original data and two replicas would be stored on three nodes in different failure domains.)

After Microsoft releases geo-location features for Azure, you’ll be able to specify the data center in which to store entities that are coded geographically, such as by state, province or region. The first center to support Azure is NorthwestUS (Quincy, WA). SouthwestUS (San Antonio (Westover Hills), TX) is expected online with Azure in 2009; it opened in September 2008.

Measuring Data Upload Performance with ADO.NET Data Services and AtomPub

Following is a screen capture of the form for conducting timing tests on create operations for OrderTable entities, with and without corresponding OrderDetailTable entities (click the image below for a full-size capture):

Loading the most recent 100 Orders with their 232 Order Details entities requires about 67.75 seconds. The time shown in the preceding screen capture is that for retrieving a single order with four line items. The following table shows the number of seconds to upload or delete 100 Orders, without and with line items (Order Details)

  Upload (Alone) Delete(Alone) Upload (with Details) Delete (with Details)
HTTP (clear) 18.86 13.55 67.75 70.92
HTTPS (TLS) 20.80 13.57 69.18 74.40

What’s interesting about the above data is the longer time to delete than create Order entities with their OrderDetail entities. Client deletions execute a GET request for the last Order entity, which includes PrimaryKey and OrderID values. The client then issues a GET /OrderDetailTable()?$filter=PartitionKey%20eq%20'2147472570' HTTP/1.1 request to obtain associated OrderDetail entities, and follows with DELETE /OrderDetailTable(PartitionKey='2147472570',RowKey='0n') instructions for n = 0 to n OrderDetail entities. Even with the extra GET method calls, the total amount of data on the wire is considerably less for DELETE than POST operations. This implies that the data creation times are more dependent on Azure Table operations in the cloud than message transit times. 

Cross-Table Transactions for Azure Tables

The latest word on transactions for Azure Tables is from Microsoft’s Alex Gershaft in 3/24/2009 answer to Azure table storage and state management questions in the Windows Azure forum:

There is no current support for transactions, but it is something that is on our future feature list.

The question, of course, is “How far in the future?” I’m guessing quite a while because MSFT wants to maximize revenue from SDS surcharges for persistent storage with relational features.

There was discussion of using Azure Queues for Cross-Table Transactions in the early days of Azure tables (see Brad Calder’s answer of 10/29/2008 in the Transactions and Locking? Window Azure forum thread.) A couple of days later, Calder replied to the Design pattern for data structures thread:

Great point.   Yes, you can create a property called “Kind”/“Type” (or whatever you want) for all of the entities stored in your Test Table,  and then treat the entities within a partition as entities belonging to N different types/classes.  This can allow you to model within a partition basically several “mini-tables”, where you can think of each Kind/Type being a mini-table within the partition.    Then once we have Entity Group transactions exposed that will allow you to perform atomic transactions across all of the entities within the partition, which means you can perform atomic transactions across all of these Kind/Type entities (mini-tables) in that partition.

Not much has been heard about queue-based or Kind/Type-assisted transactions recently.

This post will be updated as additional tests under varying conditions are conducted.

8 comments:

KristoferA said...

Interesting. Help me understand a couple of details:

1) Is it one http roundtrip per record?

2) If so, is there any way to insert and/or update multiple records per roundtrip instead? e.g. insert an order and all its' order details?

3) If no bulking is supported, how can transactional integrity be maintained? Or is there no transactional integrity? That would mean another round of roundtrips to ensure that the order is complete and flag it as such...

It would be interesting to sprinkle a couple of tickcounts and timestamps in the tables record to determine how much of the time is spent in the http roundtrip and how much is spent on the database-side of things.

--rj said...

Hi, Kristofer,

I'll attempt to answer you questions in an update to the post today.

Cheers,

--rj

Daniel Root said...

I'm a little confused by SDS/Azure tables as it relates to the recent announcement that SDS will support more traditional SQL Server protocols (ala the 'knock my socks off post). My takeaway was that there would be a more straight-forward database-to-the-cloud migration, but it sounds like things are full-steam-ahead with atompub and EAV stuff. What am I missing here?

--rj said...

Daniel,

Azure tables and SDS both default to ADO.NET Data Services (Astoria) AtomPub format for communication with clients over the Interet.

Substituting SQL Server's native TDS (Tabular DataStream) format would require opening TCP port 1433(and perhaps 1434) in the firewall. Few network admins who had previous experience with the Slammer worm would be willin to do this.

Azure EAV tables are inherently more scalable than relational tables. There undoubtedly will be a surcharge to substitute SDS for EAV tables, but MSFT hasn't provided even an inkling about their Azure/SDS pricing.

--rj

AJ said...

Have you measured performance (or have any links to articles) on the query performance of Azure Tables. For example, retrieving 500 records at a time?

Thanks
PS. Very informative blog

Daniel Root said...

rj - I totally get why SDS, etc. use(d) REST, but am confused about where exactly the TDS support mentioned in your previous post applies: http://oakleafblog.blogspot.com/2009/03/sql-data-services-abandons-rest-for-tds.html

Why is this test app still REST if it's all TDS now?

Daniel Root said...

ah. nevermind - so TDS is an _option_, but not the only api?

--rj said...

Daniel,

TDS will be the preferred transport for operations between WebRoles and WorkerRoles and SDS within the data center because, as I said earlier, few (if any) network admins will open a firewall hole for port 1433.

I expect that most developers will use the REST API for I/O over the Internet.

--rj