OakLeaf Systems: Resource Links for SQL Azure Federations and Sharding Topics

Tuesday, February 08, 2011

Resource Links for SQL Azure Federations and Sharding Topics

This post originated in the “SQL Azure Database and Reporting” section of my Windows Azure and Cloud Computing Posts for 1/15/2011+ post. I’ve been adding to the links on an almost daily basis, so I decided to create a free-standing article that I’ll update independently. The earlier post will no longer be updated.

•• Table of Contents:

SQL Azure Federations
Sharding Proxies for RDBMSs: dbShards and ScaleBase
Hibernate Shards
NHibernate Shards
Generic Sharding and NoSQL Articles

•••• Updated 2/7/2011: Added link to and description of Herve Roggero’s open-source Enzo SQL Shard, an SQL Server and SQL Azure Shard Library, to the “SQL Azure Federations” section. Enzo SQL Shared uses the Task Parallel Library to improve performance. Also updated Cihan Biyikoglu’s picture and bio from the updated version (2/4/2011) in his blog .

Further, from Microsoft’s Brian Aurich on 2/7/2011: Announcement: Windows Azure Storage & SQL Azure forums are being split:

Due to increased discussion volume on the topics of both Windows Azure Storage and SQL Azure we are splitting this forum out.

SQL Azure and other relational storage related discussions are being moved to the SQL Azure forum.

This forum is being repurposed for Windows Azure Storage and other non-relational storage related discussions.

Existing SQL Azure posts will be migrated to the SQL Azure forum over the next couple weeks.

Brian is the Windows Azure Community PM.

••• Updated 2/2/2011: Added post by Shaun Ziyan Xu about his Parallel-Oriented Data Access (PODA) C# library, which emulates some SQL Azure Federation features, to the “SQL Azure Federations” section and John Rayner’s “Sharding to the Clouds” post of 11/10/2010 to the “NHibernate Shards” section.

•• Update 1/28/2011: Added “Sharding Proxies for RDBMSs: dbShards and ScaleBase” section below.

• Update 1/19/2011: Added SQL Azure Federations: Robust Connectivity Model for Federated Data, a 1/18/2010 post by Cihan Biyikoglu.

SQL Azure Federations

Cihan Biyikoglu published a list of blog posts and white papers about SQL Azure Federations on 12/16/2011 in an answer to an SQL Azure Federation thread of the Windows Azure Platform - SQL Azure, Windows Azure Storage & Data forum:

We have talked about SQL Azure Federations technology at PDC and PASS this year and since then, we pushed out a number of posts and papers out. To help bring all these together, here is the list;

Introduction to App design with Sharding and Horizontal Partitioning Technique [Title: “Sharding with SQL Azure”]: This whitepaper summarizes the technique and highlights the new application modeling best practices for best scalability and economics.

Evaluation of Scale-up vs Scale-out: how to evaluate the scalability options?

Intro to SQL Azure Federations: SQL Azure Federations Overview

Perfect scenarios and typical applications that highlight the power of SQL Azure Federations technology

How to scale out an app with SQL Azure Federations – quick walk through of building and app with SQL Azure Federations.

We will announce any news about partner programs to get a preview of the technology. Please stay tuned. In the meantime, if you have any questions about the technology, I am happy to help. You can reach me through the forums or through the blog http://blogs.msdn.com/b/cbiyikoglu/.

Cihan is a Program Manager in SQL Azure covering the programmability surface and the engine components for SQL Azure. Currently, Cihan is focusing on enabling SQL Azure scale-out database solutions through SQL Azure Federations feature.

•••• The link to the SQL Azure Federation thread will be updated when it moves to the new SQL Azure forum.

Here are a few additional articles related to SQL Azure Federations and sharding (in reverse date order):

• SQL Azure Federations: Robust Connectivity Model for Federated Data 1/18/2010 post by Cihan Biyikoglu
Know Your Data 12/27/2010 blog post by Wayne Berry
Scale Out Your SQL Azure Database 12/24/2010 blog post by Wayne Berry
Sharding With SQL Azure 12/23/2010 blog post related to the below by Steve Yi
Sharding with SQL Azure 12/14/2010 detailed TechNet wiki article by Michael Heydt (first item of Cihan’s links list)
SQL Azure Federation: Is it "NoSQL" Azure too? 11/23/2010 Redmond Diary article by Andrew Brust re Lev Novik’s PDC 2010 session (see below)
SQL Azure Backup and Restore Strategy 11/18/2010 TechNet wiki article by Wayne Berry
SQL PASS 2010, Nov 8th to 12th in Seattle 11/5/2010 post by Cihan Biyikoglu describing his two PASS presentations about SQL Azure
Building Scalable Database Solution with SQL Azure - Introducing Federation in SQL Azure, 10/29/2010 blog post by Cihan Biyikoglu
Building Scale-Out Database Solutions with SQL Azure, video of 10/29/2010 PDC 2010 presentation by Lev Novik
Windows Azure: Cost Architecting for Windows Azure 10/2010 TechNet Magazine article by Maarten Balliauw
The Real Cost of Indexes 8/19/2010 blog post by Wayne Berry
SQL Azure Horizontal Partitioning: Part 2 6/24/2010 blog post by Wayne Berry
Vertical Partitioning in SQL Azure: Part 1 5/17/2010 blog post by Wayne Berry
Connections and SQL Azure 5/11/2010 blog post by David Robinson
Uniqueidentifier and Clustered Indexes 5/5/2010 blog post by David Robinson
Exporting Data from SQL Azure: Import/Export Wizard 5/4/2010 blog post by Wayne Berry

•••• Herve Roggero (@hroggero) calls his open-source Enzo SQL Shard an “SQL Server and SQL Azure Shard Library” and offers C# source-code, as well as Windows Installer packages. From the CodePlex home page:

This library provides a Shard technology that allows you to spread the load of database queries over multiple databases easily (SQL Server and SQL Azure). Uses the Task Parallel Library (TPL) and caching for high performance.

For more information about our company, visit http://www.bluesyntax.net.

Details

This library allows you to perform all the usual tasks in database management: create, read, update and delete records. Except that your code can execute against two or more databases seamlessly. Using the usual SqlCommand object, spread the load of your commands to multiple databases to improve performance and scalability. This library uses a Horizontal Partition Shard, which requires your tables to be partitioned horizontally. The download comes with a sample application that shows you how to use the shard library and gives you execution time in milliseconds of your commands.

Here is sample code that allows you to fetch records from the USERS table across multiple databases. As you can see, extension methods on the SqlCommand object makes this program easy to read; new Shard methods have been added to SqlCommand that allow you to execute reads, writes and insert operations in parallel.
  SqlCommand cmd = new SqlCommand();
  // Set the parallel option
  PYN.EnzoAzureLib.Shard.UseParallel = true;
  PYN.EnzoAzureLib.Shard.UseCache = true;
  cmd.CommandText = "SELECT * FROM USERS"; 
  DataTable data = cmd.ExecuteShardQuery(); 
  dataGridView2.DataSource = data;
A sample application is provided in this project allowing you to see the shard definition. Each database making up the shard is shown in this screen. The databases are loaded from a configuration file, or can be added manually through the user interface.

Caching
This library allows you to selectively cache your records so you do not have to perform additional roundtrips to the database. The caching library used is from the Enterprise Library, which is now included in .NET 4.0.
Task Parallel Library (TPL)
And for additional performance, the library uses the TPL, which is also part of .NET 4.0. You can turn this feature on or off and see the impact on performance when you query your Shard using parallel threads.
Round Robin Inserts
And if you need to insert many records in the shard, you can use the ExecuteParallelRoundRobinLoad method, which takes a List of command objects and spreads their execution in a round-robin manner to all the underlying databases in the shard.
Performance

The shard library can improve performance drastically depending on the type of data and number of records you have. For example the screenshot below shows that the application is returning 2 records in 2.6 seconds (each record contains a PDF file of a few megabytes each). The second image shows the same two records, however each record is stored in a separate database. Since fetching records from a shard uses multithreading, the same two records took 1.4 seconds to be fetched. You can see the database GUID being added dynamically from the shard library showing you which database each record came from.

Requirements

This library runs only with the .NET 4.0 framework.

Herve is a SQL Azure MVP.

••• Shaun Ziyan Xu posted a link to the source code for a Parallel-Oriented Data Access (PODA) C# library to Google Code in his Happy Chinese New Year! post of 2/1/2011:

If you have heard about the new feature for SQL Azure named SQL Azure Federation, you might know that it’s a cool feature and solution about database sharding. But for now there seems no similar solution for normal SQL Server and local database. I had created a library named PODA, which stands for Partition Oriented Data Access which partially implemented the features of SQL Azure Federation.

I’m going to explain more about this project after the Chinese New Year but you can download the source code here.

Downloads from Google Code don’t support IE 8 and require Mozilla Firefox or Chrome. [Parochial, no?].

Here’s the contents of PODA’s Readme.txt file:

PODA - Parallel Orient Data Access

.. Designed and implemented by Shaun Ziyan Xu (jfarrio[at]gmail.com)
.. Owned by chayu.org (http://www.chayu.org/)

STEP 1: Prepare the application configuration file for PODA. You can refer the sample.app.config in the release package.

STEP 2: Initlize the PODA at the begining of your application started. Use Poda.Factory.Config().

STEP 3: Create the PODA instance by using Poda.Factory.Create() and wrapped in the 'using' block.

Do not forget to invoke poda.Commit() to save the changes if you insert or update the records.

If you need to retireve the records in reference table please use ReferenceOn(tableName) method.

If you need to insert or update the records in reference table please use FederationOnAll() method.

I’ll update this post when Shaun recovers from celebrating Chinese New Year (also big in Oakland) and delivers the promised explanation.

See also Understanding Federated Database Servers: SQL Server 2008 R2 in the MSDN Library. SQL Server federation uses the Enterprise edition’s Distributed Partitioned Views. Related topics include:

•• Sharding Proxies for RDBMSs: dbShards and ScaleBase

Code Futures’s dbShards and ScaleBase are proxies that make multiple servers running MySQL look line a single RDBMS:

dbShards update 1/25/2011 blog post to DBMS2, Curt Monash interview with Cory Isaacson of Code Futures
ScaleBase, another MPP OLTP quasi-DBMS 1/25/2011 Curt Monash blog post to DBMS2 quoting Liran Zelkha
I’m collecting data points on NoSQL and HVSP adoption 8/18/2010 post by Curt Monash, includes status of NoSQL stores
dbShards — a lot like an MPP OLTP DBMS based on MySQL or PostgreSQL 7/28/2010 post by Curt Monash
DbShards Part Deux - The Internals 7/13/2010 from High Scalability
Product: dbShards - Share Nothing. Shard Everything. 6/23/2010 from High Scalability

dbShards also supports PostgreSQL.

Hibernate Shards

Andrew Glover wrote Java development 2.0: Sharding with Hibernate Shards, Horizontal scalability for relational databases, on 8/31/2010 for IBM’s DeveloperWorks site. From the abstract:

Sharding isn't for everyone, but it's one way that relational systems can meet the demands of big data. For some shops, sharding means being able to keep a trusted RDBMS in place without sacrificing data scalability or system performance. In this installment of the Java development 2.0 series, find out when sharding works, and when it doesn't, and then get your hands busy sharding a simple application capable of handling terabytes of data.

Following are the Hibernate Shard resources for Java that Andy listed in his article:

Java development 2.0: This developerWorks series explores technologies and tools that are redefining the Java development landscape, including NoSQL (May 2010), SimpleDB (June 2010), and CouchDB (November 2009).
"Sharding with Max Ross: Hibernate Shards" (JavaWorld, July 2008): Andy Glover talks with the founder of Hibernate Shards about the motivations behind the project and how to leverage shards effectively.
"Think twice before sharding" (Andrew Glover, thediscoblog.com, June 2008): Sharding doesn't come without risks, so find out more about them before you commit.
"Rethinking the traditional DAO pattern" (Andrew Glover, thediscoblog.com, June 2008): Learn more about folding DAOs into normal domain objects, a la Grails and Rails.

NHibernate Shards

Ayende Rahien issued SQL Azure, Sharding and NHibernate: A call for volunteers, which has many interesting comments about SQL Azure and sharding, on 9/6/2009. His NHibernate Shards: Progress Report appeared on 10/18/2009. Stay tuned for more resources related to NHibernate Shards.

John Rayner demonstrated NHibernate Shards and provided source code in his Sharding into the cloud post of 11/10/2010 to the # Fellows blog:

NHibernate.Shards is an extension to the well-known ORM which allows a logical database to be partitioned across multiple physical databases and servers. It's a port of the Hibernate.Shards project, as with lots of thing in NHibernate. I thought it would be interesting to see how well it worked against SQL Azure. It turned out to be not interesting at all ... just plain easy!

Step 1: Register with SQL Azure
Turnaround on token requests is pretty quick right now (<24 hours in my case).

Step 2: Setup some SQL Azure databases

Step 3: Setup appropriate logins, users
The SQL Azure team have done a great job to allow SQL Server Management Studio to connect a query window to an Azure database, but I'm a bit SQL-phobic at the best of times. This was the most challenging bit for me!

Step 4: Download and compile NHibernate.Shards from NHContrib Subversion

Step 5: Set your connection strings in the config file

Step 6: Press play. Really, that's all there is to it!

Now you may notice that I neglected to create any schema in the Azure databases - that's because NHibernate can do that for me. Did I mention that I'm a bit SQL-phobic? [;)]

The code I was using was the standard example that comes with NHibernate.Shards, which records WeatherReport objects, which I've attached. It's the same example that Ayende dissected, so you can also pick up his discussion of hards-progress-report.aspx" mce_href="http://ayende.com/Blog/archive/2009/10/18/nhibernate-shards-progress-report.aspx">the workings of NHibernate.Shards. The code looks like this (click to enlarge):

And the results are as follows (click to enlarge):

Some of the features of NHibernate.Shards that really stood out for me:

It can query all shards in parallel or sequentially. For SQL Azure, that's quite useful! A sequential query my single-record shards took 601ms, whereas a parallelized query took 411ms (almost 33% less).

New records can be allocated to shards based on either the data (e.g. surname starts with A-M or N-Z) or some other scheme (e.g. round-robin).

If the correct shard can be identified based on an object's identity, then only that single shard is queried to retrieve the entity (this is based on your own IShardResolutionStrategy implementation).

If you sort by a property, then this sort will be applied even when data is merged from multiple shards.

Overall though, it all just works tremendously well. Congratulations really must go to:

The Microsoft SQL Azure team

Dario Quintana, for his work on NHibernate.Shards

Fabio, Ayende and the rest of the NHibernate committers

EDIT: Querying data from the shards is done using code like the following. You should notice that this code makes no references to the shards, and in fact is "normal" NHibernate code. The sharding is all handled transparently in the ORM.

The source code is available from Shard sample.zip.

Generic Sharding and NoSQL Articles

NoSQL: Wikipedia article last modified 1/20/2011
Shard (database architecture): Wikipedia article last modified 1/3/2011
Troubles with Sharding - What can we learn from the Foursquare Incident? (Todd Hoff, HighScalability.com, 10/15/2010): Describes the incident with MongoDB shard problems, A Trouble Topology for Shards; Possible Responses to Troubled Shards, and Related Articles.
VoltDB Decapitates Six SQL Urban Myths and Delivers Internet Scale OLTP in the Process: A detailed review of VoltDB by Todd Hoff
The Naming of the Foo: HVSP (High-Volume Simple Processing) acronym to replace NoSQL by Curt Monash, 3/13/2010
"An Unorthodox Approach to Database Design : The Coming of the Shard" (Todd Hoff, HighScalability.com, 8/6/2009): A nice overview of sharding from a site that specializes in architectures for managing big data. Contains updates with other links listed here.
"Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes" (Dare Obasanjo, 25HoursADay.com, January 2009): Uses Facebook as a real-world example to explore sharding strategies
Sharding for Startups (Eric Ries, 1/4/2009): Describes strategies for selecting a sharding scheme before you need it.
Shard – A Database Design (Rahul Roy, 7/28/2008) an early, brief sharding article.
Alex Popescu’s myNoSQL blog provides concise descriptions of new products and happenings in the NoSQL community on a daily basis.
You’ll often find current NoSql articles in the OakLeaf Systems Blog’s “Other Cloud Computing Platforms and Services” section.