Tuesday, February 08, 2011

Resource Links for SQL Azure Federations and Sharding Topics

This post originated in the “SQL Azure Database and Reporting” section of my Windows Azure and Cloud Computing Posts for 1/15/2011+ post. I’ve been adding to the links on an almost daily basis, so I decided to create a free-standing article that I’ll update independently. The earlier post will no longer be updated.

Table of Contents:

•• Updated 2/7/2011: Added link to and description of Herve Roggero’s open-source Enzo SQL Shard, an SQL Server and SQL Azure Shard Library, to the “SQL Azure Federations” section. Enzo SQL Shared uses the Task Parallel Library to improve performance. Also updated Cihan Biyikoglu’s picture and bio from the updated version (2/4/2011) in his blog .

Further, from Microsoft’s Brian Aurich on 2/7/2011: Announcement: Windows Azure Storage & SQL Azure forums are being split:

image Due to increased discussion volume on the topics of both Windows Azure Storage and SQL Azure we are splitting this forum out.

  • SQL Azure and other relational storage related discussions are being moved to the SQL Azure forum.
  • This forum is being repurposed for Windows Azure Storage and other non-relational storage related discussions. 

Existing SQL Azure posts will be migrated to the SQL Azure forum over the next couple weeks.

Brian is the Windows Azure Community PM.

••• Updated 2/2/2011: Added post by Shaun Ziyan Xu about his Parallel-Oriented Data Access (PODA) C# library, which emulates some SQL Azure Federation features, to the “SQL Azure Federations” section and John Rayner’s “Sharding to the Clouds” post of 11/10/2010 to the “NHibernate Shards” section.

Update 1/28/2011: Added “Sharding Proxies for RDBMSs: dbShards and ScaleBase” section below.

• Update 1/19/2011: Added SQL Azure Federations: Robust Connectivity Model for Federated Data, a 1/18/2010 post by Cihan Biyikoglu.

SQL Azure Federations

Cihan Biyikoglu published a list of blog posts and white papers about SQL Azure Federations on 12/16/2011 in an answer to an SQL Azure Federation thread of the Windows Azure Platform - SQL Azure, Windows Azure Storage & Data forum:

image We have talked about SQL Azure Federations technology at PDC and PASS this year and since then, we pushed out a number of posts and papers out. To help bring all these together, here is the list;

imageWe will announce any news about partner programs to get a preview of the technology. Please stay tuned. In the meantime, if you have any questions about the technology, I am happy to help. You can reach me through the forums or through the blog http://blogs.msdn.com/b/cbiyikoglu/.

Cihan is a Program Manager in SQL Azure covering the programmability surface and the engine components for SQL Azure. Currently, Cihan is focusing on enabling SQL Azure scale-out database solutions through SQL Azure Federations feature.

•• The link to the SQL Azure Federation thread will be updated when it moves to the new SQL Azure forum.


Here are a few additional articles related to SQL Azure Federations and sharding (in reverse date order):


•• Herve Roggero (@hroggero) calls his open-source Enzo SQL Shard an “SQL Server and SQL Azure Shard Library” and offers C# source-code, as well as Windows Installer packages. From the CodePlex home page:

image This library provides a Shard technology that allows you to spread the load of database queries over multiple databases easily (SQL Server and SQL Azure). Uses the Task Parallel Library (TPL) and caching for high performance.

For more information about our company, visit http://www.bluesyntax.net.

Details

imageThis library allows you to perform all the usual tasks in database management: create, read, update and delete records. Except that your code can execute against two or more databases seamlessly. Using the usual SqlCommand object, spread the load of your commands to multiple databases to improve performance and scalability. This library uses a Horizontal Partition Shard, which requires your tables to be partitioned horizontally. The download comes with a sample application that shows you how to use the shard library and gives you execution time in milliseconds of your commands.

Here is sample code that allows you to fetch records from the USERS table across multiple databases. As you can see, extension methods on the SqlCommand object makes this program easy to read; new Shard methods have been added to SqlCommand that allow you to execute reads, writes and insert operations in parallel.

  SqlCommand cmd = new SqlCommand();
  // Set the parallel option
  PYN.EnzoAzureLib.Shard.UseParallel = true;
  PYN.EnzoAzureLib.Shard.UseCache = true;
  cmd.CommandText = "SELECT * FROM USERS"; 
  DataTable data = cmd.ExecuteShardQuery(); 
  dataGridView2.DataSource = data;

A sample application is provided in this project allowing you to see the shard definition. Each database making up the shard is shown in this screen. The databases are loaded from a configuration file, or can be added manually through the user interface.

TestShardDef.jpg

Caching
This library allows you to selectively cache your records so you do not have to perform additional roundtrips to the database. The caching library used is from the Enterprise Library, which is now included in .NET 4.0.
Task Parallel Library (TPL)
And for additional performance, the library uses the TPL, which is also part of .NET 4.0. You can turn this feature on or off and see the impact on performance when you query your Shard using parallel threads.
Round Robin Inserts
And if you need to insert many records in the shard, you can use the ExecuteParallelRoundRobinLoad method, which takes a List of command objects and spreads their execution in a round-robin manner to all the underlying databases in the shard.
Performance

The shard library can improve performance drastically depending on the type of data and number of records you have. For example the screenshot below shows that the application is returning 2 records in 2.6 seconds (each record contains a PDF file of a few megabytes each). The second image shows the same two records, however each record is stored in a separate database. Since fetching records from a shard uses multithreading, the same two records took 1.4 seconds to be fetched. You can see the database GUID being added dynamically from the shard library showing you which database each record came from.

TestAppBefore.jpg

TestAppAfter.jpg

Requirements

This library runs only with the .NET 4.0 framework.

Herve is a SQL Azure MVP.


••• Shaun Ziyan Xu posted a link to the source code for a Parallel-Oriented Data Access (PODA) C# library to Google Code in his Happy Chinese New Year! post of 2/1/2011:

image If you have heard about the new feature for SQL Azure named SQL Azure Federation, you might know that it’s a cool feature and solution about database sharding. But for now there seems no similar solution for normal SQL Server and local database. I had created a library named PODA, which stands for Partition Oriented Data Access which partially implemented the features of SQL Azure Federation.

imageI’m going to explain more about this project after the Chinese New Year but you can download the source code here.

Downloads from Google Code don’t support IE 8 and require Mozilla Firefox or Chrome. [Parochial, no?].

Here’s the contents of PODA’s Readme.txt file:

PODA - Parallel Orient Data Access


.. Designed and implemented by Shaun Ziyan Xu (jfarrio[at]gmail.com)
.. Owned by chayu.org (http://www.chayu.org/)


 

 

STEP 1: Prepare the application configuration file for PODA. You can refer the sample.app.config in the release package.


 

 


STEP 2: Initlize the PODA at the begining of your application started. Use Poda.Factory.Config().


 



 

 

 

STEP 3: Create the PODA instance by using Poda.Factory.Create() and wrapped in the 'using' block.


 


Do not forget to invoke poda.Commit() to save the changes if you insert or update the records.


If you need to retireve the records in reference table please use ReferenceOn(tableName) method.


 


If you need to insert or update the records in reference table please use FederationOnAll() method.

I’ll update this post when Shaun recovers from celebrating Chinese New Year (also big in Oakland) and delivers the promised explanation.


imageSee also Understanding Federated Database Servers: SQL Server 2008 R2 in the MSDN Library. SQL Server federation uses the Enterprise edition’s Distributed Partitioned Views. Related topics include:


Sharding Proxies for RDBMSs: dbShards and ScaleBase

Code Futures’s dbShards and ScaleBase are proxies that make multiple servers running MySQL look line a single RDBMS:

dbShards also supports PostgreSQL.


Hibernate Shards

Andrew Glover wrote Java development 2.0: Sharding with Hibernate Shards, Horizontal scalability for relational databases, on 8/31/2010 for IBM’s DeveloperWorks site. From the abstract:

Sharding isn't for everyone, but it's one way that relational systems can meet the demands of big data. For some shops, sharding means being able to keep a trusted RDBMS in place without sacrificing data scalability or system performance. In this installment of the Java development 2.0 series, find out when sharding works, and when it doesn't, and then get your hands busy sharding a simple application capable of handling terabytes of data.

Following are the Hibernate Shard resources for Java that Andy listed in his article:

  • Java development 2.0: This developerWorks series explores technologies and tools that are redefining the Java development landscape, including NoSQL (May 2010), SimpleDB (June 2010), and CouchDB (November 2009).
  • "Sharding with Max Ross: Hibernate Shards" (JavaWorld, July 2008): Andy Glover talks with the founder of Hibernate Shards about the motivations behind the project and how to leverage shards effectively.
  • "Think twice before sharding" (Andrew Glover, thediscoblog.com, June 2008): Sharding doesn't come without risks, so find out more about them before you commit.
  • "Rethinking the traditional DAO pattern" (Andrew Glover, thediscoblog.com, June 2008): Learn more about folding DAOs into normal domain objects, a la Grails and Rails.

NHibernate Shards

Ayende Rahien issued SQL Azure, Sharding and NHibernate: A call for volunteers, which has many interesting comments about SQL Azure and sharding, on 9/6/2009. His NHibernate Shards: Progress Report appeared on 10/18/2009. Stay tuned for more resources related to NHibernate Shards.

John Rayner demonstrated NHibernate Shards and provided source code in his Sharding into the cloud post of 11/10/2010 to the # Fellows blog:

NHibernate.Shards is an extension to the well-known ORM which allows a logical database to be partitioned across multiple physical databases and servers.  It's a port of the Hibernate.Shards project, as with lots of thing in NHibernate.  I thought it would be interesting to see how well it worked against SQL Azure.  It turned out to be not interesting at all ... just plain easy!

Step 1: Register with SQL Azure
Turnaround on token requests is pretty quick right now (<24 hours in my case).

Step 2: Setup some SQL Azure databases

Step 3: Setup appropriate logins, users
The SQL Azure team have done a great job to allow SQL Server Management Studio to connect a query window to an Azure database, but I'm a bit SQL-phobic at the best of times. This was the most challenging bit for me!

Step 4: Download and compile NHibernate.Shards from NHContrib Subversion

Step 5: Set your connection strings in the config file

Step 6: Press play.  Really, that's all there is to it!

Now you may notice that I neglected to create any schema in the Azure databases - that's because NHibernate can do that for me.  Did I mention that I'm a bit SQL-phobic?  [;)]

The code I was using was the standard example that comes with NHibernate.Shards, which records WeatherReport objects, which I've attached.  It's the same example that Ayende dissected, so you can also pick up his discussion of hards-progress-report.aspx" mce_href="http://ayende.com/Blog/archive/2009/10/18/nhibernate-shards-progress-report.aspx">the workings of NHibernate.Shards.  The code looks like this (click to enlarge):

sql-azure-shard-code

And the results are as follows (click to enlarge):

sql-azure-shards

Some of the features of NHibernate.Shards that really stood out for me:

  • It can query all shards in parallel or sequentially.  For SQL Azure, that's quite useful!  A sequential query my single-record shards took 601ms, whereas a parallelized query took 411ms (almost 33% less).
  • New records can be allocated to shards based on either the data (e.g. surname starts with A-M or N-Z) or some other scheme (e.g. round-robin).
  • If the correct shard can be identified based on an object's identity, then only that single shard is queried to retrieve the entity (this is based on your own IShardResolutionStrategy implementation).
  • If you sort by a property, then this sort will be applied even when data is merged from multiple shards.

Overall though, it all just works tremendously well.  Congratulations really must go to:

  • The Microsoft SQL Azure team
  • Dario Quintana, for his work on NHibernate.Shards
  • Fabio, Ayende and the rest of the NHibernate committers

EDIT: Querying data from the shards is done using code like the following.  You should notice that this code makes no references to the shards, and in fact is "normal" NHibernate code.  The sharding is all handled transparently in the ORM.

sql-azure-shard-code2

The source code is available from Shard sample.zip.


Generic Sharding and NoSQL Articles


0 comments: