OakLeaf Systems: Very Large Databases: Bricks, BitVault and BigTable

Thursday, April 06, 2006

Very Large Databases: Bricks, BitVault and BigTable

Microsoft Watch's Mary Jo Foley mentioned computing/data storage "bricks" in her 4/4/2006 "Microsoft Readies 'BitVault' Self-Healing Data Store" article.

This piece, which was based on a December 2005 Microsoft Research Technical Report (MSR-TR-2005-179), "BitVault: a Highly Reliable Distributed Data Retention Platform," and unnamed "sources close to the company," speculated that Microsoft "has moved BitVault into the product groups, intending to commercialize the technology" and "[t]he Clusters, File Systems and Storage team is now spearheading the BitVault project. ... Microsoft is hoping to be able to field a first-generation product based on BitVault within two years."

Update 4/2/2008: BitVault doesn't appear to have gotten off the ground "within two years" (almost to the day), but "smart bricks" have evolved into ISO intermodal containers as the unit of deployment of servers to data centers, according to Mary Jo's "Microsoft builds out its first containerized datacenter" post of April 2, 2008. [Links to early Fawcette magazine articles and figures are no longer operational.]

The Technical Report, authored by Zheng Zhang, Qiao Lian, Shiding Lin, Wei Chen, Yu Chen and Chao Jin of Microsoft Research Asia, describes a peer-to-peer (P2P) architecture for very large content-addressable databases that store seldom-updated reference data. The BitVault is constructed from one to tens of thousands of "smart brick" building blocks. The paper defines a "smart brick" as a commodity "trimmed down PC with large disk(s)" that enables BitVault to be self-managing, self-organizing, and self-healing.

Jeremy Reimer's Ars Technica post of April 5, 2006, "Microsoft leverages P2P technology to create BitVault," briefly explains how BitVault employs P2P and distributed hash table (DHT) technology. Structured P2P systems based on DHTs (P2P DHT) are a popular research topic. Click here for DHT links and here for P2P DHT links.
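To make the combination of content addressing and a DHT concrete, here is a minimal sketch of the general technique. It is not BitVault's actual design, which the sources above describe only at a high level. The SHA-1 hash, the brick names, the `BrickRing` class, and the choice of three replicas are all assumptions made for illustration.

```python
import bisect
import hashlib


def content_address(data: bytes) -> str:
    """Derive an object's key from its content (SHA-1 is an assumption)."""
    return hashlib.sha1(data).hexdigest()


class BrickRing:
    """Toy consistent-hash ring that maps content addresses to bricks."""

    def __init__(self, brick_names):
        # Each brick sits on the ring at the hash of its (hypothetical) name.
        self._ring = sorted((self._position(name), name) for name in brick_names)

    @staticmethod
    def _position(name: str) -> int:
        return int(hashlib.sha1(name.encode()).hexdigest(), 16)

    def owners(self, address: str, replicas: int = 3):
        """Return the first brick clockwise from the address, plus successors."""
        positions = [pos for pos, _ in self._ring]
        # The address is already a SHA-1 digest, so it is its own ring position.
        start = bisect.bisect_right(positions, int(address, 16))
        count = min(replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


names = [f"brick-{i:02d}" for i in range(8)]
ring = BrickRing(names)
key = content_address(b"quarterly report, final version")
print(key, ring.owners(key))
```

Because the key is computed from the content, identical objects always map to the same bricks, and adding or removing a brick elsewhere on the ring leaves an object's owners unchanged. That locality is what lets a system of this kind reorganize itself without a central directory.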

Related Papers from Microsoft Research Asia: "P2P Resource Pool and Its Application to Optimize Wide-Area Application Level Multicasting" (August 2004) describes the combination of P2P with DHT and a self-organized metadata overlay (SOMO) to create the illusion of a single, large, dynamic resource pool. "SOMO: Self-Organized Metadata Overlay for Resource Management in P2P DHT" (February 2003) illustrates implementing arbitrary data structures in a structured P2P DHT with SOMO for resource management. "XRing: Achieving High-Performance Routing Adaptively in Structured P2P" (MSR-TR-2004-93, September 2004) and "Z-Ring: Fast Prefix Routing via a Low Maintenance Membership Protocol" discuss optimization of P2P routing for resource pools. "WiDS: an Integrated Toolkit for Distributed System Development" (June 2005) describes the development and test environment for BitVault.

Evolution of Microsoft Proposals for [Smart] Bricks

I first encountered the "brick" concept when covering the Paul Flessner/Pat Helland keynote at Tech*Ed 2002. Here's Pat's description of a service center (SC):

[A service center] is a unit of deployment. This is a thing that I would go and put a collection of Web services and their databases and all of the things it takes to support the Web services on the Internet and put that into the service center so it can be self-managing as much as possible and implement and support the Web services.

Pat went on to describe "bricks" as SC building blocks:

Now, a service center is implemented using bricks. A brick is just a system. It's got lots of cheap memory, lots of cheap disk, cheap and fast CPU, lots of cheap but fast networking. These bricks are going to plug into the service center. So you're going to plug it in and it hooks up and it finds what subnet it's on. It says, "Oh, here I am," and then the service center says, "Oh, you're here. I know what I can do with you. You've got all that storage. I'm going to make some more mirrored copies of my data onto you. I'm going some computation to you." All the human did was go whoomp and shove the darn thing in.

Research for my brief "Microsoft Adopts Medieval Web Services Architecture" article turned up several early references by Microsoft Research's Jim Gray to "bricks" for implementing very large database (VLDB) systems. The first reference to storage bricks that I found was on slide 31, "Everyone scales out. What's the Brick?," in Jim Gray's September 2000 "Building Multi-Petabyte Online Databases" presentation to NASA Goddard. A previous talk at the UC San Diego Supercomputer Center in October 1999 mentioned "CyberBricks" but not in the context of storage bricks.

Gray's "Two Commodity Scaleable Servers: a Billion Transactions per Day and the Terra-Server" white paper and "Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS" technical report (December 1999) describe CyberBricks as the "fundamental building block" of large Web sites. His "Storage Bricks Have Arrived" presentation at the 2002 File and Storage Technologies (FAST) conference cemented storage bricks as the implementation of choice for scalable SQL Server databases.

Note: Jim Gray is well known for his contributions to VLDB research. Many of his 129 citations in the Database (DBLP) Bibliography Server target the design and performance of VLDBs. About half the citations (64 so far) have occurred during his tenure as a Microsoft Research Senior Researcher (from 1995) and Distinguished Engineer (from 2000).

About a year after the Flessner/Helland Tech*Ed 2002 and Gray FAST 2002 presentations, David Campbell, who was then Microsoft's product unit manager for the SQL Server Database Engine, delivered "Database of the Future: A Preview of Yukon and Other Technical Advancements" as a keynote address for VSLive! San Francisco 2003. My "Build Data Service Centers With Bricks" article for Fawcette Technical Publications' Windows Server System Magazine (May 2003 Tech*Ed issue) summarized Campbell's presentation.

Campbell outlined Microsoft's database architecture du jour as multiple data service centers (SCs) created from groups of autonomous computing cells (ACCs) built from commodity server bricks as shown here:

Campbell envisioned each brick as an independent, replaceable computing unit running an instance of SQL Server 2005 with built-in data storage. Plugging a new brick into an ACC causes the SC to allocate bricks to services dynamically. Management ACCs and SCs communicate with SOAP Web service or Service Broker messages, instead of conventional remote procedure call (RPC) protocols. As noted in my article, Campbell asserted that:

The key to resiliency and scalability is moving reference data, such as product catalogs and customer information, to ACCs that have databases partitioned out to bricks. Product catalog data and customer information is relatively nonvolatile, and the temporary outage of a single brick affects only a percentage (one divided by the total number of bricks) of site visitors. Adding a brick to mirror each partition solves potential outage problems.
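The arithmetic behind that claim can be sketched in a few lines. The CRC-32 hash, the function names, and the `(partition, replica)` bookkeeping are illustrative assumptions, not anything Campbell described. Only the one-divided-by-N outage figure and the one-mirror-per-partition idea come from the passage above.

```python
import zlib


def partition_for(key: str, brick_count: int) -> int:
    """Map a catalog or customer key to one of brick_count partitions."""
    return zlib.crc32(key.encode()) % brick_count


def affected_fraction(brick_count: int) -> float:
    """Share of keys unreachable when one unmirrored brick is down."""
    return 1 / brick_count


def read(key: str, brick_count: int, down: set):
    """Return the (partition, replica) that serves key.

    `down` holds unavailable (partition, replica) pairs; replica 0 is the
    primary brick for a partition and replica 1 is its mirror.
    """
    partition = partition_for(key, brick_count)
    for replica in (0, 1):
        if (partition, replica) not in down:
            return partition, replica
    raise RuntimeError(f"partition {partition} has no live copy")


print(affected_fraction(10))  # 0.1, i.e. one visitor in ten
p = partition_for("sku-1001", 10)
print(read("sku-1001", 10, down={(p, 0)}))  # served by the mirror, replica 1
```

With ten unmirrored bricks, losing one takes out a tenth of the keys. With a mirror per partition, a read fails only if both copies of the same partition are down at once.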

The article went on to describe how the SC/ACC might improve system resiliency and scalability:

Microsoft's example of a service center is a high-traffic e-commerce site in which relatively static reference data (product and customer information) and shopping-cart state is delivered by partitioned databases in individual autonomous computing cells (ACCs). The Order Processing System is a conventional, shared nothing database cluster, not an ACC. Brown text identifies the elements that David Campbell's original PowerPoint slide didn't include.

Mary Jo Foley also covered the keynote in her 2/13/2003 "Yukon, Ho!" Microsoft Watch article. According to Mary Jo, Campbell said "Microsoft is building Yukon around these [SC, ACC, and brick] concepts."

A later Microsoft Research Technical Report MSR-TR-2004-107 by Jim Gray, et al. (October 2004), "TerraServer Bricks – A High Availability Cluster Alternative," describes the replacement of an active/passive cluster connected to an 18 terabyte Storage Area Network (SAN) with a duplexed set of "white-box" PCs containing arrays of large, low-cost, Serial ATA (SATA) disks. Gray calls the replacement system "TerraServer Bricks." Web Bricks have two 2.4-GHz Xeon hyperthreaded CPUs, 2GB RAM and two 80-GB SATA drives, and have a hardware cost of $2,100. Storage Bricks, which have a hardware cost of $10,300 and run SQL Server 2000, have the same CPUs, 4 GB RAM and 16 250-GB SATA drives.
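For a rough sense of scale, the Storage Brick figures quoted above work out as follows. This is back-of-the-envelope arithmetic only: it uses raw decimal terabytes, ignores RAID and formatting overhead, and assumes that "duplexed" simply doubles the hardware for each stored terabyte. The function name is hypothetical.

```python
def cost_per_raw_tb(hardware_cost_usd: float, drives: int, drive_gb: int) -> float:
    """Hardware cost divided by raw capacity in decimal terabytes."""
    raw_tb = drives * drive_gb / 1000
    return hardware_cost_usd / raw_tb


storage_brick = cost_per_raw_tb(10_300, drives=16, drive_gb=250)
print(f"${storage_brick:,.0f} per raw TB")               # $2,575 per raw TB
print(f"${2 * storage_brick:,.0f} per raw TB, duplexed")  # $5,150 per raw TB, duplexed
```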

Note: Microsoft Research's "Empirical Measurements of Disk Failure Rates and Error Rates" (MSR-TR-2005-166, December 2005) study indicates that SATA "uncorrectable read errors are not yet a dominant system-fault source—they happen but are rare compared with other problems."

VLDB Implementation/Deployment Issues

Dare Obasanjo's "Greg Linden on SQL Databases and Internet-Scale Applications" post quotes Greg Linden:

What I want is a robust, high performance virtual relational database that runs transparently over a cluster, nodes dropping in an out of service at will, read-write replication and data migration all done automatically. I want to be able to install a database on a server cloud and use it like it was all running on one machine.

Adam Bosworth managed the initial development of Microsoft Access, founded CrossGain (together with Tod Nielsen, Access marketing honcho), sold CrossGain to BEA, where he became chief architect and senior vice president, and is now VP Engineering at Google. He lists these three features that database users want but database vendors don't supply: dynamic schema, dynamic partitioning of data across large dynamic numbers of machines, and modern [Googlesque] indexing. Adam wants the Open Source community to "[g]ive us systems that scale linearly, are flexible and dynamically reconfigurable and load balanced and easy to use." Adam does mean give, not sell.

It seems to me that BitVault's smart bricks, with appropriate deployment and management applications, would fulfill Greg's wishes and all of Adam's except the economic one, at least for reference data, which now constitutes more than half of the data stored by North American firms.

It's been more than six years since Jim Gray proposed CyberBricks and storage bricks for VLDBs but, to the best of my knowledge, only Microsoft Research is scaling out a production database (TerraServer) with bricks today. Microsoft still hasn't solved one of the major economic issues of storage bricks: licensing costs of Windows Server 2003 SP-1 R2 for each CPU compared with a no-cost Unix or Linux distribution. If Microsoft intends to productize BitVault successfully, it seems to me that a stripped-down, lower-cost version of Windows Storage Server 2003 R2 is needed.

Grid data storage is a competing architecture that's gaining popularity in Europe but is just getting off the ground in North America. The EU DataGrid project's "objective is to build the next generation computing infrastructure providing intensive computation and analysis of shared large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities." The EU DataGrid project was merged into the EU EGEE (Enabling Grids for E-sciencE) in 2004. EGEE is now in its second development phase. Oracle and IBM both promote grid architecture for their databases.

BitVault's P2P architecture seems to me to be a viable challenger to grid data storage for read-mostly (archival) databases. I haven't seen an official grid-computing or grid-storage story from Microsoft.

Note: The Microsoft Research eScience Workshop 2005 (October 2005) presented sessions by "early adopters using Microsoft Windows, Microsoft .NET, Microsoft SQL Server, and Web services" for scientific computing. Many of the sessions dealt with .NET and grid computing/data storage.

Really Big Smart Bricks and Databases

The ultimate brick is a Google data center in a shipping container, as postulated by PBS's Robert X. Cringely (Mark Stephens) in his 11/17/2005 column, "Google-Mart: Sam Walton Taught Google More About How to Dominate the Internet Than Microsoft Ever Did." According to Stephens, whose bona fides were questioned several years ago:

We're talking about 5,000 Opteron processors and 3.5 petabytes of disk storage that can be dropped-off overnight by a tractor trailer rig [...] at Internet peering points. ... With the advent of widespread GoogleBase (again a bit-schlepping app that can be used in a thousand ways -- most of them not even envisioned by Google) there's suddenly a new kind of marketplace for data with everything a transaction in the most literal sense as Google takes over the role of trusted third-party info-escrow agent for all world business.

As Cringely mentioned in his subsequent (11/24/2005) column, The Internet Archive proposed a similar but considerably smaller implementation called the "petabox," which contained 800 low-cost PCs running Linux and 1 PB of data storage in a configuration that could be deployed in an 8- by 8- by 20-foot shipping container. Capricorn Technologies' commercial TB-80 PetaBox implementation stores 80 TB in a 19-inch rack, so 12 TB-80s (960 TB) combine to create an actual petabox. In mid-2004, the Internet Archive had an 80-TB rack running in San Francisco and a 100-TB rack operational in Amsterdam.

Clearly, large distributed file systems and databases must run on very big smart bricks or hardware of similar scale. Google probably runs the world's largest distributed file system. Google's "The Google File System" paper by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (October 2003) describes the then-largest GFS cluster as having more than 1,000 nodes and 300+ TB of storage. A GFS cluster has a single master and multiple chunkservers, and runs on commodity Linux boxes. A 64-bit GUID identifies each fixed-size (64-MB) file chunk. According to the paper, "For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace." The master contains system metadata and manages system-wide operations.
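The chunk bookkeeping described in the paper can be illustrated with a toy model. The 64-MB chunk size, 64-bit identifiers, three-replica default, and metadata-only master come from the description above. The `ToyMaster` class, the chunkserver names, the sequential handles, and the round-robin placement are assumptions made for illustration and are not how GFS actually places replicas.

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64-MB chunks, as described in the paper
DEFAULT_REPLICAS = 3           # the paper's default replication level


class ToyMaster:
    """Metadata-only stand-in for a GFS master (all names are hypothetical)."""

    def __init__(self, chunkservers):
        self.chunkservers = list(chunkservers)
        self.files = {}      # path -> ordered list of chunk handles
        self.locations = {}  # chunk handle -> chunkservers holding a replica
        self._handles = itertools.count(1)
        self._cursor = 0     # round-robin placement is an assumption

    def create(self, path, size_bytes, replicas=DEFAULT_REPLICAS):
        if replicas > len(self.chunkservers):
            raise ValueError("not enough chunkservers for the requested replicas")
        chunk_total = -(-size_bytes // CHUNK_SIZE)  # ceiling division
        handles = []
        for _ in range(chunk_total):
            handle = next(self._handles) & 0xFFFFFFFFFFFFFFFF  # keep to 64 bits
            self.locations[handle] = [
                self.chunkservers[(self._cursor + i) % len(self.chunkservers)]
                for i in range(replicas)
            ]
            self._cursor += 1
            handles.append(handle)
        self.files[path] = handles
        return handles

    def lookup(self, path, offset):
        """Translate a byte offset into (chunk handle, replica locations)."""
        handle = self.files[path][offset // CHUNK_SIZE]
        return handle, self.locations[handle]


master = ToyMaster(["cs-a", "cs-b", "cs-c", "cs-d"])
handles = master.create("/crawl/segment-0001", 200 * 1024 * 1024)
print(len(handles))  # 4 chunks for a 200-MB file
print(master.lookup("/crawl/segment-0001", 130 * 1024 * 1024))
```

The master in this sketch never touches file contents. A client would ask it which chunkservers hold the chunk covering a given offset and then read from one of them directly, which is what keeps a single master from becoming a data-path bottleneck.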

According to an abstract from an October 2005 presentation by Google's Jeff Dean at the University of Washington, "BigTable is a system for storing and managing very large amounts of structured data. The system is designed to manage several petabytes of data distributed across thousands of machines, with very high update and read request rates coming from thousands of simultaneous clients." A brief summary of Jeff's talk by Andrew Hitchcock is available here. (The Google Operating System Site is a good source of information on the latest Google Web apps.)

BigTable runs on top of the GFS. It's not clear if Google Base is a BigTable app but, in my opinion, the dates of Jeff's presentation and the start of the Google Base beta program are too close to be coincidental.

Don't Expect BitVault to RTM in the Near Future

On April 6, 2006, Paul Flessner presented a broad-brush overview of SQL Server's future to an audience of database users and computer press reporters in San Francisco. The primary item of (limited) interest was the rechristening of SQL Server Mobile Edition as SQL Server Everywhere Edition, which Microsoft intends to RTM by the end of 2006. Mary Jo Foley reported in her "Microsoft Outlines (Vaguely) Its Database Road Ahead" article of the same date that Flessner "did acknowledge that Microsoft is looking at how to support 'content-addressable storage,' and that it had a project named BitVault that falls into this category. But he declined to say more."

Let's hope that BitVault and smart bricks get more attention from Microsoft executives and program/project management than first-generation data bricks received.
