|A compendium of Windows Azure, SQL Azure Database, AppFabric, Windows Azure Platform Appliance and other cloud-computing articles.|
Note: This post is updated daily or more frequently, depending on the availability of new articles in the following sections:
- Azure Blob, Drive, Table and Queue Services
- SQL Azure Database and Reporting
- Marketplace DataMarket and OData
- Windows Azure AppFabric: Apps, Access Control, WIF and Service Bus
- Windows Azure VM Role, Virtual Network, Connect, RDP and CDN
- Live Windows Azure Apps, APIs, Tools and Test Harnesses
- Visual Studio LightSwitch and Entity Framework v4+
- Windows Azure Infrastructure and DevOps
- Windows Azure Platform Appliance (WAPA), Hyper-V and Private/Hybrid Clouds
- Cloud Security and Governance
- Cloud Computing Events
- Other Cloud Computing Platforms and Services
Jerry Huang described Bridging On-Premise and Cloud Storage in an 8/11/2011 post to his Gladinet blog:
Yesterday, the Gladinet Cloud Storage Access Suite (Gladinet Cloud Desktop, Gladinet Cloud Backup, Gladinet CloudAFS) registered over a million downloads. It is an important milestone, validating the market for bridging on-premises storage and cloud storage.
There are several observations we have made from the growing Gladinet user base.
1. Cloud Storage needs Local Access Point
Cloud storage lives on the Internet, but people need easy, practical access to it from their local IT infrastructure. In general, people call this the gateway or the on-ramp to cloud storage. The cloud gateway can be generalized to include a desktop client, a special backup agent, or a central access point such as a file server replacement.
In Gladinet’s product suite, the desktop client is Gladinet Cloud Desktop; the special backup agent is Gladinet Cloud Backup; and the file server replacement is Gladinet CloudAFS. The three products have a similar look and feel, as they fit into a single platform for cloud storage access.
2. Many Cloud Storage Services Co-exist
Since we are still in the early phase of the cloud computing movement, there are many different vendors in the field, along with many different technologies and cloud storage protocols.
To name a few big ones in the field, there are Amazon S3, Rackspace Cloud Files, Windows Azure Storage, Google Storage for Developers, OpenStack-based, EMC Atmos-based and many more.
The market is more diversified than what we saw with operating systems (Windows/Mac/UNIX) or in the mobile industry (iOS/Android/Symbian/…). It could be that the market is still in early development and consolidation will happen later on.
From the ground up, Gladinet has a broad range of cloud storage connectors, provided from a single Gladinet Cloud Access platform. Gladinet supports Amazon, Rackspace, Windows, Google, OpenStack, EMC and many more cloud storage services.
3. Hosted Service Provider Entering Market
Cloud storage service is new and currently dominated by Amazon S3. However, cloud storage service is a subset of hosted services, and many hosted service providers around the world are entering the market. We have seen companies like AT&T, Peer1, Internap and Korea Telecom start their own cloud storage services.
As it stands now, hosted service providers can buy technology from EMC, Mezeo or Scality to build a cloud storage service. They can also build it in-house on OpenStack, the open-source implementation rooted in Rackspace Cloud Files.
Gladinet partners with these companies and provides native support for EMC Atmos, OpenStack, Mezeo and Scality. This makes it easy for hosted service providers to partner with Gladinet to provide an end-to-end solution to customers, bridging customers’ on-premises IT infrastructure with the provider’s cloud storage service.
4. Continued Cloud Storage Use Case Development
Several years ago there was no cloud storage, only online backup. Since then, cloud storage use cases have continued to develop from backup to access to sync and more. We have observed and summarized these as the BASIC (Backup, Access, Sync, Integration/Identity, Control) cloud storage use cases.
Avkash Chauhan described How to set Windows Azure Storage Provider for Commvault Enterprise Backup Solution in an 8/11/2011 post:
CommVault provides enterprise backup software, disk-to-disk backup, data protection, deduplication, e-discovery and a host of solutions for large and small businesses. If you are a Commvault user, you can configure Windows Azure Storage as a backup service adapter for your Commvault backup application.
To configure the Windows Azure Storage adapter with Commvault, you first need to create a Windows Azure Storage service, e.g., commvaultbackup (as in this example), at https://windows.azure.com using your Windows Azure service account:
Note: You can download CloudBerry Azure Explorer from "http://cloudberrylab.com/?page=explorer-azure"
After the container is created you can verify your Windows Azure Storage as below:
Service Host: blob.core.windows.net (This field must be exactly as blob.core.windows.net)
Account Name: commvaultbackup (Windows Azure Storage Service Name)
Access Key : Please copy and paste the correct Azure Storage Access Key
Verify Access Key : Please copy and paste the correct Azure Storage Access Key again
Container: Please enter the container name created in above steps.
Once you select “OK”, you will see another dialog as below:
Now your configuration will be completed and if you open “commvaultcontainer” in Azure Storage Explorer, you will see a new folder as below:
If you hit the following error, the configuration is incorrect. Be sure that you have the correct Azure service name and that your “Service Host” field is set to “blob.core.windows.net” exactly.
Error: [Cloud] There is a name lookup error.
See Valery Mizonov (@TheCATerminator) described a Hybrid Reference Implementation Using BizTalk Server, Windows Azure & SQL Azure in an 8/12/2011 post to the Windows Azure CAT blog in the Windows Azure Platform Appliance (WAPA), Hyper-V and Private/Hybrid Clouds section below.
Chris Woodruff made eight updates to the OData Primer on 8/12/2011:
Valery Mizonov (@TheCATerminator) posted a link to the source code for his Windows Azure Inter-Role Communication Using Service Bus article on 8/11/2011 to the Windows Azure CAT Team blog:
The sample demonstrates how Windows Azure developers can leverage the Windows Azure Service Bus to enable asynchronous inter-role communication in cloud-based solutions using Service Bus topics and subscriptions. The code sample offers a complete solution, one that enables simplified, scalable, loosely coupled communication between role instances on the Windows Azure platform.
The sample code is available for download from the MSDN Code Gallery. Please note that the source code is governed by the Microsoft Public License (Ms-PL) as explained in the corresponding legal notices.
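The topic/subscription fan-out at the heart of the sample can be sketched in plain Python. The Topic class below is an in-memory stand-in for illustration only, not the Azure SDK; every subscription receives its own copy of each published message, which is what makes the roles loosely coupled:

```python
import queue

class Topic:
    """In-memory stand-in for a Service Bus topic with named subscriptions."""
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        # Each subscription is modeled as its own queue of pending messages.
        sub = queue.Queue()
        self.subscriptions[name] = sub
        return sub

    def publish(self, message):
        # Fan-out: every subscription receives its own copy of the message,
        # so roles never need to address each other directly.
        for sub in self.subscriptions.values():
            sub.put(message)

topic = Topic()
worker_a = topic.subscribe("worker-a")
worker_b = topic.subscribe("worker-b")
topic.publish({"command": "refresh-config"})
print(worker_a.get_nowait())  # {'command': 'refresh-config'}
print(worker_b.get_nowait())  # {'command': 'refresh-config'}
```

In the real Service Bus, subscriptions can additionally carry filters so each role sees only the messages relevant to it.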
The OakLeaf version of Valery’s original article of 8/2/2011 is here.
Also see Valery Mizonov (@TheCATerminator) described a Hybrid Reference Implementation Using BizTalk Server, Windows Azure & SQL Azure in an 8/12/2011 post to the Windows Azure CAT blog in the Windows Azure Platform Appliance (WAPA), Hyper-V and Private/Hybrid Clouds section below.
Liz Tay reported Westpac bursts risk analysis to Azure in an 8/12/2011 post to IT News for Australian Business:
Westpac has begun bursting part of its data crunching workload onto Microsoft’s hosted Azure platform to better inform the bank’s decision-making processes. [Link and emphasis added.]
“The only limitation really is the amount of computing power available, as complexity of calculations grows exponentially once you request to see finer and finer points,” he explained.
“At some stage, it becomes simply uneconomical to build an in-house platform with massive amounts of memory, processing power and so on, that will only be used for a few hours a night.”
As a major, established player in a highly regulated industry, Westpac tended to approach cloud computing conservatively.
In June, Westpac group technology executive Bob McKinnon revealed plans to consume email and collaboration software as a managed service from Microsoft and Fujitsu.
Westpac chief technology officer Sarv Girn has previously estimated the 'private cloud' to cost 30 to 40 percent less than what was previously required. Zaid said Westpac did not share any IT infrastructure with other tenants of Fujitsu’s Australian data centre.
National privacy laws, PCI compliance and guidelines published by Australian financial services regulator APRA typically discouraged banks from moving customer data offshore.
Those regulatory concerns did not apply to the analytics platform which, as Zaid said, involved “no customer data”.
Some 50 to 60 percent of calculations were performed in Microsoft's US-based cloud. The application itself remained on Westpac infrastructure to protect the bank’s intellectual property and avoid lock-in.
“Surprising enough, our risk and security gurus were comfortable with considering the cloud for this application,” Zaid told a Gartner Summit this week.
“There is no customer data involved, and the only information going into the cloud and back was a string of numbers which did not have any meaning to those outside the bank.
“Even if the US Government decides to capture it [under the US Patriot Act], so what,” he said.
Zaid said the application crunched publicly available data, including yield curves, commodity and security prices, indices, Reuters and Bloomberg feeds.
Only input and output data needed to be transferred from Azure to Westpac in Australia, so while computational requirements were high, traffic was low.
“By design, the actual system remains in-house,” Zaid said. “Only when the volume of calculations exceeds a set threshold, it will spill over the excess into the cloud.
“The idea behind the design was to retain the IP in-house and ensure that there is no lock-in from any vendor and we can move to any provider that we feel is best suited to that at the time.
“We considered Azure as Microsoft is one of our primary partners, as well as Amazon and a couple of other providers. Initial launch is with Azure in the States but there is capability to point our application to other clouds as well.”
Zaid urged his peers to look beyond the cloud computing hype, noting that there was “no point putting something in the cloud because it’s fashionable”.
The role of technologists, he said, was to ensure that IT products and environments were stable, well-designed and secure to support business outcomes.
“We all sit here saying, what are our challenges, what keeps us awake at night,” he said. “The worst thing about being a technologist is getting called in the middle of the night, saying that your system is broken.
“It’s our job to ensure that the systems work, no matter what the business is asking for, we can’t rush in. We have to do testing. We have to ensure the quality of the design, the security and so on.
"If we don’t do things right, [customers and employees] will suffer as a result and if we make their lives miserable, our lives will be miserable.”
The Windows Azure Team (@WindowsAzure) reported New Posts Explain Two Methods for Remote Access to Windows Azure VMs in an 8/12/2011 post:
Mario Kosmiskas just published two great new posts on the Distributed Development blog that walk through a couple of different ways to configure a service to allow for remote access to Windows Azure VMs.
- “Command Line Access to Azure VMs – PowerShell Remoting” focuses on enabling remote access in PowerShell.
- "Command Line Access to Azure VMs – SSH" explains how to enable access in SecureShell.
Mario posted the following Azure-related articles earlier this year:
The Interoperability Bridges Team explained how to Build and deploy a Windows Azure PHP application in an 8/8/2011 post (missed when published):
- Setup the Windows Azure Development Environment
- Setup the Windows Azure SDK for PHP
Recommended Additional Reading
In this tutorial you will learn how to build a full Windows Azure PHP application. You will be guided from nothing all the way through the deployment phase when your application will be public.
You will need to ensure you have properly setup the Windows Azure development environment and the Windows Azure SDK for PHP. See links in the Pre-Requisites section for more information.
Unlike other tutorials no sample files will be provided for this tutorial. All code needed will be contained within this document.
This tutorial will assume you are working out of the C:\temp\WindowsAzurePHPApp directory for all project files and commands.
Creating the Windows Azure PHP application base
Building a Windows Azure PHP application can be a complex process involving setting up the ServiceConfiguration.cscfg and ServiceDefinition.csdef files, as well as writing scripts to install PHP and apply any PHP customization you would like. Luckily, the Windows Azure SDK for PHP provides a convenient scaffolding tool that creates a basic Windows Azure PHP application structure for you. At its simplest, this lets you copy in your project files and deploy immediately; at the other extreme, it gives you control over virtually every aspect of your deployment. For the purposes of this tutorial a basic Windows Azure PHP application will suffice.
To create this basic Windows Azure PHP application, open a command prompt and run the following command:
scaffolder run -out=
You may now navigate to C:\temp\WindowsAzurePHPApp, where you should see a file layout similar to the following:
Right now you could package and deploy this application and it would install PHP on your Windows Azure deployment for you; however it would not be doing much else.
A breakdown of the structure is as follows:
- PhpOnAzure.Web - This folder will be the document root of your application. All your application files go here
- bin - Contains the startup scripts that install PHP and perform other miscellaneous functions when you deploy
- php - Contains custom php.ini settings as well as any custom PHP extensions your application requires
- resources - Contains miscellaneous files which support deploying your application
- diagnostics.wadcfg - A basic diagnostics setup file. You can use this to track items such as CPU usage and network bandwidth
- ServiceConfiguration.cscfg - Contains the Windows Azure configuration settings for your deployment. This file gets uploaded along with the final package
- ServiceDefinition.csdef - Contains information about the setup of your deployment. This file is included in the final package that is uploaded to Windows Azure
This tutorial will not go into depth on the configuration files, however there will be a list of useful resources located at the end of this tutorial that will teach you more about these files.
Build the PHP application
Now comes the fun part, putting your PHP skills to use and building a shiny new application! For simplicity and ease you will be creating a simple PHP info page, but you will quickly see how easy it is to build a PHP application on top of the Windows Azure platform.
Inside the PhpOnAzure.Web folder create a new file named index.php and open it for editing. Add a single line of PHP that calls phpinfo() and save the file: <?php phpinfo();
Nothing fancy, but still a highly effective example.
Run the PHP application in the local development environment
Usually you will want to test your application before releasing it to production. There are two intermediate steps you can use to test before release: the local development environment and the staging server. The staging server is available in the Windows Azure Portal, and the process to utilize it is very similar to the production deployment. When you are satisfied with the way the staging application runs, you can quickly switch from staging to production through the portal. We, however, are going to focus on the local development server in this tutorial.
The following steps will run your application in the local development server:
- Open a command prompt
- Run the command 'package create -in="C:\temp\WindowsAzurePHPApp" -out="C:\temp\WindowsAzurePHPApp\build" -dev=true'
- Your application will begin building, and in a few seconds the default web browser will open and you should see the output of phpinfo() similar to below
Run the PHP application on Windows Azure
To run any application on Windows Azure you need two files, a Windows Azure package and a ServiceConfiguration.cscfg. By slightly tweaking the previous command both files will be generated and can immediately be uploaded through the Windows Azure Portal.
- Open a command prompt
- Run the command 'package create -in="C:\temp\WindowsAzurePHPApp" -out="C:\temp\WindowsAzurePHPApp\build" -dev=false'
You will now have two new files inside of C:\temp\WindowsAzurePHPApp\build that will be uploaded to Windows Azure: WindowsAzurePHPApp.cspkg and ServiceConfiguration.cscfg
It is now time to upload your application to Windows Azure. Instead of repeating the deployment instructions here please see the excellent article by Jas Sandu on Deploying your PHP application to Windows Azure.
After you have uploaded your files your deployment will begin to build. This generally takes several minutes as the Windows Azure service is creating a new Windows Server 2008 instance, installing PHP, and installing your application.
When the role status changes to Ready, your application is ready to be viewed.
Congratulations on building a full Windows Azure PHP application from the ground up!
The Windows Azure SDK for PHP supports powerful configuration features, and the following are links to a few MSDN articles that will help you take advantage of the capabilities offered by the Windows Azure Platform.
Wade Wegner (@WadeWegner) described Cloud Cover Episode 55 - Visual Studio LightSwitch with Jay Schmelzer in an 8/12/2011 post:
In this episode, Wade is joined by Jay Schmelzer—Principal Director of PM in Visual Studio Biz Apps—to discuss Visual Studio LightSwitch and its relationship to Windows Azure. Jay takes the time to explain the purpose of LightSwitch and show multiple demonstrations of how you can use services in the Windows Azure Platform to build line of business applications.
In the news:
- More Windows Azure Marketplace Content & Hands On Lab
- Microsoft Windows Azure Development Cookbook
- WA Toolkit for iOS: New Project Experience with Windows Azure Access Control Service
- WA Toolkit for iOS: New Project Experience for Accessing Windows Azure Storage
- Redirecting to HTTPS in Windows Azure: Two Methods
I’ve written about Visual Studio LightSwitch several times in this blog and in my Redmond Review column, including this month’s piece, “LightSwitch: The Answer to the Right Question.” All throughout, I’ve been pretty clear in my support of the product.
A little over two weeks ago LightSwitch shipped, and I think it’s off to a very good start. To help it along, I wrote a series of five whitepapers on LightSwitch for the product team, and they were just published by Microsoft. You can find them all by looking around the product’s site, but here are direct links to the PDFs for each paper:
- What is LightSwitch? ›
- Quickly Build Business Apps ›
- Get More from Your Data ›
- Wow Your End Users ›
- Make Your Apps Do More with Less Work ›
The first paper’s a bit of a wonkish piece on what makes LightSwitch different and why it’s needed in the market. After that formal opening, the papers get less “white” and instead walk through the product in detail, with an abundance of screenshots. If you’re curious about the product, this is an easy way to get a good look at it without having to install it or watch a video from beginning to end. I hope that even skeptics will start to see validity in the point I make several times over: while LightSwitch does a lot for you, it also gets out of your way and lets you do a bunch on your own. That’s a balance that I don’t think a lot of business application productivity tools attain.
The fifth paper covers LightSwitch extensions, which is a topic so late-breaking that I finished the paper less than a week ago. LightSwitch already has extensions offered by Infragistics, DevExpress, ComponentOne, RSSBus, and First Floor Software. Telerik has on its Web site a host of hands-on labs demonstrating how to use its Silverlight components in LightSwitch applications. Extensions from the community are already starting to pop up on the Visual Studio Gallery too. Together these offerings represent rather robust support for a fledgling product, and I expect them, and the degree of integration in extensions, to continually improve.
Take a look at LightSwitch and keep a lookout for its progress and success. The best way to really get the product is to learn the tooling quickly, then think of a database and application you need to build out and see how fast you can get it running using the product. You may be surprised, not only by how quickly you finish, but by how sturdy and extensible the application you built actually is.
There are no guarantees, but I think the LightSwitch product could really catch on.
Rowan Miller explained the ADO.NET Team’s Next EF Release Plans in an 8/11/2011 post:
We recently posted about our plans to rationalize how we name, distribute and talk about releases. The feedback we have heard so far is confirming that we are headed down the right track. Following the plans we shared, the next installment of Entity Framework will be EF 4.2, this post will share the details about our plans for that release.
When we released ‘EF 4.1 Update 1’ we introduced a bug that affects third-party EF providers using a generic class for their provider factory implementation, such as WrappingProviderFactory&lt;TProvider&gt;. We missed this during our testing, and it was reported by some of our provider writers after we had shipped. If you hit this bug you will get a FileLoadException stating “The given assembly name or codebase was invalid”. This bug is blocking some third-party providers from working with ‘EF 4.1 Update 1’, and the only workaround for folks using an affected provider is to remain on EF 4.1. So we will be shipping this version to fix it; this will be the only change between ‘EF 4.1 Update 1’ and ‘EF 4.2’. Obviously a single bug fix wouldn’t normally warrant bumping the minor version, but we also wanted to take the opportunity to get onto the semantic versioning path rather than calling the release ‘EF 4.1 Update 2’.
When is it Shipping?
One thing we learnt from ‘EF 4.1 Update 1’ is that we should always ship a beta, no matter how small the changes. We are aiming to have a beta available next week. Provided no additional problems are reported we plan to ship the RTM version in September.
Where is it Shipping?
The beta will be available as the EntityFramework.Preview NuGet package. The RTM version will be available as an update to the EntityFramework NuGet package. We will also make the T4 templates for using DbContext with Model First & Database First available on Visual Studio Gallery. We will no longer be shipping an installer on Microsoft Download Center.
What’s Not in This Release?
As covered earlier this release is just a small update to the DbContext & Code First runtime. The features that were included in EF June 2011 CTP are part of the core Entity Framework runtime and will ship at a later date. Our Migrations work is continuing and we are working to get the next alpha in your hands soon.
Valery Mizonov (@TheCATerminator) described a Hybrid Reference Implementation Using BizTalk Server, Windows Azure & SQL Azure in an 8/12/2011 post to the Windows Azure CAT blog:
Integrating an on-premise process with processes running in Windows Azure opens up a wide range of opportunities that enable customers to extend their on-premises solutions into the Cloud environment.
Based on real-world customer projects led by the Windows Azure Customer Advisory Team (CAT), this reference implementation comprises a production-quality, fully documented hybrid solution that demonstrates how customers can extend their existing on-premises BizTalk Server infrastructure into the Cloud. The solution is centered on the common requirement to process large volumes of transactions originated from the on-premises system and off-loaded into Windows Azure to take advantage of the elasticity and on-demand compute power of the Cloud platform. The reference implementation addresses the above requirements and provides an end-to-end technical solution architected and built for scale-out.
The main technologies and capabilities covered by the reference implementation include: Windows Azure platform services (compute, storage), Windows Azure Service Bus, SQL Azure and BizTalk Server 2010.
The reference implementation is founded on reusable building blocks and durable patterns widely recognized as “best practices” in Windows Azure solution development.
This project is implemented as a hybrid solution in which BizTalk Server represents a fundamental dependency. Existing BizTalk Server customers can install and use the reference implementation with a minimum of modifications. However, the solution also carries many reusable patterns and building blocks which the developer audience can explore in isolation from the larger end-to-end reference implementation.
The accompanying source code is available for download from the MSDN Code Gallery. Please note that the source code is governed by the Microsoft Public License (Ms-PL) as explained in the corresponding legal notices.
Components of the Solution
BizTalk Server 2010 is a core element of the hybrid solution architecture. The transactions that need to be processed in the Cloud originate from a BizTalk application running on-premises. The BizTalk application also hosts the service endpoints which make mission-critical capabilities such as data transformation, BAM and BRE available to the cloud-based applications.
Business Activity Monitor (BAM)
BAM is used to store activities that are generated and tracked by the BizTalk Server application and Cloud services. All activities are collected into an on-premises BAM database and are visible through the BAM portal.
Business Rules Engine (BRE)
The BRE is used by both the on-premises BizTalk application and the cloud-based services to drive decision making and define the operational aspects of data processing in the Cloud. In addition, the BRE is utilized for authoring and managing the complex application configuration used by the hybrid solution. Lastly, BRE policies provide the extensibility mechanism through which custom activities can occur at runtime depending on the type or content of messages flowing to and from the BizTalk application.
Windows Azure Service Bus
The Service Bus provides the endpoint in the cloud through which all traffic (in both directions) is relayed. The Service Bus is also leveraged to provide inter-role communication between worker roles running on the Windows Azure platform.
Windows Azure Worker Role
The worker role processes messages, pulling each message from the queue and shredding the collection of records into individual records that are stored in the SQL Azure database.
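The shredding step can be sketched in Python; the batch format and field names below are illustrative only, not taken from the reference implementation:

```python
import json

def shred_batch(batch_json):
    """Split one batched message into individual records for insertion."""
    batch = json.loads(batch_json)
    # Each record becomes its own row; a real worker role would insert
    # these into the SQL Azure database inside a retry-aware loop.
    return [{"batch_id": batch["id"], **record} for record in batch["records"]]

message = json.dumps({"id": "b-42", "records": [{"sku": "A"}, {"sku": "B"}]})
rows = shred_batch(message)
print(rows)  # [{'batch_id': 'b-42', 'sku': 'A'}, {'batch_id': 'b-42', 'sku': 'B'}]
```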
Windows Azure Queue
The storage queue holds references to messages that are waiting to be processed. Because the size of each item in the queue is limited to 8 KB, the actual messages are stored temporarily in blob storage.
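The queue-plus-blob pattern described above can be sketched with in-memory stand-ins. The 8 KB threshold matches the limit mentioned; the store names and message contents are illustrative:

```python
# In-memory stand-ins for the Azure queue and blob storage (illustrative only).
blob_store = {}
work_queue = []

MAX_QUEUE_ITEM_BYTES = 8 * 1024  # queue items were capped at 8 KB

def enqueue_message(payload: bytes) -> None:
    if len(payload) <= MAX_QUEUE_ITEM_BYTES:
        work_queue.append(("inline", payload))
    else:
        # Too large for the queue: park the payload in blob storage
        # and enqueue only a reference to it.
        blob_name = f"msg-{len(blob_store)}"
        blob_store[blob_name] = payload
        work_queue.append(("blob-ref", blob_name))

def dequeue_message() -> bytes:
    kind, value = work_queue.pop(0)
    if kind == "inline":
        return value
    return blob_store.pop(value)  # fetch and clean up the temporary blob

enqueue_message(b"small job")
enqueue_message(b"x" * 100_000)  # large message travels via blob storage
print(dequeue_message())          # b'small job'
```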
Hybrid Solution Architecture
The following diagram depicts the architecture of the hybrid reference implementation:
No significant articles today.
The Central Ohio Cloud Computing User Group announced on 7/18/2011 a Windows Azure and Windows Phone 7 Fire Starter – August 13th (missed when published):
The Central Ohio Cloud Computing User Group and Central Ohio Windows Phone User Group are joining forces to bring Central Ohio developers a Windows Azure and Windows Phone 7 Fire Starter! The event will be held on Saturday, August 13th at the Microsoft office in Columbus.
Do you like Windows Phone 7 development? Do you like Windows Azure development? Then let’s combine both of those loves in a single event! The fire starter will feature top Microsoft evangelists and community experts to lead you through the process of building Windows Phone 7 apps that leverage the flexibility and power of the Windows Azure platform. The Windows Phone 7 platform provides you with an innovative user experience, and Windows Azure provides you with a powerful platform from which to serve up your mobile applications. The firestarter will feature educational presentations to ramp up your skill set with these technologies, as well as time for hands-on labs to flex your new skills.
If you’re not a Windows Phone 7 developer, fret not! There is still something for you! At the firestarter we’ll also take a quick look at how you can leverage Windows Azure to support your iOS or Android applications.
Please visit http://wazwp7firestarter.eventbrite.com to learn more and RSVP.
The Boston Azure Users Group announced a meeting about the Windows Azure Toolkit for Windows Phone 7 to be held 8/25/2011 at 6:30 PM EDT:
The next meeting of the Boston Azure User Group is fast approaching! Register now for the Thursday August 25, 2011 meeting.
This month's featured speaker is John Garland of Wintellect. John will introduce Windows Phone 7 programming to Windows Azure developers by showing how the Windows Azure Toolkit for Windows Phone 7 makes it easy to build phone applications.
When? Thursday August 25, 2011; see Meeting Topics for timing of specific activities.
Click here to register
FEATURED SPEAKER JOHN GARLAND
This presentation will show how the Windows Azure Toolkit for Windows Phone 7 can be used to quickly create Azure-enabled applications for the Windows Phone platform. This talk will also include a discussion of some of the new features that will be available in the upcoming Windows Phone Mango release due later this Fall, and how they can be used to further enhance the experience of working with Azure-based applications.
Klint Finley (@Klintron) posted From Big Data to NoSQL: The ReadWriteWeb Guide to Data Terminology (Part 3) to the ReadWriteEnterprise blog on 8/12/2011:
It's hard to keep track of all the database-related terms you hear these days. What constitutes "big data"? What is NoSQL, and why are your developers so interested in it? And now "NewSQL"? Where do in-memory databases fit into all of this? In this series, we'll untangle the mess of terms and tell you what you need to know.
In Part One we covered data, big data, databases, relational databases and other foundational issues. In Part Two we talked about data warehouses, ACID compliance, distributed databases and more. Now we'll cover non-relational databases, NoSQL and related concepts.
A non-relational database is simply any type of database that doesn't follow the relational model (see Part One for a full definition of relational database). Several types of database get lumped into this category: document-oriented databases, key-value stores, BigTable clones and graph databases are the main ones, but there are a few others.
Non-relational databases tend to be used for big data or unbounded data. Although caching and in-memory data can mitigate many of the problems a relational database may encounter, there are times when data is either being updated too quickly or the data sets are simply too large to be handled practically by a relational database.
Non-relational databases are usually thought of as not being ACID compliant, but some are.
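One of the non-relational types listed above, the document-oriented store, can be sketched as a toy in Python: schemaless records keyed by id plus a naive field-match query. No real product's API is implied; everything here is illustrative:

```python
class DocumentStore:
    """A toy document-oriented store: schemaless records keyed by id."""
    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        # No fixed schema: each document can carry different fields.
        self._docs[doc_id] = dict(doc)

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def find(self, **criteria):
        # Simple field-match query; real stores index these lookups.
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

store = DocumentStore()
store.put("1", {"name": "Ada", "role": "engineer"})
store.put("2", {"name": "Grace", "role": "engineer", "rank": "admiral"})
print(store.find(role="engineer"))
```

Note that the two documents have different fields, which a relational table would not allow without schema changes.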
NoSQL is shorthand for non-relational database. Some have suggested that the "no" should in this case stand for "not only." According to Wikipedia, the first known use was by Carlo Strozzi, but he was using it to refer to a relational database that didn't expose a SQL interface. He later said that the way the term is used now is more accurately "NoREL," not "NoSQL." Perhaps the earliest use of the term with its current meaning was the first NOSQL Meetup.
According to Wikipedia, the CAP theorem states that a distributed computer system can't guarantee all of the following:
- Consistency, all nodes see the same data at the same time
- Availability, a guarantee that every request receives a response about whether it was successful or failed
- Partition tolerance, the system continues to operate despite arbitrary message loss
The CAP theorem was proposed by computer scientist Eric Brewer of University of California, Berkeley and later proved by Seth Gilbert and Nancy Lynch of MIT.
BASE and Eventual Consistency
Most of the non-relational databases that have become popular in recent years are capable of being constructed as distributed databases as well, and that's often the reason that they are used. The CAP theorem tells us, however, that we can't always have both ACID levels of consistency and high availability. Many non-relational databases therefore use a different standard for consistency: BASE, which stands for Basically Available Soft-state Eventually [consistent].
In a distributed database system, different copies of the same data set may exist on several servers. We of course want these data sets to stay consistent. But one of the main points of having multiple servers is to improve performance. We don't necessarily want to tie up each server every time a database table is updated. So we settle for "eventual consistency." "Eventual" usually actually means less than a second. The concept is discussed in greater depth here.
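A minimal sketch of what eventual consistency looks like in practice, using two in-memory dicts as stand-in replicas (the replica and function names are illustrative, not any real database's API):

```python
# Two replicas of the same record; writes land on one replica first
# and propagate to the other "eventually" via an async replication log.
replica_a = {"counter": 0}
replica_b = {"counter": 0}
pending = []  # replication log not yet applied to replica_b

def write(key, value):
    replica_a[key] = value
    pending.append((key, value))   # queued for asynchronous replication

def sync():
    # In a real system this runs continuously in the background.
    while pending:
        key, value = pending.pop(0)
        replica_b[key] = value

write("counter", 1)
print(replica_a["counter"], replica_b["counter"])  # 1 0 (stale read from b)
sync()                                             # replication catches up
print(replica_a["counter"], replica_b["counter"])  # 1 1 (consistent again)
```

The window between the two prints is the "eventual" part: a client reading from replica_b during that window sees stale data, which is the trade the CAP theorem says we make for availability.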
MapReduce and Apache Hadoop
Map and reduce are methods in computer science for manipulating data. MapReduce is a distributed computing system created by Google specifically for working with large data sets and distributed data stores. It's named after these two computer science methods.
Apache Hadoop is an open source implementation of the ideas from a paper Google published on MapReduce. MapReduce is used in conjunction with a data store called BigTable, which we'll explain below. Hadoop has its own data store called HBase, which is based on ideas explained in a paper on BigTable.
The collection of computers that comprise a distributed computing system is called a "cluster." Each computer in the cluster is referred to as a "node." Here's how Wikipedia explains MapReduce:
- "Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
- "Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
Sometimes the term "MapReduce" refers to this programming method; other times it refers to the Google/Hadoop distributed computing approach (which is a specific application of these methods).
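The map and reduce steps above can be sketched with a toy Python word count run on a single machine (the function names and input are illustrative, not any framework's actual API):

```python
from collections import defaultdict

# "Map" step: each worker turns its chunk of input into (key, value) pairs.
def map_step(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle: group intermediate pairs by key before reducing.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# "Reduce" step: combine all values for a key into the final answer.
def reduce_step(key, values):
    return key, sum(values)

chunks = ["big data big", "data stores"]          # one chunk per worker node
mapped = [pair for chunk in chunks for pair in map_step(chunk)]
results = dict(reduce_step(k, v) for k, v in shuffle(mapped).items())
print(results)  # {'big': 2, 'data': 2, 'stores': 1}
```

In a real cluster the map calls run in parallel on worker nodes and the shuffle moves data across the network; the shape of the computation is the same.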
Key-value stores are schema-less data stores. As Oren Eini writes, "The simplest No SQL databases are the Key/Value stores. They are simplest only in terms of their API, because the actual implementation may be quite complex."
Using the blog example again (see the relational database section of Part One and the columnar database section of Part Two), a blog that uses a key-value store might store posts with just a key (say, a unique number for each post) and then everything else will be lumped together as the value. Here's our blog example in a very simple key-value store:
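A rough Python sketch of that layout, with made-up post fields (a plain dict playing the role of the store):

```python
# A toy key-value store: the key is the post ID, and everything else is
# an opaque blob the database does not interpret.
kv_store = {
    "1": "title=First Post; author=Alice; body=Hello world; tags=intro",
    "2": "title=Second Post; author=Bob; body=More news",
}

# Lookup by key is trivial...
print(kv_store["1"])

# ...but finding all posts by Alice requires scanning every value,
# since the store has no schema to query against.
by_alice = [k for k, v in kv_store.items() if "author=Alice" in v]
print(by_alice)  # ['1']
```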
The lack of a schema can make it much more difficult to create custom views of the data, which is one of the reasons map and reduce are used to query many non-relational databases.
Document databases contain information stored in the form of documents. They are considered semi-structured. A document-oriented database is more complex than a key-value store, but simpler than a relational database.
Here's the old blog database example again:
Notice that in this version, there's no need for the "Categories" field to be set in the second document. Contrast this with the relational database, which had the field but left it blank.
This allows for more flexibility than a relational database, while providing more structure than a key-value store and making it easier to retrieve and work with the stored data. It would be much easier to create a list of posts with a common tag using a document database than a plain key-value store.
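A sketch of that flexibility, again with made-up post fields (Python dicts standing in for the documents):

```python
# Document-oriented sketch: each post is a self-describing document,
# and documents in the same collection need not share fields.
posts = [
    {"title": "First Post", "author": "Alice",
     "categories": ["intro"], "tags": ["hello"]},
    {"title": "Second Post", "author": "Bob",
     "tags": ["hello"]},  # no "categories" field at all
]

# Unlike a key-value blob, the structure is rich enough to query by field,
# e.g. all posts sharing a tag:
hello_posts = [p["title"] for p in posts if "hello" in p.get("tags", [])]
print(hello_posts)  # ['First Post', 'Second Post']
```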
BigTable-clone, Tabular Stores or Column Family Database
BigTable is a column-oriented, distributed, non-relational data store created by Google. Like MapReduce, Google published an academic paper detailing its workings. That led to a few clones, most notably HBase, the data store used with Apache Hadoop. Apache Cassandra is also influenced by BigTable. Some refer to them as "tabular stores." Eini calls these "column family databases."
BigTable-clones are actually very similar to relational databases, but do away with a few structural elements. This necessitates using complex search methods, but makes it possible to scale to petabytes of size. BigTable and its clones store data in columns, super-columns and column-families. We can think of super-columns and column-families as "columns of columns."
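A sketch of that nesting with plain Python dicts (the row key, family and column names are made up for illustration):

```python
# Column-family sketch: a row key maps to column families, which map to
# columns; a "super column" nests one level deeper -- a column of columns.
blog = {
    "post:1": {                               # row key
        "content": {                          # column family
            "title": "First Post",
            "body": "Hello world",
        },
        "meta": {                             # another column family
            "author": {                       # super column
                "name": "Alice",
                "email": "alice@example.com",
            },
        },
    },
}

print(blog["post:1"]["content"]["title"])          # First Post
print(blog["post:1"]["meta"]["author"]["name"])    # Alice
```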
Object databases apply the principles of object-oriented programming to database design. This means programmers can store data using the same structures that the programs which will eventually access the data use, so the program doesn't need to translate the data into another structure when accessing the database. Instead of using a query language such as SQL, a program can perform operations on the data directly. One downside is that object databases are often tied to a specific programming language, such as Java or C++.
One of the most popular object databases is Objectivity/DB, which is used in applications ranging from computer aided drafting to telecommunications to scientific research.
More information on object databases can be found here.
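Python's standard-library shelve module is a rough stand-in for the idea (real object databases add queries, indexes and transactions): the in-memory object is stored as-is, with no mapping to rows and columns. The Post class and file path below are invented for the example.

```python
import os
import shelve
import tempfile

class Post:
    """An ordinary application object, stored without translation."""
    def __init__(self, title, tags):
        self.title = title
        self.tags = tags

path = os.path.join(tempfile.mkdtemp(), "blog.db")

# Store the object directly -- no schema, no INSERT statement.
with shelve.open(path) as db:
    db["post1"] = Post("First Post", ["intro"])

# Read it back and use it as a normal object, no query language needed.
with shelve.open(path) as db:
    post = db["post1"]
print(post.title)  # First Post
```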
Graph databases apply graph theory to the storage of information about the relationships between entries. The relationships between people in social networks is the most obvious example. The relationships between items and attributes in recommendation engines is another.
Yes, it has been noted by many that it's ironic that relational databases aren't good for storing relationship data. Adam Wiggins from Heroku has a lucid explanation of why that is. Short version: among other things, relationship queries in RDBSes can be complex, slow and unpredictable. Since graph databases are designed for this sort of thing, the queries are more reliable.
Popular examples include Neo4j and DEX.
We're partial to this illustration of how Neo4j works, from a presentation by Peter Neubauer, the COO of Neo4j sponsor company Neo Technologies:
Another good example comes from structur, a content management system that uses Neo4j. This illustration shows how the graph database model can be applied to content (to a blog system, for example):
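A hand-rolled sketch of the idea (not Neo4j's actual API): relationships are stored as first-class edges, so relationship queries are simple traversals rather than the multi-way joins an RDBMS would need. The names and relations below are made up.

```python
# Edges stored directly as (subject, relation, object) triples.
edges = [
    ("Alice", "follows", "Bob"),
    ("Bob", "follows", "Carol"),
    ("Alice", "likes", "Post 1"),
]

def neighbors(node, relation):
    """Follow one hop of a named relationship from a node."""
    return [o for s, r, o in edges if s == node and r == relation]

print(neighbors("Alice", "follows"))   # ['Bob']

# Friends-of-friends is just one more hop along the same edges --
# the kind of query that gets slow and awkward in a relational schema.
fof = [f for friend in neighbors("Alice", "follows")
       for f in neighbors(friend, "follows")]
print(fof)  # ['Carol']
```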
NewSQL is a term coined by 451 Group analyst Matthew Aslett to describe a new wave of software projects that try to make RDBMSes more scalable. Examples include:
- Drizzle, which tries to rebuild the popular RDBMS MySQL from the ground up.
- HandlerSocket, a MySQL plugin that provides key-value store functionality.
- VoltDB, an in-memory, distributed relational database that is fully ACID compliant.
Special thanks to Tyler Gillies for his help with this series.
- Open Source Business Intelligence Vendor Pentaho Expands Its Big Data Support
- Twitter Will Open-Source Storm, BackType's "Hadoop of Real-Time Processing"
- Cloudera and Dell Announce Partnership for Turnkey Hadoop Solution
- Google and SAP Team-Up to Help You Visualize Big Data
- New Big Data Search Engine Combines CouchDB and Lucene
Windows Azure tables and blobs are examples of key/value stores. The concatenation of the PartitionKey and RowKey forms the key for Azure tables (see the OakLeaf Systems Azure Table Services Sample Project demo from my Cloud Computing with the Windows Azure Platform book).
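A rough sketch of that composite key, with a Python dict standing in for the table (the partition and row values are invented; this is not the Azure storage client API):

```python
# Azure tables address each entity by (PartitionKey, RowKey); together
# the pair acts as the unique composite key of a key-value store.
table = {
    ("Customers", "0001"): {"Name": "Alice", "City": "Miami"},
    ("Customers", "0002"): {"Name": "Bob", "City": "Tampa"},
    ("Orders", "0001"): {"Total": 42.50},
}

# Point lookup needs both halves of the key...
entity = table[("Customers", "0001")]
print(entity["Name"])  # Alice

# ...while all entities sharing a PartitionKey live in the same partition,
# which is the unit Azure uses to distribute the table across servers.
customers = [v for (pk, rk), v in table.items() if pk == "Customers"]
print(len(customers))  # 2
```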
For my take on Big Data stores, see my Choosing a cloud data store for big data post to SearchCloudComputing.com of 6/2011.
The above is Klint’s last post for ReadWriteWhatever. He’s moving to SiliconAngle.
Barton George (@Barton808) asked Does Hadoop compete with or complement the data warehouse? in an 8/12/2011 post:
Dell’s chief architect for big data, Aurelian Dumitru (aka. A.D.) presented a talk at OSCON the week before last with the heady title, “Hadoop – Enterprise Data Warehouse Data Flow Analysis and Optimization.” The session, which was well attended, explored the integration between Hadoop and the Enterprise Data Warehouse. AD posted a fairly detailed overview of his session on his blog but if you want a great high level summary, check this out:
Some of the ground AD covers:
- Mapping out the data life cycle: Generate -> Capture -> Store -> Analyze ->Present
- Where does Hadoop play and where does the data warehouse? Where do they overlap?
- Where do BI tools fit into the equation?
- To learn more, check out dell.com/hadoop
- My Blog: Introducing the Dell | Cloudera solution for Apache Hadoop — Harnessing the power of big data
- Whitepaper: Introduction to Hadoop
- Whitepaper: Hadoop Business Cases
- AD’s blog post: Hadoop/EDW Integration session at OSCON 2011