Introducing Microsoft Codename “Cloud Numerics” from SQL Azure Labs
Introduction
Table of Contents
- “Cloud Numerics” Background
- The MSCloudNumerics.sln Project Template and Sample Solution
- “Cloud Numerics” Prerequisites (updated 1/28/2012, see below)
- Installing the HPC and “Cloud Numerics” Components
- “Cloud Numerics” Mathematic Libraries for .NET
- “Cloud Numerics” Distributed Array, Algorithm and Runtime Libraries for .NET
- Limitations of “Cloud Numerics”
- Running the MSCloudNumerics Sample Project Locally
- References
Updated 3/17/2012: Added Ronnie Hoogerwerf’s “Data Transfer” and “Cloud Numerics” better together post of 3/14/2012 and Roope Astala’s “Cloud Numerics” Example: Analyzing Air Traffic “On-Time” Data article of 3/8/2012 to the References section at the end of this post. Added details from Adam Hurwitz’s Support added for Cloud Numerics format post of 3/12/2012 to the Microsoft Codename “Data Transfer” Lab blog to the end of the “Cloud Numerics” Background section.
Updated 1/28/2012: Added Visual C++ as an (undocumented) required component of Visual Studio 2010 SP1. If the VC++ feature isn’t present, the alternative is to download and copy the Microsoft Visual C++ 2010 SP1 Redistributable Package (x64) .dll files to added folders that emulate those created by VS 2010 with VC++. These files are a prerequisite for enabling the Windows Azure HPC Cluster to run submitted executable files for a job. (See the “Cloud Numerics” Prerequisites section for details.)
Updated 1/25/2012: My (@rogerjenn) Deploying “Cloud Numerics” Sample Applications to Windows Azure HPC Clusters of 1/25/2012 describes how to configure and deploy two 8-core HPC clusters hosted in Windows Azure and submit the Latent Semantic Indexing (LSICloudApplication) project to the Windows Azure HPC Scheduler for processing.
“Cloud Numerics” Background
Codename “Cloud Numerics” is the latest in a series of new SQL Azure Labs tools for managing and analyzing Big Data in the Cloud with Windows Azure and SQL Azure. Ronnie Hoogerwerf’s introductory The “Cloud Numerics” Programming and runtime execution model post of 1/11/2012 to the Microsoft Codename “Cloud Numerics” blog begins:
Microsoft Codename “Cloud Numerics” is a new .NET® programming framework tailored towards performing numerically-intensive computations on large distributed data sets. It consists of
- a programming model that exposes the notion of a partitioned or distributed array to the user
- an execution framework or runtime that efficiently maps operations on distributed arrays to a collection of nodes in a cluster
- an extensive library of pre-existing operations on distributed arrays and tools that simplify the deployment and execution of a “Cloud Numerics” application on the Windows Azure™ platform
Writing numerical algorithms is challenging and requires thorough knowledge of the underlying math; typically this line of work is the realm of experts with job titles such as: data scientist, quantitative analyst, engineer, etc. Writing numerical algorithms that scale-out to the cloud is even harder. At the same time the ever increasing appetite for and availability of data is making it more and more important to be able to scale-out data analytics models and this is exactly what “Cloud Numerics” is all about. For example, with “Cloud Numerics” it is possible to write document classification applications using powerful linear algebra and statistical methods, such as Singular Value Decomposition or Princip[al] Component Analysis, or to write applications that search for correlations in financial time series or genomic data that work on today’s cloud-scale datasets. [Links added.]
“Cloud Numerics” provides a complete [C#] solution for writing and developing distributed applications that run on Windows Azure. To use “Cloud Numerics” you start in Visual Studio with our custom project definition that includes an extensive library of numerical functions. You develop and debug your numerical application on your desktop, using a dataset that is appropriate for the size of your machine. You can read large datasets in parallel, allocate and manipulate large data objects as distributed arrays, and apply numerical functions on these distributed array[s]. When your application is ready and you want to scale-out and run on the cloud you start our deployment wizard, fill out your Azure information, deploy, and run you[r] application.
…
An important takeaway from the preceding excerpt is that the Big Data input to “Cloud Numerics” applications must be a partitioned or distributed numeric array. You can load data into distributed arrays with a reader class that implements the Numerics.Distributed.IO.IParallelReader interface, such as the sample Distributed.IO.CSVLoader class.
Note: Source code for the Distributed.IO.CSVLoader class is included in the Cloud Numerics - Examples download, which is described in the Installing the HPC and “Cloud Numerics” Components section below.
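To make the idea of a partitioned array more concrete, the following sketch shows one common way to split the rows of a matrix into contiguous blocks, one per worker process. It is plain C# and does not use the “Cloud Numerics” libraries; the RowPartitionSketch class and RowRange method are hypothetical names chosen for illustration, and the lab's own partitioning scheme may differ.

```csharp
using System;

public static class RowPartitionSketch
{
    // Hypothetical helper: returns { firstRow, rowCount } for the block of rows
    // that worker `rank` (0-based) would own when `totalRows` rows are split
    // into contiguous blocks across `workers` processes.
    public static long[] RowRange(long totalRows, int workers, int rank)
    {
        if (totalRows < 0) throw new ArgumentOutOfRangeException("totalRows");
        if (workers <= 0) throw new ArgumentOutOfRangeException("workers");
        if (rank < 0 || rank >= workers) throw new ArgumentOutOfRangeException("rank");

        long baseCount = totalRows / workers;  // rows that every worker receives
        long remainder = totalRows % workers;  // the first `remainder` workers get one extra row
        long count = baseCount + (rank < remainder ? 1 : 0);
        long start = rank * baseCount + Math.Min(rank, remainder);
        return new long[] { start, count };
    }

    public static void Main()
    {
        // Example: 10 rows split across 4 workers.
        for (int rank = 0; rank < 4; rank++)
        {
            long[] range = RowRange(10, 4, rank);
            Console.WriteLine("Worker {0}: first row {1}, row count {2}", rank, range[0], range[1]);
        }
    }
}
```

A parallel reader built on this idea would have each worker parse only the rows in its own range, so no single process has to hold the whole dataset.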
Ronnie’s Using Data post of 1/20/2012 is a useful reference for array data; it contains the following topics:
A rectangular array of numbers, symbols or expressions is called a matrix. Wikipedia has very detailed Matrix Theory and Linear Algebra topics. Matrix theory is a part of linear algebra.
CSV “Data Transfer” for “Cloud Numerics”
Update 3/17/2012: Ronnie’s “Data Transfer” and “Cloud Numerics” better together article of 3/14/2012 points to Adam Hurwitz’s Support added for Cloud Numerics format post of 3/12/2012 to the Microsoft Codename “Data Transfer” Lab blog:
Microsoft Codename "Cloud Numerics" is a SQL Azure Lab that lets you model and analyze data at scale. Now when you want to upload a file with Microsoft Codename "Data Transfer" to Windows Azure Blob storage for use in Cloud Numerics, you can choose to have the file converted to the Numerics Binary Format. This only applies to CSV and Excel files that contain numerical data ready for analysis with Cloud Numerics.
When uploading to blob you will be presented with a choice of output format.
When you select Numerics Binary Format, you will receive additional options regarding the file that you are uploading.
You can run a model in Windows Azure after the data transfer of the Numerics Binary Format file completes. Here is a C# example of a model that computes Eigen values using Cloud Numerics:
public static void Main(string[] args)
{
    // Step 1: Initialize the Microsoft.Numerics runtime to create
    // and operate on Microsoft Numerics distributed arrays
    // DO NOT REMOVE THIS LINE
    NumericsRuntime.Initialize();

    // Set up the Azure values used in the Data Transfer
    string account = "tbd";    // Azure Account
    string key = "tbd";        // Azure Storage Key
    string container = "tbd";  // Azure Container
    string file = "tbd";       // Name of Numerics Binary Format File

    // Load the Numerics Binary File into a distributed array
    var sr = new SequenceReader(account, key, container, file, 0);
    var berlinAdapterResult = Loader.LoadData<double>(sr);

    long[] shape1 = berlinAdapterResult.Shape.ToArray();
    int ndims1 = berlinAdapterResult.NumberOfDimensions;

    Console.WriteLine("Container: {0}, File name: {1}", container, file);
    for (int i = 0; i < ndims1; i++)
    {
        Console.WriteLine("Dimension {0} has length = {1}", i, shape1[i]);
    }

    // Calculate the Eigen values
    var result = Decompositions.EigenValues(berlinAdapterResult);
    Console.WriteLine("Eigen values :\n {0}", result.ToString());

    // Shut down the Microsoft.Numerics runtime
    NumericsRuntime.Shutdown();

    Console.WriteLine("Numeric Binary Format Successfully Read.");
    Console.WriteLine("Hit enter to continue ... ");
    Console.ReadLine();
}
To learn more about deploying this model, please visit SQL Azure Labs Microsoft Codename “Cloud Numerics”.
The MSCloudNumerics.sln Project Template and Sample Solution
The first “Cloud Numerics” deliverable is a C# project template and sample program for Visual Studio 2010 Professional or Ultimate edition that takes advantage of the following newly available High-Performance Computing (HPC) components, which supersede Microsoft Research’s Dryad and DryadLINQ initiatives for high-performance, parallel computing in the cloud:
- Windows Azure HPC Scheduler, modules and features that developers can use to create Windows Azure deployments that support compute-intensive, parallel applications that can scale when offered more compute power
- HPC Pack 2008 R2 Client Utilities Redistributable Package with Service Pack 3, a stand-alone, and redistributable, installer for the Microsoft HPC Pack 2008 R2 Client Utilities
- HPC Pack 2008 R2 MS-MPI Redistributable Package with Service Pack 3, a stand-alone, and redistributable, installer for the Microsoft MPI (Message Passing Interface) implementation
An MSI installer for the “Cloud Numerics” software sets up the following components for Visual Studio 2010:
- Math, Statistics, and Signal Processing libraries as managed Dynamically Linked Library (DLL) files.
- DLLs for initializing and running jobs on Windows Azure used by the Math, Signal, and Statistics libraries.
- Associated IntelliSense files for the DLLs.
- A project deployment template and utility for deploying your application package to Azure.
I’ve covered the following three earlier SQL Azure Labs with illustrated, multi-part tutorials and overview articles:
- Microsoft Codename "Data Explorer" provides a new way to organize, manage, mash up and gain new insights from your data.
- Microsoft Codename "Data Transfer" is an easy-to-use Web application to move your data into SQL Azure or Windows Azure Blob storage.
- Microsoft Codename "Social Analytics" is an experimental cloud service for integrating social web information into business applications.
All four SQL Azure Labs projects require self-nomination for access. Sign up for “Cloud Numerics” here. The above three projects require an invitation code for access to resources; “Cloud Numerics” doesn’t.
“Cloud Numerics” Prerequisites
The project template and sample program have the following operating system and software prerequisites:
- Windows 7 or Windows Server 2008 R2 SP1, 32 or 64-bit
- Visual Studio 2010 Professional or Ultimate Edition with SP1 with Visual C++ components installed*
- SQL Server 2008 R2 Express or higher
- Windows Azure SDK v1.6 and Windows Azure Tools for Visual Studio, November 2011 or later edition
- A Windows Azure subscription for deploying projects from local, debugging mode to Windows Azure.
If any of the following obsolete components are present, installation will appear to succeed but you probably won’t be able to open a new “Cloud Numerics” project:
- Microsoft HPC Pack 2008 R2 Azure Edition
- Microsoft HPC Pack 2008 R2 Client Components
- Microsoft HPC Pack 2008 R2 MS-MPI Redistributable Pack
- Microsoft HPC Pack 2008 R2 SDK
- Windows Azure SDK v1.5
- Windows Azure AppFabric v1.5
- Windows Azure Tools for Microsoft VS2010 1.5
*Update 1/28/2012: The build script expects to find the msvcp100.dll and msvcr100.dll files installed by VS2010 in the C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\redist\x64\Microsoft.VC100.CRT folder and the msvcp100d.dll and msvcr100d.dll debug versions in the C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\redist\Debug_NonRedist\x64\Microsoft.VC100.DebugCRT folder. If these files aren’t present in the specified locations, the Windows Azure HPC Scheduler will fail when attempting to run MSCloudNumericsApp.exe as Job 1.
After you install VS 2010 SP1, attempts to use setup’s Add or Remove Features option to add the VC++ compilers fail.
Note: You only need to take the following steps if you intend to submit the application to Windows Azure and you don’t have the Visual C++ compilers installed:
- Download the Microsoft Visual C++ 2010 SP1 Redistributable Package (x64) (vcredist_x64.exe) to a well-known location
- Run vcredist_x64.exe to add the msvcp100.dll, msvcp100d.dll, msvcr100.dll and msvcr100d.dll files to the C:\Windows\System32 folder.
- Create a C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\redist\x64\Microsoft.VC100.CRT folder and copy the msvcp100.dll and msvcr100.dll files to it.
- Create a C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\redist\Debug_NonRedist\x64\Microsoft.VC100.DebugCRT folder and copy the msvcp100d.dll and msvcr100d.dll files to it.
Stay tuned for more details about this issue.
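Before submitting a job, it can be handy to confirm that the four runtime files are where the build script looks for them. The following is a minimal sketch that uses only System.IO; the VcRuntimeCheck class and FindMissing method are hypothetical names, and the Visual Studio root folder is assumed to be the default install location on 64-bit Windows.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class VcRuntimeCheck
{
    // Folder (relative to the Visual Studio root) and file name pairs,
    // taken from the prerequisite notes in this section.
    private static readonly string[][] Expected =
    {
        new[] { @"VC\redist\x64\Microsoft.VC100.CRT", "msvcp100.dll" },
        new[] { @"VC\redist\x64\Microsoft.VC100.CRT", "msvcr100.dll" },
        new[] { @"VC\redist\Debug_NonRedist\x64\Microsoft.VC100.DebugCRT", "msvcp100d.dll" },
        new[] { @"VC\redist\Debug_NonRedist\x64\Microsoft.VC100.DebugCRT", "msvcr100d.dll" }
    };

    // Returns the full paths of the expected files that do not exist under vsRoot.
    public static List<string> FindMissing(string vsRoot)
    {
        var missing = new List<string>();
        foreach (string[] entry in Expected)
        {
            string path = Path.Combine(Path.Combine(vsRoot, entry[0]), entry[1]);
            if (!File.Exists(path))
            {
                missing.Add(path);
            }
        }
        return missing;
    }

    public static void Main()
    {
        // Assumed default location; adjust if Visual Studio is installed elsewhere.
        string vsRoot = @"C:\Program Files (x86)\Microsoft Visual Studio 10.0";
        List<string> missing = FindMissing(vsRoot);

        if (missing.Count == 0)
        {
            Console.WriteLine("All four VC++ runtime files were found.");
        }
        else
        {
            foreach (string path in missing)
            {
                Console.WriteLine("Missing: " + path);
            }
        }
    }
}
```

The check only tests for the presence of the files; it does not verify their version or bitness.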
Installing the HPC and “Cloud Numerics” Components
Follow the instructions in the Microsoft Codename "Cloud Numerics" wiki article’s “Software Requirements” section to install the four components listed earlier.
Note: Links to http://connect.microsoft.com/ in the wiki article won’t work because you don’t receive an invitation code to enter.
Go directly to the "Cloud Numerics" Microsoft Connect Site to download:
- Installer for "Cloud Numerics": Downloads the MSI installer for "Cloud Numerics." Make sure to also download the documentation for "Cloud Numerics" (see below).
- Release Notes
- Getting Started with "Cloud Numerics": This document walks you through the installation and deployment process. By the end you should have a "Cloud Numerics" application running on Azure. For online version see here.
- Documentation for "Cloud Numerics": Documentation in the form of a Windows Help file (.chm). Download the file, save it in an easy-to-remember location, and open it by double-clicking the file (see below).
- "Cloud Numerics" architecture white paper: White paper describing the architecture and technology behind "Cloud Numerics"
- Example applications: Three end-to-end examples: Latent Semantic Indexing, Statistics, and Time-series
The wiki article’s Simple Examples section includes several example programs that you can run by replacing the code in the MSCloudNumerics project’s Sample.cs file. (See the “Running the MSCloudNumerics Sample Project Locally” section below.)
“Cloud Numerics” Mathematic Libraries for .NET
The CloudNumericsLab.chm help file provides the details of the Microsoft.Numerics classes and their members’ syntax, categorized by namespace:
Note: The sample application that follows uses the Cholesky Decomposition. You can replace the code in the MSCloudNumerics sample application’s Program.cs file with sample code from the help files.
This table from the TechNet wiki article describes the Cloud Numerics Mathematic Libraries for .NET:
“Cloud Numerics” Distributed Array, Algorithm and Runtime Libraries for .NET
This table from the TechNet wiki article describes the Cloud Numerics Distributed Array and Runtime Libraries for .NET.
Limitations of “Cloud Numerics”
From Ronnie’s The “Cloud Numerics” Programming and runtime execution model post of 1/11/2012:
First, the “Cloud Numerics” programming model is primarily based around distributed array operations (c.f. data parallel or SIMD-style of programming). Certain relational operations such as “selects” with user-defined functions or complex joins are simpler to express on top of languages such as Pig, Hive and SCOPE. Similarly, while “Cloud Numerics” is designed to deal with large data sets, it is currently constrained to operate on arrays that can fit in the main memory of a cluster. On the other hand, data on disk can be pre-processed via existing “big data” processing tools and ingested into a “Cloud Numerics” application for further processing.
Second, “Cloud Numerics” is not just a convenient C# wrapper around message-passing libraries such as MPI, for example MPI.NET [3]; all aspects of parallelism are expressed via operations on distributed arrays and the “Cloud Numerics” runtime transparently handles the efficient execution of these high-level array operations on a cluster.
A key aspect that distinguishes “Cloud Numerics” from parallelization techniques such as PLINQ and DryadLINQ [4], that are based on implementing a custom LINQ provider, is that parallelization in “Cloud Numerics” occurs purely at runtime and does not involve any code generation from (say) LINQ expression trees; a user’s application can be developed as a regular .NET application by referencing the “Cloud Numerics” runtime and library DLLs and executed on the cluster in Azure.
Finally, the underlying communication layer in “Cloud Numerics” is built on top of the message passing interface (MPI) and does inherit some of the limitations in the underlying implementation such as:
- The process model is currently inelastic; once a “Cloud Numerics” application has been launched on (say) P cores in a cluster, it is not possible to dynamically grow or shrink the resources as the application is running.
- The implementation is not resilient against hardware failure. Unlike frameworks like Hadoop that are designed explicitly to operate on unreliable hardware, if one or more nodes in a cluster fails, it is not possible for a “Cloud Numerics” application to automatically recover and continue executing.
On the other hand, having MPI as the underlying communication layer in the “Cloud Numerics” runtime does endow it with certain advantages. For instance, “Cloud Numerics” applications can automatically take advantage of high-speed interconnects such as Infiniband between nodes in a cluster and optimizations such as zero-copy memory transfers and shared-memory-aware collectives within a single multi-core node. More importantly, array operators in “Cloud Numerics” can leverage the vast ecosystem of high-performance distributed memory numerical libraries such as ScaLAPACK built on top of MPI.
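The in-memory constraint quoted above can be turned into a rough sizing check before you choose a cluster size. The sketch below is plain C# with hypothetical names (ArrayMemoryEstimate, DenseDoubleBytes, FitsInCluster); it assumes a dense array of 8-byte doubles and ignores runtime overhead, temporaries created by array operations, and memory used by the operating system, so treat the result as an optimistic lower bound.

```csharp
using System;

public static class ArrayMemoryEstimate
{
    // Bytes needed to hold a dense rows x cols array of doubles (8 bytes each).
    public static long DenseDoubleBytes(long rows, long cols)
    {
        return checked(rows * cols * sizeof(double));
    }

    // True if the array fits in the combined usable memory of all nodes.
    // Overheads are deliberately ignored in this sketch.
    public static bool FitsInCluster(long rows, long cols, int nodes, long usableBytesPerNode)
    {
        return DenseDoubleBytes(rows, cols) <= checked(nodes * usableBytesPerNode);
    }

    public static void Main()
    {
        long rows = 100000;
        long cols = 100000;
        long fourGigabytes = 4L * 1000 * 1000 * 1000;  // hypothetical usable memory per node

        Console.WriteLine("Array size in bytes: {0}", DenseDoubleBytes(rows, cols));
        Console.WriteLine("Fits on 16 nodes: {0}", FitsInCluster(rows, cols, 16, fourGigabytes));
        Console.WriteLine("Fits on 32 nodes: {0}", FitsInCluster(rows, cols, 32, fourGigabytes));
    }
}
```

A 100,000 x 100,000 array of doubles needs 80 billion bytes, so under these assumptions it would not fit on 16 nodes with 4 GB usable each but would fit on 32.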
Running the MSCloudNumerics Sample Project Locally
1. Launch Visual Studio, choose New, Project, Visual C#, and select Microsoft Cloud Numerics Application:
2. Click OK to generate a new MSCloudNumerics1 console project and press F5 to run it. Mark the Windows Security Alert’s Private Networks check box:
Note: The firewall must permit interprocess communication between cores on your machine in the form of network calls to localhost.
3. Click Allow Access for the application to close the dialog and repeat step 2 for the HPC MPI Process manager.
4. Click Allow Access to close the dialog. The console displays the dimensions of the distributed array processed by the following code:
Note: Wikipedia has more information about the Cholesky Decomposition.
5. While the application is running, launch Task Manager and display the CPU cores’ usage:
Note: The lab release of the local distributed application runs on a maximum of two cores. Microsoft states that you will be able to specify the number of cores in future versions. My development computer runs Windows 7 on a 2.83 GHz Q9550 Intel Core 2 Quad CPU on a DQ45CB motherboard with 8 GB of RAM.
6. Press Enter to close the console.
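For readers unfamiliar with the Cholesky decomposition mentioned in step 4, here is a minimal single-machine sketch in plain C#. It is not the code generated by the project template and does not use the Microsoft.Numerics libraries; the CholeskySketch class and Decompose method are hypothetical names. The method factors a symmetric, positive-definite matrix A into a lower-triangular matrix L such that A = L * L^T.

```csharp
using System;

public static class CholeskySketch
{
    // Returns the lower-triangular matrix L with a = L * L^T.
    // `a` must be square, symmetric and positive definite.
    public static double[,] Decompose(double[,] a)
    {
        int n = a.GetLength(0);
        if (a.GetLength(1) != n) throw new ArgumentException("Matrix must be square.");

        var l = new double[n, n];
        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j <= i; j++)
            {
                // Subtract the contribution of the columns already computed.
                double sum = a[i, j];
                for (int k = 0; k < j; k++)
                {
                    sum -= l[i, k] * l[j, k];
                }

                if (i == j)
                {
                    if (sum <= 0.0) throw new ArgumentException("Matrix is not positive definite.");
                    l[i, i] = Math.Sqrt(sum);  // diagonal entry
                }
                else
                {
                    l[i, j] = sum / l[j, j];   // entry below the diagonal
                }
            }
        }
        return l;
    }

    public static void Main()
    {
        // A small textbook example whose factor has rows (2, 0, 0), (6, 1, 0), (-8, 5, 3).
        double[,] a =
        {
            {   4,  12, -16 },
            {  12,  37, -43 },
            { -16, -43,  98 }
        };

        double[,] l = Decompose(a);
        for (int i = 0; i < 3; i++)
        {
            Console.WriteLine("{0} {1} {2}", l[i, 0], l[i, 1], l[i, 2]);
        }
    }
}
```

This version runs on one core and holds the whole matrix in one process; the point of “Cloud Numerics” is to apply the same kind of operation to arrays that are partitioned across the cores or nodes of a cluster.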
The application’s references include the Microsoft.Numerics namespaces from C:\Program Files\Microsoft Numerics\v0.1\Bin:
References
Ronnie Hoogerwerf’s and Roope Astala’s articles from the Microsoft Codename “Cloud Numerics” blog, in chronological order:
- Announcing Microsoft Codename “Cloud Numerics” (1/10/2012)
- The “Cloud Numerics” Programming and runtime execution model (1/11/2012)
- “Cloud Numerics” Example: Latent Semantic Indexing and Analysis (1/13/2012)
- “Cloud Numerics” Example: Distributed Numerics on Azure with F# (1/16/2012)
- “Cloud Numerics” Example: Using the IParallelReader Interface (1/18/2012)
- Using Data (1/20/2012)
- “Cloud Numerics” Example: Statistics Operations to Azure Data (1/30/2012)
- FAQs, Best Practices, Issues and Workarounds (2/6/2012)
- “Cloud Numerics” Example: Analyzing Air Traffic “On-Time” Data (3/8/2012)
- “Data Transfer” and “Cloud Numerics” better together (3/14/2012)
My (@rogerjenn) Deploying “Cloud Numerics” Sample Applications to Windows Azure HPC Clusters of 1/25/2012 describes how to configure and deploy two 8-core HPC clusters hosted in Windows Azure and submit the Latent Semantic Indexing (LSICloudApplication) project to the Windows Azure HPC Scheduler for processing.
Stay tuned for additional tutorials detailing local execution of the Statistics and Time-series applications, as well as deployment of these sample projects to Windows Azure.