In addition to the basic MSCloudNumerics Visual Studio template and “Cloud Numerics” sample application described in my Introducing Microsoft Codename “Cloud Numerics” from SQL Azure Labs post of 1/23/2012, the “Cloud Numerics” Microsoft Connect site’s Example applications download offers three additional end-to-end examples:
- Latent Semantic Indexing (LSI) document classification example
- Statistics functionality demonstration
- Time-series analysis of cereal yield data
This post describes how to configure and deploy an HPC cluster with two 8-core compute nodes hosted in Windows Azure and submit the LSICloudApplication to the Windows Azure HPC Scheduler for processing. Configuring, deploying and submitting the other two examples differs only slightly.
Update 1/30/2012: Added review and link to a PDF copy of David Skillicorn’s Understanding Complex Data Sets: Data Mining with Matrix Decomposition (Chapman & Hall/CRC, 2007) in the “Learning about the Latent Semantic Indexing Example” section below.
Update 1/28/2012: Added step 2 to the “Extracting and Running the Latent Semantic Indexing Example” section to change build configuration from the default Debug to Release. This change prevents failure of the job submitted to the Windows Azure HPC Scheduler in step 33.
Note: See the “‘Cloud Numerics’ Prerequisites” section (updated 1/28/2012) of the Introducing Microsoft Codename “Cloud Numerics” from SQL Azure Labs post if your Visual Studio 2010 installation doesn’t have the Visual C++ features installed.
Table of Contents
- Creating a Windows Azure Pay-As-You-Go Subscription
- Learning about the Latent Semantic Indexing Example
- Extracting and Running the Latent Semantic Indexing Example
- Configuring and Deploying the Windows Azure HPC Cluster
- Using the Windows Azure HPC Scheduler Web Portal
- Running the Latent Semantic Indexing Example Locally
Creating a Windows Azure Pay-As-You-Go Subscription
The example applications store their results in Windows Azure blobs, so you must have a Windows Azure subscription with a Windows Azure storage account and SQL Azure database to run these projects. Unlike other SQL Azure Labs CTPs, such as Apache Hadoop-based Services for Windows Azure (which is free), users must provide and pay for their own resources. You can’t take advantage of the Windows Azure 3 Month Free Trial because it provides only 750 hours per month of a small compute instance, and trial subscriptions don’t let you exchange those hours for compute instances of other sizes. For these reasons, “Cloud Numerics” requires a pay-as-you-go subscription.
This tutorial assumes that you have a pay-as-you-go subscription. If you don’t, go to the Account page, click Add Subscription and sign up with your credit card.
A deployed Codename “Cloud Numerics” application with the minimum number of head (1), compute (2) and Web FrontEnd (1) nodes will cost you US$2.16/hour while it’s deployed. Compute nodes are Extra Large instances, which use 8 cores each, so 18 cores (16 for the compute nodes plus 2 for the head and Web FrontEnd nodes) are deployed at US$0.12 per core-hour. You also will be charged about US$0.33/day (US$9.99/month) for a 1-GB SQL Azure Web edition database.
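The arithmetic behind these figures can be checked with a short Python sketch. The rates and node counts come from the paragraph above. The one-core size for the head and Web FrontEnd nodes is an inference from the 18-core total, not something the deployment utility states, and the function names are made up for illustration.

```python
# Figures quoted in this post.
CORE_HOUR_RATE_USD = 0.12       # US$ per core-hour
CORES_PER_COMPUTE_NODE = 8      # Extra Large instance
SQL_AZURE_MONTHLY_USD = 9.99    # 1-GB Web edition database

# Assumption: head and Web FrontEnd nodes use one core each, which is what
# makes the 18-core total work out with two 8-core compute nodes.
CORES_PER_HEAD_NODE = 1
CORES_PER_FRONTEND_NODE = 1


def total_cores(head=1, compute=2, frontend=1):
    """Cores consumed by a deployment with the given node counts."""
    return (head * CORES_PER_HEAD_NODE
            + compute * CORES_PER_COMPUTE_NODE
            + frontend * CORES_PER_FRONTEND_NODE)


def hourly_compute_cost(cores):
    """Compute charge in US$ per hour, rounded to cents."""
    return round(cores * CORE_HOUR_RATE_USD, 2)


def daily_database_cost(days_in_month=30):
    """Approximate database charge in US$ per day, rounded to cents."""
    return round(SQL_AZURE_MONTHLY_USD / days_in_month, 2)


cores = total_cores()
print(cores, hourly_compute_cost(cores), daily_database_cost())
```

Multiplying the hourly figure by the number of hours you expect to leave the cluster deployed gives a rough estimate of the compute bill for a test run.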
Note: You can run the LSICloudApplication project locally with a 3-Month Trial Subscription’s Windows Azure storage account. See the “Running the Latent Semantic Indexing Example Locally” section near the end of this post for details.
1. Select your pay-as-you-go subscription and click the Manage button to open the Windows Azure Management portal.
2. Click the Hosted Services, Storage Accounts & CDN button and the Active button in the Deployment Health Status column to display your active Subscription(s):
Note: You will use your Subscription ID in the “Configuring and Deploying the Windows Azure HPC Cluster” section.
Learning about the Latent Semantic Indexing Example
Ronnie Hoogerwerf’s “Cloud Numerics” Example: Latent Semantic Indexing and Analysis post of 1/13/2012 to the Microsoft Codename “Cloud Numerics” blog describes this example as follows:
How do you find similar or related records or documents within a collection of unstructured data or a text corpus? One classical solution to this problem is Latent Semantic Indexing (LSI). In this blog post we’ll walk you through the steps of applying the LSI technique. We will use a subset of public SEC 10-K filings by companies as the text corpus to analyze. We will use LSI to capture similarities between these companies based on the information in their 10-K filings. [Wikipedia link added.]
LSI is a method grounded in linear algebra and Singular Value Decomposition (SVD*). The “Cloud Numerics” library offers an implementation of SVD that works on distributed arrays. This distributed array support enables the LSI solution to be scaled out over a cluster of Windows Azure nodes. [Wikipedia link added.]
This example focuses on the data computation and the examination of the correlated results. It does not discuss the one-time preparation to clean, tokenize, and stem each document in order to assemble the word counts. These pre-processing steps are well suited to a map/reduce framework, such as Hadoop. The preprocessed dictionaries of terms and term counts are available by way of the following Windows Azure storage containers:
The C# LSI example constructs and operates on a term-document matrix, an array where columns correspond to documents, rows correspond to individual terms, and the elements correspond to logarithmically scaled frequencies of terms in each document.
Ronnie continues with details of the seven steps involved in performing the calculations and writing the result to a Windows Azure blob.
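To make the term-document matrix concrete, here is a minimal pure-Python sketch rather than the C# code from the example. It assumes whitespace tokenization and log(1 + count) as the logarithmic scaling; the exact scaling and preprocessing used by the “Cloud Numerics” sample aren’t given in this post, and the three tiny “documents” are made up. The sketch compares document columns directly with cosine similarity. LSI proper would first reduce the matrix with a truncated SVD, which is omitted here.

```python
import math
from collections import Counter


def build_term_document_matrix(docs):
    """Return (terms, matrix): one row per term, one column per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    # Assumption: log(1 + count) is used as the logarithmic scaling.
    matrix = [[math.log(1 + c[term]) for c in counts] for term in terms]
    return terms, matrix


def column(matrix, j):
    """Extract document j as a vector of scaled term frequencies."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up, already tokenized "documents" (not taken from the 10-K corpus).
docs = [
    "bank loan interest loan",
    "bank interest rate",
    "drilling oil gas oil",
]
terms, matrix = build_term_document_matrix(docs)
print(terms)
print(cosine_similarity(column(matrix, 0), column(matrix, 1)))
print(cosine_similarity(column(matrix, 0), column(matrix, 2)))
```

The first two documents share terms and get a positive similarity, while the third shares none with the first and scores zero. The SVD step in LSI exists to surface relationships that this kind of direct term overlap misses.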
* Update 1/30/2012: SVD is one of the topics of David Skillicorn’s Understanding Complex Data Sets: Data Mining with Matrix Decomposition (Chapman & Hall/CRC, 2007), which Alexander Stojanovic (@stojanovic) reviewed in his Understanding Complex Datasets through Matrix Decomposition post of 12/31/2011:
David Skillicorn's book on matrix decomposition techniques is superb. I especially enjoyed his coverage on non-negative matrix factorization (NNMF) techniques and eigendecomposition (i.e. spectral techniques). I would recommend the book to those interested in data mining and knowledge extraction. The techniques cover a wide range of media and are not simply restricted to relational datasets and textual documents.
The treatment of PageRank is concise and articulate: demonstrating the deep relationship between graph mining and learning techniques and matrix decomposition (SVD amongst others) techniques that make search engines such as Google and Bing possible.
As a reviewer summarized, "The author explores the deep connections between matrix decompositions and structures within graphs, relating the PageRank algorithm of Google's search engine to singular value decomposition. He also covers dimensionality reduction, collaborative filtering, clustering, and spectral analysis. With numerous figures and examples, the book shows how matrix decompositions can be used to find documents on the Internet, look for deeply buried mineral deposits without drilling, explore the structure of proteins, detect suspicious emails or cell phone calls, and more."
Extracting and Running the Latent Semantic Indexing Example
To run the Latent Semantic Indexing example in Visual Studio 2010 Professional or higher, do the following:
1. Extract LSICloudApplication.zip to the \My Documents\Visual Studio 2010\Projects folder:
Extracting the files creates a \My Documents\Visual Studio 2010\Projects\LSICloudApplication subfolder.
2. Choose Build, Configuration Manager to open the Configuration Manager dialog, select Release in the Active Solution Configurations list, and click OK to change the build configuration from Debug to Release:
Note: For currently unknown reasons, submitting a job built in the Debug configuration fails. This problem is under investigation. This step was added on 1/28/2012.
3. Press F5 to build and run the application for the first time. If you receive errors and warnings as partially shown here due to missing references in the DictionaryReader.cs component, you probably have installed the project to C:\Users\UserName\My Documents\Visual Studio 2010\Projects\LSICloudApplication\LSICloudApplication, instead of C:\Users\UserName\My Documents\Visual Studio 2010\Projects\LSICloudApplication. Copy the contents of the installation folder to the first LSICloudApplication folder, overwriting the contents of the subfolder.
4. Verify that the AppConfigure project is set as the startup project and press F5 to run the project again. After a minute or two the following Cloud Numerics Deployment Utility dialog opens:
Configuring and Deploying the Windows Azure HPC Cluster
5. Open the Windows Azure Management Portal, click the Hosted Services, Storage Accounts & CDN button and the Active button in the Deployment Health Status column to display your active Subscription(s).
6. Select the subscription to use for this and the following examples and copy the Subscription ID to the clipboard (see step 2 of the “Creating a Windows Azure Pay-As-You-Go Subscription” section).
7. Paste the value into the Subscription ID text box.
8. Click the Cloud Numerics Deployment Utility dialog’s Create button to create a new Management Certificate for the service from the Certificate Name dialog. Accept the default Certificate Name, click Browse to open the Specify a New .cer File dialog, and navigate to a well-known location to save the file:
Note: You can have up to 10 Management Certificates for a hosted service.
9. Click Open and Save to save the certificate file and return to the Specify a New .cer File dialog:
10. Click OK to close the dialog and open the Security Warning dialog:
Note: This certificate will appear in the Trusted Root Certification Authorities list of Internet Explorer Tools’ Certificates dialog:
11. Click Yes to install the self-signed certificate, close the dialog, and confirm the certificate:
12. Click OK to close the message box and return to the Deployment Utility.
13. Return to the Windows Azure Management Portal, click the Management Certificates button, select your pay-as-you-go subscription in the list, and click the Add Certificate button to open the Add New Management Certificate dialog:
14. Click the Browse button to open the Open dialog, navigate to the folder in which you saved the certificate in step 8, and double-click MicrosoftCloudNumerics.cer to close the dialog and add it as the Certificate file:
15. Click OK to close the dialog and, after the certificate’s status is Created, return to the Deployment Utility dialog.
16. Type a globally unique DNS prefix name for the service, oakleaf1lsi for this example, in the Service Name text box:
17. Open the Location list and choose a region in which to host the service, North Central US for the LSI example:
Note: If you encounter an error message at this point, wait a minute or two and try again.
18. Click Next to activate the Cluster in Azure tab, type an existing or new name for the Cluster Administrator, a complex password and password confirmation, accept 1 Head node and 1 Web FrontEnd node, and specify 2 (the minimum) for Compute nodes, which are Extra Large Instances having 8 cores each:
Note: If you select 3 Compute Nodes, you receive the following message near the end of the deployment process because 3 compute nodes alone consume 24 cores, which exceeds the default 20-core quota:
If you want to increase the default 20-core quota for your pay-as-you-go subscription, contact Windows Azure Billing Support with Quota Increase as the Support Topic and Windows Azure Compute Instances as the subtopic:
19. Click Next to move to the SQL Azure Server settings, which specify the database server to associate with your cluster. Accept the defaults shown here, unless you want to customize the server instance:
Note: Deploying a SQL Azure server isn’t optional. The SQL Azure server’s DNSPrefixName database, oakleaf1lsi for this example, stores deployment metadata in 54 tables, one of which is named Zombie:
20. Click the Deployment Action frame’s Configure Cluster button to start the cluster configuration process:
21. Return to the Windows Azure Management Portal, click the Storage Accounts button and confirm that the oakleaf1lsi storage account has been created:
Note: Clicking the Primary Access Key’s View button displays it in this dialog (partially obscured for security here) and lets you copy it to the clipboard:
You will use the storage account Name and Primary Access Key values later in this section.
22. Click the Hosted Services button and verify that the oakleaf1lsi.cloudapp.net hosted service and its service certificate have been created:
Note: Cores used = 0 because the services haven’t been deployed yet.
23. When the Configuration Complete message appears, click the Deploy Cluster button. The progress bar displays the following deployment operations:
Note: It took me a loooong time (about an hour and 15 minutes) to upload the 205 MB HPC package to blob storage on a slow (410 kbps up) DSL line:
24. When deployment is complete, the Deployment Utility dialog appears as follows:
25. Click Check Cluster Status to display the current condition:
26. Return to the Windows Azure Management Portal and verify the Hosted Services deployment with 18 cores being used at a cost of US$2.16 per hour:
27. Use Cerebrata’s Cloud Storage Studio or a similar application to inspect the uploaded cluster blob:
Running the LSICloudApplication locally created the lsiresult blob. You must run the application locally to create the *.exe and copy the required libraries locally, as described in the “Running the Latent Semantic Indexing Example Locally” section near the end of this post. Return here after completing the operation.
28. Click the Deployment Utility’s Application Code tab, click the Browse button, and navigate to the \My Documents\Visual Studio 2010\Projects\LSICloudApplication\LSICloudApplication\MSCloudNumericsApp\bin\Release folder (the output folder for the Release configuration selected in step 2). Double-click the MSCloudNumericsApp.exe executable file to add its path and filename to the text box, and then click the Submit Job button:
The status bar appears as follows while uploading the MSCloudNumericsApp.exe file to blob storage:
29. When the Job Successfully Submitted message appears, close the dialog.
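Two of the notes in the steps above come down to simple arithmetic: whether a given number of compute nodes fits the default 20-core quota (step 18), and roughly how long the 205 MB package takes to upload on a 410 kbps line (step 23). The following Python sketch works through both. It assumes one core each for the head and Web FrontEnd nodes (inferred from the 18-core total), decimal units (1 MB = 1,000,000 bytes, 1 kbps = 1,000 bits/s) and no protocol overhead, so the upload estimate is a lower bound. The function names are illustrative only.

```python
DEFAULT_CORE_QUOTA = 20  # default quota for a pay-as-you-go subscription


def cores_required(compute_nodes, cores_per_compute=8,
                   head_cores=1, frontend_cores=1):
    """Total cores for one head node, one Web FrontEnd node and N compute nodes.

    Assumption: the head and Web FrontEnd nodes use one core each.
    """
    return head_cores + frontend_cores + compute_nodes * cores_per_compute


def fits_quota(compute_nodes, quota=DEFAULT_CORE_QUOTA):
    """True if the deployment stays within the subscription's core quota."""
    return cores_required(compute_nodes) <= quota


def upload_minutes(size_mb, uplink_kbps):
    """Ideal transfer time in minutes, ignoring protocol overhead."""
    bits = size_mb * 1_000_000 * 8
    seconds = bits / (uplink_kbps * 1_000)
    return seconds / 60


for nodes in (2, 3):
    print(nodes, cores_required(nodes), fits_quota(nodes))
print(round(upload_minutes(205, 410), 1))
```

The ideal figure comes out a little under the hour and 15 minutes observed in step 23, which is what you would expect once storage-service and TCP overhead are added.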
Using the Windows Azure HPC Scheduler Web Portal
30. To check the status of the jobs submitted, type the URL for the hosted service (https://oakleaf1lsi.cloudapp.net for this example) and click the Continue to Web Site link to accept the self-signed certificate and open the Sample Application page:
31. Click the Windows Azure HPC Scheduler Web Portal link and type your administrative user name and password in the Windows Security dialog:
32. Click OK to open the HPC Scheduler Web Portal. You might receive the following and another non-fatal error message:
33. Click the All Jobs button once or twice to display their current state. The MSCloudNumerics job finished execution in this example:
34. Click the link to MSCloudNumericsApp.exe to open a detail page:
35. Click the View Tasks tab to display more information about the task:
The complete Summary entry is:
-------------------------- Summary --------------------------
3 Nodes succeeded
0 Nodes failed
36. If you don’t want to be charged US$2.16 per hour for compute instances and US$0.33/day for the database, return to the Windows Azure Management Portal and delete the Deployment for Windows Azure and hosted service (oakleaf1lsi for this example), as well as the database created during the configuration process.
Running the Latent Semantic Indexing Example Locally
The Introducing Microsoft Codename “Cloud Numerics” from SQL Azure Labs post’s MSCloudNumerics solution is self-contained and uses locally-generated matrices with random values as its input data. In local mode, the LSICloudApplication downloads text data from http://cloudnumericslab.blob.core.windows.net’s public financialdocs container, processes it with the local copy of the Microsoft HPC Pack 2008 R2 MS-MPI Redistributable Pack (Mpiexec.exe), and uploads the results to the Windows Azure storage account you created in the preceding section.
The AzureSampleService project contains a Roles node with ComputeNode, FrontEnd and HeadNode subnodes, as well as ServiceConfiguration.Cloud.cscfg and ServiceConfiguration.Local.cscfg configuration files. The latter is used when running the LSICloudApplication locally.
Correcting Erroneous Debug Command Line Arguments and Working Directory Values
Early versions of the example programs specify a user named roastala (Roope Astala, a program manager on the Codename “Cloud Numerics” team) in the LSICloudApplication project’s Debug properties page’s Command Line Arguments and Working Directory property values. To determine if your version has this problem, do the following:
1. Right-click the LSICloudApplication node and choose Set As Startup Project.
2. Press F5 and wait for the console window to appear.
3. If you see the following message:
3a. Click OK to close the message box, right-click the LSICloudApplication node and choose Properties to open its properties pages.
3b. Replace the two instances of roastala in the text boxes with your user name.
3c. Press Ctrl+Shift+B to build the solution with the new user name.
Specifying the myAccountKey and myAccountName Variable Values in Program.cs
4. Open the LSICloudApplication project’s Program.cs file and scroll to line 224. Replace myAccountKey with the storage account’s Primary Access Key value, which you can copy to the clipboard as shown in the note to step 21 of the “Configuring and Deploying the Windows Azure HPC Cluster” section, and replace myAccountName with the name of your storage account, oakleaf1lsi for this example.
5. Press F5 to run the project locally. Allow firewall access for two processes, if this is the first time you’ve run an HPC project locally. After a minute or so, mpiexec.exe’s command window appears as shown here:
After a few more minutes, four additional lines appear:
When Mpiexec.exe completes execution a few minutes later and sends its data to a blob in Windows Azure storage, the Command Prompt window replaces Mpiexec.exe’s window.
6. Close the command prompt and type https://accountName.blob.core.windows.net/lsiresult/lsiresult in IE’s address bar to display the results from the U.S. SEC’s 10-K reports in comma-separated values (*.csv) format:
7. Optionally, save the data as a *.txt file, change the file extension to *.csv and open it in Excel for further analysis:
8. Right-click the AppConfigure project and choose Set As Startup Project to re-enable deployment to Windows Azure.
9. If you were in the process of configuring and deploying LSICloudApplication to Windows Azure when you ran the program locally, press F5 to restart the program, click the Check Cluster Status button to enable the Submit Job button and continue at the “Configuring and Deploying the Windows Azure HPC Cluster” section’s step 28 to upload the application code and libraries to the Windows Azure HPC Scheduler.
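If you would rather inspect the results from steps 6 and 7 with a script than with IE and Excel, the following Python sketch shows the two pieces involved: building the result blob’s URL from your storage account name and parsing comma-separated text into rows. The function names are hypothetical, and the sample rows are placeholders, not the actual layout of the lsiresult blob, which this post doesn’t describe. Downloading the blob (for example with urllib.request.urlopen) is left out because it depends on the container being publicly readable.

```python
import csv
import io


def result_blob_url(account_name, container="lsiresult", blob="lsiresult"):
    """URL of the result blob, following the pattern shown in step 6."""
    return f"https://{account_name}.blob.core.windows.net/{container}/{blob}"


def parse_results(csv_text):
    """Split comma-separated text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(csv_text)) if row]


# Placeholder text standing in for the downloaded blob contents.
sample_text = "doc1,0.5,0.25\n\ndoc2,0.75,0.125\n"

print(result_blob_url("oakleaf1lsi"))
for row in parse_results(sample_text):
    print(row)
```

Every field comes back as a string, so convert the numeric columns with float() before doing any further analysis.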