Wednesday, October 12, 2011

Ted Kummert at PASS Summit: Hadoop-based Services for Windows Azure CTP to Release by End of 2011

Ted Kummert announced on 10/12/2011 in his PASS Summit 2011 keynote a partnership with Hortonworks to port Apache Hadoop to Windows Azure by the end of 2011. From the Microsoft Expands Data Platform With SQL Server 2012, New Investments for Managing Any Data, Any Size, Anywhere press release of the same date:

Microsoft is committed to helping customers manage any data, any size, anywhere with the SQL Server data platform, Windows Server and Windows Azure. Hortonworks has a rich history in leading the design and development of Apache Hadoop. Their experience and expertise in this space helps us accelerate the delivery of our Hadoop-based distribution on Windows Server and Windows Azure while maintaining compatibility and interoperability with the broader ecosystem.

•• Updated 10/16/2011 with a Hadoop’s civil war: Does it matter who contributes most? post of 10/7/2011 by GigaOM’s Derrick Harris (@derrickharris) about competition between Hortonworks and Cloudera (see end of post). Added logos.

• Updated 10/12/2011 1:10 PM PDT with a post by Gianugo Rabellino (@gianugo) to the Port 25 blog. See end of post.

Ted posted Microsoft Expands Data Platform to Help Customers Manage the ‘New Currency of the Cloud’ at 9:00 AM:

This morning, I gave a keynote at the PASS Summit 2011 here in Seattle, a gathering of about 4,000 IT professionals and developers worldwide. I talked about Microsoft’s roadmap for helping customers manage and analyze any data, of any size, anywhere -- on premises, and in the private or public cloud.

Microsoft makes this possible through SQL Server 2012 and through new investments to help customers manage ‘big data’, including an Apache Hadoop-based distribution for Windows Server and Windows Azure and a strategic partnership with Hortonworks. Our announcements today highlight how we enable our customers to take advantage of the cloud to better manage the ‘currency’ of their data.

We often talk about the economics of the cloud, detailing how customers can achieve unmatched economies of scale by taking advantage of public or private cloud architectures. As an example, an enterprise with a small incubation project could theoretically take it to production overnight, thanks to the elasticity and scalability benefits of the cloud.

As we turn more and more to the cloud, data becomes its currency. The exchange of data is the heart of all cloud transactions, and, as in a real-world economy, more value is created whenever data is generated or consumed. But there are new business challenges that this currency creates: How do we deal with the scope and scale of the data we manage? How do we deal with the diversity of types and sources of data? How do we most efficiently process and gain insight from datasets ranging from megabytes to petabytes?

How do we bring the world’s data to bear on the tasks of the enterprise, as businesses ask themselves questions like: “What can data from social media sites tell me about the sentiment of my brands and products?” And, how do we enable all end-users to gain the critical business insights they need – no matter where they are and what device they are using? Customers need a data platform that fully embraces the cloud, the diversity and scale of data both inside and outside of their ‘firewall’ and gives all end-users a way to translate data into insights – wherever they are.

Microsoft has a rich, decades-long legacy in helping customers get more value from their data. Beginning with OLAP Services in SQL Server 7, and extending to SQL Server 2012 features that span beyond relational data, we have a solid foundation for customers to take advantage of today. The new addition of an Apache Hadoop-based distribution for Windows Azure and Windows Server is the next building block, seamlessly connecting all data sizes and types. Coupled with our new investments in mobile business intelligence, and the expansion of our data ecosystem, we are advancing data management in a whole new way. …

Read more.

Ted introduced Hortonworks’ Eric Baldeschwieler (@jeric14), who reported “Yahoo now has 40,000 computers running Apache Hadoop”, “Over 80 percent of new data being generated is from unstructured sources” and “Hadoop could be storing half the world’s data within five years.” •• See update at end of post for more information about Eric Baldeschwieler and Hadoop competition.

Kummert said a Community Technology Preview (CTP) of the Hadoop-based service for Windows Azure will be available by the end of 2011, and a CTP of the Hadoop-based service for Windows Server will follow in 2012.

Denny Lee demonstrated a HiveQL query against log data in a Hadoop for Windows cluster, using a HiveODBC driver that Ted Kummert said will be available as a CTP next month (November 2011).

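Neither the demo script nor the log schema was published; as a rough illustration only, a HiveQL query over web-log data might look like the following sketch, in which the weblog table and its columns are hypothetical:

    -- Hypothetical HiveQL: the ten most-requested URLs in a web-log table.
    -- Table and column names are illustrative, not taken from the demo.
    SELECT url, COUNT(*) AS hits
    FROM weblog
    WHERE status = 200
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;

In the demo, PowerPivot consumed the results of a query like this through the HiveODBC driver rather than from the Hive command line.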

Denny’s Revelations – rolling the hard six to SQL BI and Hadoop post of 10/12/2011 provides more information on Apache Hadoop for Windows Azure and Windows Server and its integration with SQL Server BI:

Okay! With Ted Kummert’s Day 1 keynote today at the SQL Server PASS Summit 2011, I had the honor of demonstrating how SQL BI and Hadoop rock together! As you can see from the Port 25 Microsoft, Hadoop, and Big Data post and the Microsoft News Center post for SQL Server 2012, there are a number of cool things happening:

  • It started with the Hadoop connectors for SQL Server and PDW. The key callout here is that these connectors are bidirectional, allowing data movement back and forth between SQL Server and Hadoop.
  • Windows Server and Windows Azure optimized Hadoop distributions; out of the box (or cloud), the distributions include support for HDFS, Hive, Pig Latin, FTP, etc.
  • Our partnership with Hortonworks to help us push forward faster with optimizing Hadoop to run on Windows, as noted in their post Bringing Apache Hadoop to Windows.
  • As part of the demo today, I showed the integration of the SQL BI stack with Hadoop by having PowerPivot (for Excel and SharePoint) interact with a Hadoop for Windows cluster via Hive and the soon-to-be-released HiveODBC driver.
  • Not shown today, but just as cool, will be the release of the Excel Hive Add-in.

More information will be posted at www.microsoft.com/bigdata as it becomes available, eh?!

Cool, so why did I use “embrace Hadoop”?

A key callout during my conversation with Ted during the keynote is that our offering is 100% compatible with Apache Hadoop: if your code works on Apache Hadoop, then it will work on ours and vice versa. But it’s not just about the code; it’s also about the shift to embracing the open source community!

For example, one of the key demos that I have shown is the ability to write MapReduce jobs in JavaScript (as opposed to Java). This is what I would like to call:

Our VB moment in Big Data

That is, we made Visual Basic a powerful language for developers, and with .NET we opened the door for those developers to go into the enterprise. By making JavaScript a first-class language for Big Data, we are helping the millions of JavaScript developers enter the realm of Big Data. Even more awesome, JavaScript on Hadoop is an example of one of our proposals back to the Apache Hadoop community.
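
Microsoft had not published the JavaScript MapReduce API at the time of the keynote, so the word-count sketch below only assumes a plausible (key, value, context) callback shape; the function signatures and the values iterator are assumptions, not a documented spec:

    // Hypothetical JavaScript MapReduce word count.
    // The map/reduce signatures and the values iterator are assumed,
    // not taken from a published Microsoft API.
    var map = function (key, value, context) {
        // 'value' is one line of input text; emit a (word, 1) pair per word.
        var words = value.split(/\s+/);
        for (var i = 0; i < words.length; i++) {
            if (words[i].length > 0) {
                context.write(words[i].toLowerCase(), 1);
            }
        }
    };

    var reduce = function (key, values, context) {
        // 'values' iterates over the counts emitted for one word; sum them.
        var sum = 0;
        while (values.hasNext()) {
            sum += parseInt(values.next(), 10);
        }
        context.write(key, sum);
    };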

So why is Big Data / Hadoop important for a BI dude or dudette?

I’ll probably have a number of posts for this question alone, but let me give you one answer right now; this is an excerpt from my post: “Hadoop: A movement, not just a technology”

Why am I excited about Hadoop and Big Data even though I’ve been a Microsoft BI person for most of my career? Because, first and foremost, BI is all about making sense of the information. And the greatness of Big Data isn’t just about exploring, understanding, and asking even more questions of this information, but about doing it in distribution (vs. silos) and putting more emphasis on the data (i.e., this is where the real IP is).

Any other cool information on Big Data at SQLPASS this week?

Both Ted Kummert’s and David DeWitt’s keynotes will cover Big Data. If you cannot attend, check out the SQL Server PASS Summit 2011 Live Streaming. As well, there are two breakout sessions on Big Data, both on Thursday.

Also, don’t forget that I will be hosting the Big Data table at the Birds of a Feather luncheon, and a bunch of us will be floating around the Big Data Kiosk in the product pavilion.

Whew! I think that’s it for today!

For more details about Big Data in the Cloud, see my Choosing a cloud data store for big data (June 2011) and Microsoft's, Google's big data [analytics] plans give IT an edge and links to Resources (August 2011) for SearchCloudComputing.com.


Available now: Download the Microsoft SQL Server Connector for Apache Hadoop (SQL Server-Hadoop Connector) RTM:

Overview

Please Note: You must accept the license terms before you can download, install or use the software. By clicking the "DOWNLOAD" button, you are accepting the terms and conditions in the license terms. If you do not accept the license terms, do not click "DOWNLOAD."

You may access and print the license terms here: Microsoft SQL Server® Connector for Apache Hadoop License Terms.

The Microsoft SQL Server Connector for Apache Hadoop extends JDBC-based Sqoop connectivity to facilitate data transfer between SQL Server and Hadoop, and supports the JDBC features mentioned in the Sqoop User Guide on the Cloudera website. In addition, the connector supports the nchar and nvarchar data types.

With the SQL Server-Hadoop Connector, you can import data from the following sources (a sample import command appears after the list):

  • tables in SQL Server to delimited text files on HDFS
  • tables in SQL Server to SequenceFiles on HDFS
  • tables in SQL Server to tables in Hive*
  • results of queries executed on SQL Server to delimited text files on HDFS
  • results of queries executed on SQL Server to SequenceFiles on HDFS
  • results of queries executed on SQL Server to tables in Hive*

  • Note: importing data from SQL Server into HBase is not supported in this release.
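
As a sketch of what an import invocation might look like, the command below follows standard Sqoop 1.x syntax; the server, database, credentials, table name, and HDFS path are all placeholders:

    # Hypothetical import of a SQL Server table to delimited text on HDFS.
    # Server, database, credentials, table, and target path are placeholders.
    bin/sqoop import \
      --connect "jdbc:sqlserver://dbserver:1433;databaseName=SalesDB" \
      --username sqoopuser --password '********' \
      --table Orders \
      --target-dir /data/orders \
      --fields-terminated-by ','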

With the SQL Server-Hadoop Connector, you can export data from the following sources (a sample export command appears after the list):

  • delimited text files on HDFS to tables in SQL Server
  • SequenceFiles on HDFS to tables in SQL Server
  • Hive tables* to tables in SQL Server

* Hive is a data warehouse infrastructure built on top of Hadoop (http://wiki.apache.org/hadoop/Hive). We recommend using the hive-0.7.0-cdh3u0 version of Cloudera Hive.
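
An export runs in the opposite direction via sqoop export. Again, a hedged sketch with placeholder names; the target table must already exist in SQL Server:

    # Hypothetical export of delimited text on HDFS to an existing SQL Server table.
    # All names below are placeholders.
    bin/sqoop export \
      --connect "jdbc:sqlserver://dbserver:1433;databaseName=SalesDB" \
      --username sqoopuser --password '********' \
      --table OrdersArchive \
      --export-dir /data/orders \
      --input-fields-terminated-by ','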

Sqoop is an open source connectivity framework that facilitates data transfer between relational database management systems (RDBMSs) and HDFS. Sqoop uses MapReduce programs to import and export data; the imports and exports are performed in parallel with fault tolerance.

The source and target files used by Sqoop can be delimited text files (for example, with commas or tabs separating each field) or binary SequenceFiles containing serialized record data. Refer to section 7.2.7 of the Sqoop User Guide for more details on supported file types, and to the Hadoop API page for information on the SequenceFile format.

System requirements

Supported operating systems: Linux, Windows 7

Both a Linux machine (for the Hadoop setup) and a Windows machine (with SQL Server 2008 R2 installed) are required to use the SQL Server-Hadoop Connector.

The download site has detailed installation and configuration instructions.


• Update 10/12/2011 1:10 PM PDT: Gianugo Rabellino (@gianugo) posted Microsoft, Hadoop and Big Data on 10/12/2011 to the Port 25 (Communication from the Open Source Community at Microsoft) blog:

In a couple of weeks it will be my one-year anniversary here at Microsoft, and I couldn’t wish for a better anniversary gift: now that Microsoft has laid out its roadmap for Big Data, I’m really excited about the role that Apache Hadoop™ plays in this.

In case you missed it, Microsoft Corporate Vice President Ted Kummert announced earlier today that we are adopting Hadoop, with plans to deliver enterprise-class Apache Hadoop-based distributions on both Windows Server and Windows Azure.

This news is loaded with goodies for the big data community, broadening the accessibility and usage of Hadoop-based technologies among developers and IT professionals by making them available on Windows Server and Windows Azure.

But there is more. Microsoft will be working with the community to offer contributions for inclusion in the Apache Hadoop project and its ecosystem of tools and technologies.

I believe that all of this will really benefit not only the broader Open Source community, by enabling them to take their existing skill sets and assets and use them on Windows Azure and Windows Server, but also developers, our customers and partners. It is also another example of our ongoing commitment to providing interoperability, compatibility and flexibility.

As a proud member of the Apache Software Foundation, I personally could not be happier to see Microsoft willing to engage in such an important Open Source project and community.

Technical Considerations

On the more technical front, we have been working on a simplified download, installation, and configuration experience for several Hadoop-related technologies, including HDFS, Hive, and Pig, which will help broaden the adoption of Hadoop in the enterprise.

The Hadoop-based service for Windows Azure will allow any developer or user to submit and run standard Hadoop jobs directly on the Azure cloud with a simple user experience. [Emphasis added.]

Let me stress this once again: it doesn’t matter what platform you are developing your Hadoop jobs on; you will always be able to take a standard Hadoop job and deploy it on our platform, as we strive towards full interoperability with the official Apache Hadoop distribution.

This is great news, as it lowers the barrier to building Hadoop-based applications while encouraging rapid prototyping scenarios in the Windows Azure cloud for Big Data.

To facilitate all of this, we have also entered into a strategic partnership with Hortonworks that enables us to gain unique experience and expertise to help accelerate the delivery of Microsoft’s Hadoop-based distributions on both Windows Server and Windows Azure.

For developers, we will enable integration with Microsoft developer tools as well as invest in making JavaScript a first-class language for Big Data. We will do this by making it possible to write high-performance MapReduce jobs using JavaScript. Yes, JavaScript MapReduce, you read that right. [Emphasis added.]

For end users, the Hadoop-based applications targeting the Windows Server and Windows Azure platforms will easily work with Microsoft’s existing BI tools, like PowerPivot and the recently announced Power View, enabling self-service analysis on business information that was not previously accessible. To enable this we will be delivering an ODBC Driver and an Add-in for Excel, each of which will interoperate with Apache Hive.

Finally, in line with our commitment to interoperability, and to facilitate high-performance, bi-directional movement of enterprise data between Apache Hadoop and Microsoft SQL Server, we have released to manufacturing two Hadoop connectors for SQL Server.

The SQL Server connector for Apache Hadoop lets customers move large volumes of data between Hadoop and SQL Server 2008 R2, while the SQL Server PDW connector for Apache Hadoop moves data between Hadoop and SQL Server Parallel Data Warehouse (PDW). These new connectors will enable customers to work effectively with both structured and unstructured data.

I really look forward to sharing updates on all this as we move forward. For now, check out www.microsoft.com/bigdata and check back on the Data Platform Insider (DPI) blog tomorrow.

Gianugo is Sr. Director of Open Source Communities at Microsoft.


•• Updated 10/16/2011 with Derrick Harris’s (@derrickharris) Hadoop’s civil war: Does it matter who contributes most? post of 10/7/2011 to GigaOM’s Structure blog about competition between Hortonworks and Cloudera:

If you were going to buy a service contract for your open-source software, would you prefer your service provider actually be the certifiable authority on that very software? If “yes,” then you understand why Cloudera and Hortonworks have been playing a game of one-upmanship over the past few weeks in an attempt to prove whose contributions to the Apache Hadoop project matter most. However, while reputation matters to both companies, it might not matter as much as fending off encroachments on their common turf.

A few weeks ago, Hortonworks, the Hadoop startup that spun out of Yahoo in June, published a blog post highlighting Yahoo’s — and, by proxy, Hortonworks’ — impressive contributions to the Hadoop code. Early this week, Cloudera CEO Mike Olson countered with gusto, laying out a strong case for why Cloudera’s contributions are just as meaningful, maybe more so. Yesterday, it was Hortonworks CEO Eric Baldeschwieler firing back with even more evidence showing that, nope, Yahoo/Hortonworks is actually the best contributor. The heated textual exchange is just the latest salvo in the always somewhat-acrimonious relationship between Yahoo and Cloudera, but now that Team Yahoo is in Hadoop to make money, he who claims the most expertise might also claim the most revenue.

[Chart from Olson's post.]

[Chart from Baldeschwieler's post.]

Hortonworks is betting its entire existence on it. With the company likely not offering its own distribution, Hortonworks will rely almost exclusively on its ability to support the Apache Hadoop code (and perhaps some forthcoming management software) for bringing in customers. This is a risky move.

To make a Linux analogy, Hortonworks is playing the role of a company focused on supporting the official Linux kernel, while Cloudera is left playing the role of Red Hat, selling and supporting its own open-source, but enterprise-grade, distribution. Maybe Hortonworks should try to be Hadoop’s version of Novell. Whatever you think about the companies’ respective business models, though, it’s clear why reputation matters.

However, I’ve been told by a couple of people deeply involved in the big data world that perhaps Hortonworks and Cloudera would be better served if they spent their energies worrying about a common enemy by the name of MapR. MapR is the Hadoop startup that has replaced the Hadoop Distributed File System with its own file system, which it claims far outperforms HDFS and is much more reliable, and it already has a major OEM partner in EMC.

Ryan Rawson, director of engineering at Drawn to Scale and an architect working on HBase, told me he’s very impressed with MapR and that it could prove very disruptive in a Hadoop space that has thus far been dominated by Cloudera and core Apache. “The MapR guys definitely have a better architecture [than HDFS],” he said, with significant performance increases to match.

Rawson’s rationale for finding such promise in MapR is hard to argue with. As he noted, “garage hobbyists” aren’t building out large Hadoop clusters, but rather real companies doing real business. If MapR’s file system outperforms HDFS by 3x, that might mean one-third the hardware investment and fewer management hassles. These things matter, he said, and everyone knows that there’s no such thing as a free lunch: even if they give away the software, Cloudera and Hortonworks still sell products in the form of services.

It’s not just MapR that’s trying to get a piece of Apache Hadoop’s big data market share, either. As I explained earlier this week, there are and will continue to be alternative big data platforms that might start looking more appealing to customers if Hadoop fails to meet their expectations.

The Apache Hadoop community, led for the most part by Hortonworks and Cloudera, has some major improvements in the works that will help it address many of its criticisms, but they’re not here yet. Does it matter which company drives the code and patches for those improvements? Yes, it does. But maybe not as much as burying the hatchet and making sure the Apache Hadoop they both rely on remains worth using.


Note: Neither Olson nor Harris notes that Hortonworks can claim the right to add Yahoo’s code contributions to its own, resulting in 30% (about 700,000 lines) of the code contributions coming from Hortonworks and its predecessor.

