Sunday, July 08, 2007

Michael Stonebraker on Accessing Data as Entities and Low-Latency Streaming

The May/June 2007 issue of the Association for Computing Machinery's ACM QUEUE magazine includes a wide-ranging interview: "A Conversation with Michael Stonebraker and Margo Seltzer: Relating to Databases." Stonebraker was the primary force behind UC Berkeley's development of the Ingres relational database, which provided the foundation for the PostgreSQL open-source database, and is an adjunct professor of computer science at MIT. Margo Seltzer is a professor of computer science at Harvard University and one of the founders of Sleepycat Software, publisher of the Berkeley DB embedded database engine. Ingres Corporation is owned mostly by Computer Associates, and Oracle acquired Sleepycat Software in 2006.

Stonebraker's major premise is that today's one-size-fits-all relational OLTP and OLAP databases/applications, which he calls legacy code sold by elephants, will be displaced in important vertical markets by new technology sold by start-ups, which Seltzer calls mice. Along the way, he attributes object-oriented databases' lack of market success to their failure to gain a foothold in the computer-aided design (CAD) industry.

Accessing Data as Entities

Stonebraker isn't sanguine about the long-term prospects for SQL, which he characterizes as a language that "started out as a simple standard and [grew] into a huge thing, with layer upon layer of junk."

However, he likes the way Ruby on Rails incorporates data access into the language itself, a la LINQ, and implements the entity-relationship model, as the ADO.NET Entity Framework does. Stonebraker says:

C++ and C# are really big, hard languages with all kinds of stuff in them. I’m a huge fan of little languages, such as PHP and Python. Look at a language such as Ruby on Rails. It has been extended to have database capabilities built into the language. You don’t make a call out to SQL; you say, “for E in employee do” and language constructs and variables are used for database access. ...

Let’s look at Ruby on Rails again. It does not look like SQL. If you do clean extensions of interesting languages, those aren’t SQL and they look nothing like SQL. So I think SQL could well go away. More generally, Ruby on Rails implements an entity-relationship model and then basically compiles it down into ODBC. It papers over ODBC with a clean entity-relationship language embedding.
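Stonebraker's "for E in employee do" point can be sketched in a few lines of plain Ruby: the query is expressed with ordinary language constructs and variables, not an embedded SQL string. The Employee struct and sample data below are my own illustration, not ActiveRecord code.

```ruby
# Plain-Ruby sketch of querying data with language constructs instead
# of SQL. Employee and the sample records are assumptions for
# illustration only.
Employee = Struct.new(:name, :salary)

employees = [
  Employee.new("Alice", 95_000),
  Employee.new("Bob",   60_000),
]

# Filtering and projection are ordinary Ruby, "for e in employees" style:
well_paid = employees.select { |e| e.salary > 70_000 }.map(&:name)
# well_paid == ["Alice"]
```

Nothing in that snippet looks like SQL, which is exactly the property Stonebraker is praising.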

I don't agree with Stonebraker's description of Ruby on Rails compiling the E-R model to ODBC, which—along with JDBC—he characterizes as "the worst interfaces on the planet. I mean, they are so ugly, you wouldn’t wish them on your worst enemy." However, I share his disdain for ODBC, which is two generations older than SqlClient.

In reality, Stonebraker is describing the Entity Framework and LINQ to Entities.

Although LINQ has an SQL-like syntax, I certainly prefer it to the following Ruby on Rails examples:

class BlogComment < ActiveRecord::Base
  belongs_to :blog_post
end

class BlogPost < ActiveRecord::Base
  has_many :blog_comments, :order => "date"
end

class BlogPost < ActiveRecord::Base
  has_many :blog_comments, :order => "date",
           :dependent => true
end
Even LINQ's chained method syntax is easier to read than the preceding snippets from Exploring Ruby on Rails. Another benefit of the EF, as well as LINQ to SQL, is that you aren't locked into the ActiveRecord model.

Low-Latency Streaming

Another issue Stonebraker raises is latency when processing streaming data, especially in the financial markets. Financial data feeds "are going through the roof" and "legacy infrastructures weren’t built for sub-millisecond latency, which is what everyone is moving toward." Stonebraker observes:

Until recently, everyone was using composite feeds from companies such as Reuters and Bloomberg. These feeds, however, have latency, measured in hundreds of milliseconds, from when the tick actually happens until you get it from one of the composite-feed vendors.

Direct feeds from the exchanges are much faster. Composite feeds have too much latency for the current requirements of electronic trading, so people are getting rid of them in favor of direct feeds.

They are also starting to collocate computers next to the exchanges, again, just to knock down latency. Anything you can do to reduce latency is viewed as a competitive advantage.

Let’s say you have an architecture where you process the data from the wire and then use your favorite messaging middleware to send it to the next machine, where you clean the data. People just line up software architectures with a bunch of steps, often on separate machines, and often on separate processes. And they just get clobbered by latency. ...

If I want to be able to read and write a data element in less than a millisecond, there is no possible way that I can do that from an application program to any one of the elephant databases, because you have to do a process switch, a message to get into their systems. You’ve got to have an embedded database, or you lose.

In the stream processing market, the only kinds of databases that make any sense are ones that are embedded. With all the other types, the latency is just too high.
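Stonebraker's point is architectural: an embedded database lives in the application's own process, so a read or write is a function call rather than an inter-process message plus a context switch. The toy store below illustrates that shape in Ruby; it is a stand-in for an embedded engine such as Berkeley DB, not a real one.

```ruby
# Toy in-process (embedded-style) store. Reads and writes are plain
# method calls in the caller's address space: no socket, no IPC
# round trip, no process switch on the critical path. The class and
# its API are assumptions for illustration.
class EmbeddedStore
  def initialize
    @data = {}
  end

  def write(key, value)
    @data[key] = value
  end

  def read(key)
    @data[key]
  end
end

store = EmbeddedStore.new
store.write("MSFT", 29.47)   # a tick arrives
store.read("MSFT")           # an in-process hash lookup, nothing more
```

With a client/server "elephant" database, each of those calls would instead cross a process boundary, which is exactly the overhead Stonebraker says sub-millisecond trading cannot afford.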

Of course, the same requirement holds for processing digital sensor streams. These are the types of applications that might benefit from LINQ's expression-tree approach. Oren Novotny's LINQ to Streams (SLinq) and Aaron Erickson's LINQ to Expressions (MetaLinq) are interesting steps in this direction. Stonebraker's comment about the reduced latency of embedded databases supports my contention that LINQ to SQL should support SQL Server Compact Edition (SSCE).
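The continuous-query style these stream engines run can be sketched with Ruby's lazy enumerators: the query is declared once over an unbounded source and evaluated in-process as data arrives. The simulated sensor feed below is an assumption for illustration, not SLinq or MetaLinq code.

```ruby
# A continuous-query sketch over an unbounded stream, evaluated
# lazily in-process. The infinite range stands in for a live sensor
# feed; the threshold and scaling are arbitrary illustration values.
readings = (1..Float::INFINITY).lazy.map { |i| i * 0.5 }

# "Alert on the first three readings above 2.0" -- the filter runs as
# each element is produced, never materializing the whole stream.
alerts = readings.select { |r| r > 2.0 }.first(3)
# alerts == [2.5, 3.0, 3.5]
```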