OakLeaf Systems: Adam Bosworth: "Learning from the Web" and Google Base

Wednesday, November 09, 2005

Adam Bosworth: "Learning from the Web" and Google Base

Microsoft XML guru Dare Obasanjo summarizes Adam Bosworth's "Learning from the Web" article posted in the Association for Computing Machinery (ACM) Queue, and concludes:

The article ends by arguing that database vendors should add native support for the Atom Protocol and wire format. I find this interesting since based on conversations on the atom-protocol list, it is clear that Google is very interested in the Atom API. Perhaps they have already built this Atom store that Adam is arguing for and will expose the Atom API as a way to interact with it. Perhaps this Atom store accessible via Atom feeds and the Atom API is Google Base? Speculation is fun.

Note: The Google Base URL (base.google.com) now opens a login form. Business Week published an initial Google Base article, and early Google Base screen captures are also available.

Update 11/25/05: Google Base is now live and accepts uploads in tab-separated-value (TSV), XML, Atom 0.3, and RSS 1.0/2.0 formats.

Craig Ogg's Software Voices blog offers similar speculation in a "Did Adam Bosworth Reveal the Real Google Base at the MySql Users Conference" post. Gordon Gould, Craig's business partner, also ruminates on Google's intentions for Base. The abstract of Adam's paper, "Database Requirements in the Age of Scalable Services," contains a link to an audio recording of his presentation. Adam distinguishes information stored in a database from content stored in Web pages, and suggests that databases with integrated query processors won't scale to Google's requirements, i.e., a billion or more queries per day.

Note: Adam Bosworth is Google's Vice President of Engineering. He and Brad Silverberg were founders of Analytica, which Borland purchased to provide the foundation for its Quattro Pro spreadsheet. While at Microsoft, he was responsible for designing and delivering Microsoft Access 1.x and the HTML rendering engine for Internet Explorer 4.0. Subsequently, he, Tod Nielsen (Access's first and foremost marketing guru), and other Microsoft employees founded Crossgain, which BEA acquired. Borland recently appointed Tod Nielsen as CEO, after his brief stint as an Oracle VP.

Bosworth's article makes the following two points about database scaling and caching:

3. Have databases enabled people to harness Moore’s law in parallel? This would mean that databases could scale more or less linearly to handle both the volume of the requests coming in and even the complexity. The answer is no.

4. Do databases optimize caching when it is OK to be stale? No.
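To make point 4 concrete, here is a minimal sketch of a cache that deliberately serves stale data for a bounded time, which is roughly the behavior slowly changing "catalog" data tolerates. Everything in it is an assumption made for illustration: the `StaleTolerantCache` class, the `load_catalog_item` loader, and the 60-second limit are hypothetical and come from neither Bosworth's article nor any database product.

```python
import time


class StaleTolerantCache:
    """Serve a cached value until it is older than max_age_seconds."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader          # function that fetches a fresh value
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the demo is deterministic
        self._entries = {}             # key -> (value, time stored)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] <= self._max_age:
            return entry[0]            # possibly stale, but within the limit
        value = self._loader(key)      # missing or too old: reload
        self._entries[key] = (value, now)
        return value


# Hypothetical loader standing in for an expensive database query.
calls = []

def load_catalog_item(key):
    calls.append(key)
    return f"item-{key}-v{len(calls)}"


fake_now = [0.0]                       # fake clock, advanced by hand
cache = StaleTolerantCache(load_catalog_item, max_age_seconds=60,
                           clock=lambda: fake_now[0])

first = cache.get("sku1")              # cache miss, loader runs
fake_now[0] = 30.0
second = cache.get("sku1")             # 30 seconds old, served from cache
fake_now[0] = 120.0
third = cache.get("sku1")              # 120 seconds old, loader runs again
print(first, second, third, len(calls))
```

The only design choice here is that the caller states how much staleness is acceptable, so most reads never reach the underlying store.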

Obviously, Google Base requires the capability to scale out and cache slowly changing or unchanging "catalog" data. SQL Server 2005 Enterprise Edition's partitioning and new Scalable Shared Databases feature address point 3. The capability of all editions except Express to invalidate ASP.NET 2.0 page caches with Query Notifications answers "Yes" to point 4.

Two other points Bosworth raises have the potential to be solvable with SQL Server 2005's native XML data type and support for XQuery, full-text search, or both:

5. Do databases handle flexible graphs (or trees) well? No, they do not.

6. Have the databases learned from the Web and made their queries simple and flexible? No, just ask a database if it has anyone who, if they have an age, are older than 40; and if they have a city, live in New York; and if they have an income, earn more than $100,000. This is a nightmare because of all the tests for NULL.
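The query in point 6 can be sketched in plain SQL to show where the NULL tests come from. The example below uses Python's built-in sqlite3 module rather than SQL Server, and the `people` table, its columns, and the sample rows are all hypothetical, invented only to illustrate the pattern.

```python
import sqlite3

# Hypothetical table and rows; NULL (None) means the attribute is unknown.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, age INTEGER, city TEXT, income INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Ann", 45, "New York", 150000),  # every attribute present and matching
        ("Bob", None, "New York", None),  # age and income unknown
        ("Cal", 30, "New York", 200000),  # has an age, but is not over 40
        ("Dee", 52, None, None),          # only the age is known
        ("Eve", 61, "Boston", 120000),    # has a city, but not New York
    ],
)

# "If they have an X, then ..." needs an explicit IS NULL test per attribute.
FLEXIBLE = """
SELECT name FROM people
WHERE (age IS NULL OR age > 40)
  AND (city IS NULL OR city = 'New York')
  AND (income IS NULL OR income > 100000)
ORDER BY name
"""

# Without the NULL tests, any row with a missing attribute drops out,
# because a comparison with NULL never evaluates to true.
NAIVE = """
SELECT name FROM people
WHERE age > 40 AND city = 'New York' AND income > 100000
ORDER BY name
"""


def names(connection, sql):
    return [row[0] for row in connection.execute(sql)]


print(names(conn, FLEXIBLE))  # ['Ann', 'Bob', 'Dee']
print(names(conn, NAIVE))     # ['Ann']
```

Each optional attribute adds another `IS NULL OR` clause, which is the bookkeeping Bosworth is complaining about.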

The scalability of SQL Server 2005's XQuery implementations, even with XML indexes, remains to be seen. However, I wouldn't expect to be able to search Google Base with XQuery expressions.

Bosworth concludes his article with this paean to Oracle on page 5:

Oracle has done a remarkable job of adding XML to its database in the various ways that customers might want. In so doing, it has added a lot of these capabilities. Its ROWID type allows some forms of flexible linkage. But none [of the database vendors] really show that they have learned from the Web.

I don't understand how a ROWID pseudocolumn can provide a better form of "flexible linkage" for XML documents than an auto-incrementing (int identity) primary key. It's my opinion that SQL Server 2005's XML data type will support RSS 2.0/Atom documents and do so more easily and efficiently than Oracle 10g. But accessing a production database with an RSS 2.0/Atom wire protocol appears to me to have serious security repercussions and performance problems. Finally, the jury's still out on the issue of whether SQL Server 2005's xml columns will partition and scale to Googlesque requirements.

--rj