Monday, December 05, 2005

Problems Uploading a Google Base Custom Item Type from a TSV File

Google Base lets you define custom item types to supplement the standard item types—such as Products, News and Articles, Reference Articles, recipes, and the like. This entry describes what appear to me to be bugs or errors in creating and bulk-uploading files for custom item types.

You create custom item types by specifying a combination of predefined and optional custom attributes by bulk-uploading a user-defined, tab-separated-values (TSV) file with a .txt extension. The predefined title attribute, which has a maximum length of 80 characters, and a description attribute are mandatory. (The description attribute reportedly has a maximum length of 65,536 characters for TSV files; Atom 0.3 XML files limit the description attribute (context element) to 10,000 characters.)

An unique id value (an optional primary key) for an item ensures that items retain their original identity when updated. The TSV file must contain a first row of attribute name headers.

Custom attributes require a c: prefix for the attribute name, as in c:attribute_name; data type assignment (c:attribute_name:data_type) is optional. Valid data types include string, integer, decimal, dateTime, location, URL and boolean. A row of TSV data follows for each entry.

Note: Read my earlier "Google Base and Bulk Uploads with Microsoft Access" entry for a description of the basic bulk-update process for predetermined item types, such as Products.

Problem 1 (Google): The help topic states that boolean values may be Yes/No or True/False. Only true or false is allowed; True or False throws Bad Data errors during bulk uploads. [This problem was reported to be corrected the week of December 5, 2005].

You substitute a custom for a standard item type by specifying the registered file name in the Specify a Bulk Upload File page, selecting the Submit a Custom Item Type option for Item Type, and typing the custom type name in the adjacent text box. Figure 1 is a screen capture of the partial page for the NAICS Industry Taxonomy that the next section describes.

Figure 1

Problem 2 (Google): Figure 1's help topic states that you can "upload up to 100000 items per bulk upload, with a maximum file size of 10MB." Doing so might be possible, but uploading more than a few hundred items results in this error message: "You have exceeded the activity limit for this account on the beta version of Google Base. Not all of your active items may be displayed on your Dashboard." [Tests with about 3,850 items indicate that a maximum of 3,000 appear in the Dashboard and you can browse 1,000 items.]

Bulk-uploading TSV file formatting and operational problems aren't likely to become apparent unless you create a reasonable number of "real-world" custom items that have a combination of predefined and customer attribute name/value pairs. The following sections describe the problems and their solutions or workarounds for two moderately complex custom items created from data sources that aren't subject to copyright restrictions.

Item Types for Industrial Taxonomies Most classified advertising and shopping sites have predetermined merchant and product taxonomies. Industrial taxonomies are hierarchical lists of business and governmental activities that categorize—usually by number—the activities manufacturersers, distributors, wholesalers, and retailers of products, as well as service organizations.

Note: Google Base has a list of sequential—not hierarchical—Business Location Bulk Upload Categories for categorizing Froogle Local stores. Categories 1 to 473 are sequential by class—e.g., Business-to-Business, Education, Entertainment, Government Offices—and specialty—e.g., Manufacturers, Music Schools, Movie Theaters, Courts. Google added categories 474 to 515 (as of December 5, 2005) an ad hoc process. Proprietary taxonomies of this nature are very common, but difficult to maintain (as Google has discovered).

Many industrial taxonomies have appeared to standardize data entry and collection. Most standardized industrial taxonomies are created by governmental or quasi-governmental agencies for collecting, categorizing, and summarizing economic activity. You might liken a standardized taxonomy's numerical values to "Web 2.0" tags, but the numbers have no meaning to users without additional text information, which might require access to a current copy of the taxonomy.

Note: Taxonomy tags require the category number and its specific descriptive title to be useful. Thus taxonomy tags don't qualify as folksonomy tags, which users choose arbitrarily. Tom Gruber's "Ontology of Folksonomy: A Mash-up of Apples and Oranges" paper differentiates between ontology and folksonomy, and distinguishes categories from tags. Many items require multiple tag that can be used, for instance to search for parents in the hierarchy. 37signals.com's "Tag formats: Can't we all just get along?" post by 37signals.com's Matt Linderman describes the various types of multiple tag formatting in common use today."

One of the first such hierarchical industry/activity lists was the U.S. Standard Industrial Classification (SIC), which was established in 1939 and last updated in 1967. The SIC uses a four-digit code for the most detailed entries of the hierarchy. The North American Industry Classification System (NAICS) supplanted SIC in 1997. Although SIC is obsolete, it's still in common use by many mature small an medium-sized businesses (SMBs).

NAICS is the standard industrial taxonomy for the 1994 North American Free Trade Agreement (NAFTA). NAICS's six-digit codes define U.S. industries; five-digit code define industry names common to the three NAFTA signatories. The U.S. Bureau of the Census administers NAICS and updates it every five years. The 2002 update to NAICS has 667 six-digit U.S. industry codes, which include farming, manufacturing, wholesaling, and other sectors that aren't associated with retailing. The domain of the hierarchical NAICS ontology has 1,833 entries with two-, three-, four-, five-, and six-digit codes. You can download NAICS 2002 tables, flat files, or both, from links on this page.

NAICS is acandidate candiate for generating Google Base custom item type that users can search to determine the correct NAICS code for their business. For example, Figure 2 shows that the search for "computer" as a keyword returns 21 items from the NAICS Industry Taxonomy item type. Click here to open the Posted Items page.

Figure 2

Figure 3 is a screen capture of a typical NAICS item as seen by a typical user. Click here to open the OalLeafSystems's [sic] Items page.


Figure 3

Notice that the item type name appears, by default, as a single label attribute value. NAICS tables provide detailed description of most five-digit and six-digit NAFTA industry and all six-digit U.S. industry codes. Users can imply product from industry descriptions; industry descriptions also are useful for categorizing businesses for sale.

Note: A very complex Access (Jet) SQL query generates a table for exporting complete NAICS listings, which must include five- and six-digit NAICS codes, and adds LEFT JOINS for cross-references to SIC codes and descriptions, plus United Nations International Standard Industrial Classification (UN-ISIC or ISIC) Rev. 3.1 codes and descriptions. The table with unique-valued rows for all NAICS, and related SIC, and UN-ISIC codes contains 3,940 items, which include additions to NAICS codes for six-digit NAFTA industries with the sixth digit = 0. Additional items are added for NAICS codes that have more than one SIC or ISIC code. UN-ISIC codes aren't included in this section's screen captures. If you're interested in the Jet SQL or Transact-SQL query to create the sample table, please leave a comment.

Problem 3 (Google): All custom item type items you add expire in 30 days, regardless of the predefined expiration_date value you supply. Monthly expiration might be appropriate for Products items, but Taxonomies definitely should not expire. The appropriate solution is for Google to add an unmarked-by-default "Expires in 30 days" check box to the Specify a Bulk Upload File page. Optionally, users can then specify an expiration_date value; for example, 2007-12-31 would be appropriate for NAICS 2002 data, because NAICS will be updated in 2007.

An append query regenerates the Access 2003 table (NAICS2000Codes) shown by Figure 4 in datasheet view:


Figure 4

Notice that the NAICS_Code, NAICS_Title, SIC_Code, and SIC_Title column names use upper and lower case characters, rather than Google Base's standard lower-case naming convention. Appending the SIC_Code text to NAICS_Code text generates the unique id value.

Note: Figure 3 shows normalization of the column names to Details attribute names by replacing the underscore with a space. However, if you look carefully in Figure 2, Google Base demotes the item type name and Details attribute/value pairs to lower case.

Problem 4 (Google): Lower-casing mixed-case item names and attribute name/value pairs disguises names, values, or both that contain upper-case abbreviations or acronyms. Lower-casing isn't necessary for case-insensitive searches.

Problem 5 (Access 2003): An Access append query to an empty table—rather than a make-table query—is required because the description column often gains the Text data type—rather than the required Memo data type that can handle more than 255 characters. Most description values exceed a length of 255 characters. The id column is specified as the primary key to alert users to potential unique value errors.

Problem 6 (Access 2003): Exporting the content of Access Memo columns to TSV or CSV fields with the Export Text Wizard truncates the description value at 255 characters. The simplest method is to Export the table to an Excel 2003 workbook, add the c: prefix and optional :integer or :string suffix to the custom field name values, save the workbook in Text (Tab delimited) (*.txt) format, open the .txt file in Notepad and replace all double-quotes (") with nothing. Alternatively, export the table to SQL Server 2005 Developer edition or higher and use SQL Server Integration Services (SSIS) to export a LATIN-1-encoded TSV file.

Note: Excel 2003's Save As Text (Tab Delimited) selection doesn't enable specifying empty text qualifiers, so fields whose value starts with an alphabetic character gain "text" qualifiers. Using SSIS to create a LATIN-1 TSV file from an imported Access table is a complex process. Leave a comment if you're interested in this method, and I'll add the illustrated procedure in a subsequent item.

Bulk-unloading the correctly formatted TSV file—NAICS2002Codes.txt for this example—as the the NAICS Industy Taxonomy custom item type throws an "Attribute has too many values error" when attempting to upload the item on file line 174, as shown in Figure 5.


Figure 5

The full text expansion of the Bad Data for SIC Code 2099 is: Food Preparations, NEC (except bouillon, marshmallow creme, spices, extracts, peanut butter, perishable prepared foods, tortillas, tea, spices, dip mix, salad dressing mix, seasoning mix, and vinegar), which contains 15 commas.

Problem 7 (Google): The predefined label type permits up to 10 phrases that have a maximum of 40 characters each in the CSV (comma-separated-values) format without "text" text specifiers. Google Base treats the details values as the CSV text for label attribute values. The only current domain-wide workaround is to replace commas with semicolons in the description field. A better solution to the problem is for Google Base to add all CSV values, flag the item for editing, and prevent it from publishing until edits alter the offending values.

Added 12/12/2005: A related Google Base problem is inconsistent treatment of values containing commas in custom details fields. Figure 5A illustrates truncation of the uploaded UN-ISIC Title (hightlighted) for UN-ISIC Code 5233. The full uploaded text for this title is Retail sale of household appliances, articles and equipment. The clause preceding the comma is missing.



Figure 5A (Added 12/12/2005)

Click here to open the Items page for NAICS code 443112.
The title attribute truncates at 100 characters and adds an unrendered <b> ... </b> HTML tag to the value, as shown for NAICS code 334511 in Figure 6. Click here to open the OakLeafSystems's [sic] Items page.



Figure 6

The full title length for the entry shown in Figure 6 is 113 characters.

Problem 8 (Google): The help topic for the title attribute states that its maximum length is 80 characters. Tests show the correct value is 100 characters. Also, HTML tags aren't valid in Google Base attribute values. The workaround is to truncate the title attribute to a maximum of 100 characters and add ... to the truncated value.

Adding Labels to Filter Items by NAICS Sector, Subsector, and Industry Group
It's a common requirement of hierarchical lists to enable filtering items at upper levels of the hierarchy. For this example, the NAICS hierarchy consists of Sector (two digits) -> Subsector (three digits) -> Industry Group (four digits) -> Industry (five digits) -> NAFTA/U.S. Industry (six digits). There are one or more items for each Industry and NAFTA/U.S. Industry members. Although you can filter on parent numbers or strinconvenientmore covenient to let users specify filter criteria with one or more predefined label values. You can assihierarchy nine hiearchy values as a label value; one label value is reserved for the custom item type; NAICS Industrial Codes from OakLeaf_Systems replaces NAICS Industry Codes from OakLeafSystems for this example.

Note: The custom item type label doesn't appear in the Labels editing textarea until Google Base publishes the original or updated item. As of December 7, 2005, publishing appears to occur more quickly (one hour or less) than in the past (several hours or more).

Numerical values require descriptions to be useful. Thus, a typical label attribute value for the three NIACS parent codes is 11 - Agriculture; Forestry; Fishing ..., 111 - Crop Production, 1111 - Oilseed and Grain Farming or 81 - Other Services (except Public A ..., 811 - Repair and Maintenance, 8112 - Electronic and Precision Equi ....

Individual attributes are limited to 40 characters, so ellipsis indicates a truncated title.

Note: VBA code behind a simple form with a single command button iterates the enhanced NAICS2002Codes table and updates the value of the source table's label column with the appropriate label attribute. If you're interested in the VBA code for this process, add a comment.

Problem 9 (Google): Restricting the length of individual label attributes to 40 characters is an issue when constructing precise keyword filters such as those derived from hierarchical lists. There is no workaround, so Google should increase the maximum length of an individual label attribute to at least 80 (and preferably 100) characters for custom (or all) item types. Multiple lines in Posted Items lists and PostingAlias's items page to display label values are required regardless of the added length.

Note: To display the following figures from Google Base, open the application without signing, click the Google Base Beta image to return to the home page, type OakLeaf_Systems in the Find Items Posted by Others text box, and click Search Base to display the default Posted Items list. Type 33 in the search text box, and click the niacs industrial taxonomy item to display a few items from Sector 33.

Filtering by label attribute values leaves much to be desired in the current (December 7, 2005) beta version. Only one or two of the three hierarchy attributes appear in entries on the the Posted Items page and inclusion of label attribute values is inconsistent in the list. The More or Less option that appears with short label terms also is missing.

Figure 7 illustrates the first part of the the default initial Posted Items page for the enhanced NAICS taxonomy bulk-uploaded to Google Base on December 7, 2005.


Figure 7

Click here to open the current Posted Items page, which displays items in random order. Notice that NAICS code 335122 has no visible hierarchy attribute values; 333924 has one (333 - machinery manufacturing).

Figure 8 illustrates the OakLeaf_Systems's items page for an item thierarchythree hiearchy labels with lengths that exceed 40 characters. Notice that the page is missing the initial Sector label.


Figure 8

Note: The enhanced NAICS2002Codes.txt bulk-upload file (2.2 MB) includes UN-ISIC Rev. 3.1 codes and titles, which appear as Details attributes in Figure 8. The advantage of UN-ISIC codes is that they have a coordinated products taxonomy—United Nations Central Product Classification (CPC or UN-CPC) Ver 1.1. A similar NAICS-coordinated product taxonomy—North American Product Classification System (NAPCS)—is under development for 12 specific NAICS sectors. The initial NAPCS taxonomy is scheduled for completion by the end of 2005. Given governmental productivity issues in this area, don't hold your breath for imminent NAPCS page activation.

Click here to open the OakLeaf_Systems's items page for NAICS code 335122. Clicking either visible hierarchy item with trailing ellipsis returns no matching items.

Figure 9 illustrates the OakLeaf_System's items page for an industry with one fairly short label—333 - Machinery Manufacturing.


Figure 9

Click here to open the OakLeaf_Stystems's items page for NAICS code 333924. Clicking the 333 - Machinery Manufacturing hierarchy item returns 62 items. Click here to display the Posted Items list. Clicking the 3339 - Other General Purpose Machine ... hierarchy entry with trailing ellipsis returns no matching items.

Problem 10 (Google): You can't filter searches with label attribute values that aren't accessible in the Posted Items or PostingAlias (e.g. OakLeaf_Systems) items page. Sufficient space should be allocated to display all individual label attribute values on both pages. If this is impractical for Posted Items, it must be allocated on the PostingAlias items page.

Problem 11 (Google): Adding ellipsis to indicate partial individual label values prevents matching identical valhierarchyilter by hiearchy. Unless there is a simple workaround, the problem must be fixed.

Problem 12 (Google): The Refine Your Search element only displays the lower-cased custom item name; the remaining space for links usually is occupied by inactive Details items. The more ... or fewer ... links that appear for labels in the Products Posted Items list are missing from custom item lists. There is no valid reason for this omission.

Problem 13 (Google): Details items occupy space in the Posted Items list but serve no purpose there. Users should be able to click a Details attribute in this list to filter the list, rather than manually typing Details attribute values in the search text box.

Preliminary Problem Summary
One of the original examples of potential Google Base content cited by Google was the "genome of the 1918 influenza pandemic," according to the November 15, 2005 "Google Base service goes live" article by CNet's Linda Mills. There's considerable controversy about publishing the structure of or recreating the 1918 virus, as Jamais Cascio reports in his original "Sequencing the Killer Flu" and later "Safety in Knowledge" posts. Mills' earlier "Google wants your car listings, events" article includes a reference to "database of protein structures" from a late-October 2005 pre-beta screen capture. Google references to "genome" and "protein structures" have disappeared with good reason: Current Google Base search and filtering capabilities for custom item types are far too primitive and restrictive for sophisticated scientific searches.

DNA Direct's Jason R. Bobe quotes from Chapter 26, "Googling Your Genes," of The Google Story, published November 15, 2005:
One of the most exciting Google projects involves biological and genetic research that could foster important medical and scientific breakthroughs. Through this effort, Google may help accelerate the era of personalized medicine ...

Over dinner and plenty of wine in February 2005, Sergey Brin discussed the prospects for genetics and Google with the maverick biologist Dr. Craig Venter... Not long after the dinner in California, Brin and Page teamed up with Venter...[who is quoted as saying:] "Working with Google, we are trying to generate a gene catalogue to characterize all the genes on the planet and understand their evolutionary development. Geneticists have wanted to do this for generations." Over time, Venter said, Google will build up a genetic database, analyze it, and find meaningful correlations for individuals and populations. [It is utilizing the 30,000 genes discovered by Venter and scientists from the National Institutes of Health when they were racing to beat one another to map the human genome.]
Note: You can read a longer Chapter 26 excerpt courtesy of the Washington Post and learn more about the Human Genome Project from Nature's "Genome Gateway" portal.

Odeo's November 22, 205 Future Tense podcast from American Public Media, "Googling your genes" offers a brief interview with The Google Story's co-author, David Vise, about searching genetic information in a Google Base context.

However, eWeek's Ben Charney quotes an unnamed Venter representative in his November 21, 2005 "Google Gene Project Comes into Question" article: "We do not have any ongoing projects with Google."

Pedro Beltrão casts a jaundiced eye on Google Base as a means of sharing scientific information—such as simple protein sequences as a biological sequence custom item type—in his November 17, 2005 "Google Base and Bioinformatics II" post. Click here to display a sample DNA sequence with associated images. Pedro follows up with a later "Google Base simple tricks" post that offers examples of manually-entered search URLs.

Google Base Beta has a long way to go before it can be considered a truly universal, searchable database. If you can't create an easily searchable and filterable flat table that represents a simple hierarchical industrial taxonomy, it's not likely that Google Base will be useful to store and search/filter items other than the current short-list of predefined item types. Thus, it's not surprising that Google's early references to "genome" and "protein databases" have disappeared.

Technorati:

0 comments: