Saturday, September 11, 2010

The trouble with Google Books

How rampant errors threaten the scholarly mission of the vast digital library.
by: Laura Miller

Depending on who you ask, Google Books -- the pioneering tech company's ambitious plan to "digitally scan every book in the world" and make them searchable over the Web and in libraries -- is either a marvelous, utopian scheme or an unprecedented copyright power-grab. The people who can claim to fully understand the Google Book Search Settlement -- the resolution of a class-action suit filed against the company by the Authors Guild and the Association of American Publishers -- may be as few as those who comprehend the theory of special relativity.

But everyone seems to agree that Google Book Search represents a revolutionary boon to scholars, especially people embarked on specialized research but without ready access to a university library. But is it? As UC-Berkeley professor Geoffrey Nunberg pointed out in an article for the Chronicle of Higher Education last year (expanded from a post on the blog Language Log), a research library is only as useful as the tools required to extract its riches. And there are some serious problems with the bibliographic information attached to many of the digital texts in Google Books.

Nunberg, a linguist interested in how word usage changes over time, noticed "endemic" errors in Google Books, especially when it comes to publication dates. A search for books published before 1950 and containing the word "Internet" turned up the unlikely bounty of 527 results. Woody Allen is mentioned in 325 books ostensibly published before he was born.

Other errors include misattributed authors -- Sigmund Freud is listed as a co-author of a book on the Mosaic Web browser and Henry James is credited with writing "Madame Bovary." Even more puzzling are the many subject misclassifications: an edition of "Moby Dick" categorized under "Computers," and "Jane Eyre" as "Antiques and Collectibles" ("Madame Bovary" got that label, too).

Although Google representatives did respond to Nunberg's article, blaming the bulk of the errors on outside contractors, much of the incorrect information remains in place. Looking at listings for "The Golden Bough" by James Frazer, a seminal work on comparative religion with a complex and fascinating publication history, I found one edition characterized as "Life Sciences." The 12 volumes of what is arguably the most authoritative edition of the book (published between 1910 and 1915) aren't grouped together or searchable as a whole, and the foremost search result is a dubious reprint of the bowdlerized 1922 edition with an introduction lifted from Wikipedia and a publication date of 1947, although the text itself claims a publication date of 2008.

I've already written about inadequate metadata -- specifically how it can curtail readers' choices. So I gave Nunberg a call to find out how flawed metadata affects historians and other scholars.

What is metadata?

Metadata is data about a text or work. The card for a book from an old card catalog is metadata: The title, the author, the publisher, the date of publication, the number of pages and so on. In the future, it could also include all sorts of other information, such as how many people have read it, or how many copies of it have sold.

When you're dealing with any collection of books -- whether it's a research library or your local Barnes and Noble -- you need something like metadata. Say you're looking for a children's book on antelope or birds, so you go to the children's section and within that section you look for the "nature" category. Similarly, if you want a novel by Anthony Trollope, you use the metadata of the retail space to find it. It's in the fiction or literature sections, shelved alphabetically by author.

So the actual physical organization and shelves in a bookstore are a form of metadata, since they provide you with information about the books contained in each section of the store?

Yes. Even at home, if you've got more than, say, 100 books (or more than 100 of anything, really) you have a system for organizing them in some way so you can find what you want. Everyone has metadata, even if it's just alphabetical order. And that's even more important with a scholarly collection.

And what are the problems with the way Google Books handles metadata about the books in its collection?

Google Books was conceived of in two ways. The first is as a new library -- I call it the "last library" -- an aggregate of all the libraries in the world. The second is as a big database, a storehouse of information that you could search the way you search Google. The idea behind that is that books are just stored information. If I want to know who wrote Roosevelt's inaugural speech, I can do a search and look it up.

But those two ideas are at odds with each other, which is something that Google didn't realize. The beauty of Google is that you don't need metadata, after all. You just barrel into the text and pull out what you want. So metadata -- information about the source text -- was not something they focused on.

How is that inattention a problem? Why is metadata important to, for example, scholars?

Metadata includes information about a particular text, and sometimes also about a particular copy: Zhou Enlai's personal copy of Marx, for example, might be of special interest to a scholar. I might want to search for the first sentence of a Henry Fielding novel across different editions. That information can't be derived from Googling. And I might want to search across collections: How often was a word used in a particular historical period? In that case, the accuracy of metadata about each book is crucial.

Even though you're not looking at each individual book, one at a time. I see. Can you give me an example?

There's this observation that "United States" was used first as a plural noun, but now it's invariably used as a singular noun, which reflects an evolution in how people viewed the nation. Supposedly, this changed with the Civil War, but it's actually more complicated than that. If you don't have the correct metadata -- in this case, the publication date -- attached to the texts, then you can't do an accurate search on how the word was used (before and after 1865).
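To make the point concrete, here is a minimal sketch in Python of how a single misdated record can distort that kind of before-and-after count. The Book record, the sample texts and the two search patterns are invented for illustration; they are not drawn from Google Books or from Nunberg's research.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Book:
    """Hypothetical record: a text plus the date its metadata claims."""
    title: str
    year: int   # publication date as given in the metadata
    text: str


# Simplified patterns; a real study would need far more careful matching.
PLURAL = re.compile(r"\bUnited States are\b")
SINGULAR = re.compile(r"\bUnited States is\b")


def usage_by_period(books, cutoff=1865):
    """Count plural and singular uses before and after the cutoff year."""
    counts = Counter()
    for book in books:
        period = "before" if book.year < cutoff else "after"
        counts[(period, "plural")] += len(PLURAL.findall(book.text))
        counts[(period, "singular")] += len(SINGULAR.findall(book.text))
    return counts


# Two invented texts with correct dates.
books = [
    Book("Pamphlet A", 1840, "The United States are a federation."),
    Book("Essay B", 1902, "The United States is a nation."),
]
correct = usage_by_period(books)

# The same texts, but Essay B now carries a wrong date in its metadata.
misdated = [books[0], Book("Essay B", 1802, books[1].text)]
skewed = usage_by_period(misdated)

print(correct[("after", "singular")], skewed[("before", "singular")])
```

With the correct dates, the singular use is counted after 1865; with the wrong date, the same sentence is counted before 1865, and the apparent history of the usage changes even though no text has changed.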

Google has also included in its metadata a system of subject-matter classification designed for the book trade known as BISAC. How is that a problem?

Well, Google may not be applying the risible BISAC subject categories anymore, at least not to older titles that weren't given BISAC categories when they were published. (The BISAC standard is only a couple of decades old.) So "Madame Bovary" is no longer classified in Antiques and Collectibles!

BISAC is just right for a local Barnes and Noble store, but even when it's correctly applied, it's hopeless for a larger collection. It was designed as a way for publishers to tell booksellers where to shelve a book so that their customers could find it. That's why BISAC has 20 subcategories for children's books about various animals -- books about bears, or about monkeys, for example -- but only one category for European poetry. In a retail bookstore, you're not going to have a section for 18th century Italian poetry or 17th century German poetry; all the European poetry is going to be shelved together. But that's a ridiculous way to classify the collection of the Harvard Library.

How did some of the more outrageous mistakes happen, such as categorizing Walt Whitman's "Leaves of Grass" as a book about botany or listing Henry James as the author of "Madame Bovary"?

I still don't know what the story is. Several people at Google took pains to respond to my original blog posting about this issue, and they claim that many of these errors originated with the providers (libraries or commercial services hired to provide metadata about books), not Google. It's true that no metadata source is perfect. The Harvard Library makes mistakes, too. But nothing on the scale I found in Google Books. The Harvard Library does not have Henry James as the author of "Madame Bovary."

My guess would be that there was an edition of "Madame Bovary" that had James' name on it somewhere, maybe as the author of an introduction, and in the automated process of scanning the books, the wrong name got identified as that of the author.

I thought it was a machine error, too, but Google assured me that they had people doing this by hand. In some cases, they got their metadata from a provider in Armenia. They say that they want to have a diversity of sources to get a more complete classification for every book, but that's just silly. The metadata at the Harvard Library was done by hand by smart people who know how to catalog.

People at Google are also saying, "Let's crowdsource this," but that is a stupid idea. You and I are both smart, knowledgeable people, but I wouldn't trust either of us to do the skilled work of cataloging an 1890 edition of "Madame Bovary." It's very difficult. It has to be coordinated by uniform standards. An example of the kind of mess you get when you don't use uniform standards is Wiktionary (the lexical counterpart of Wikipedia). Unlike an encyclopedia, a dictionary isn't useful unless it's consistent in style. And metadata is hard to fix if you don't get it right in the first place. Someone has to spend a lot of money to properly catalog a research library, and I don't know if Google understood that going into it.

But surely these books are already cataloged by the libraries whose collections Google is scanning?

Yes, but Google isn't using an alternate, more comprehensive system, such as the Library of Congress cataloging system. They could license that. In time, you could generate all kinds of interesting new classifications, too, but you have to have the old ones.

What are some of the other problems you've had trying to do linguistic research using Google Books?

I can't find all the volumes of the Century Dictionary (an important lexical reference first published in the late 1800s) in a particular edition at once. Sometimes a volume comes up and sometimes it doesn't. Sometimes I get volumes from different editions. Serial works are also difficult. I've been researching the changing use of the word "sensitivity." I'd get hits for numbers of a journal that began publishing in the 1950s, so all of them are dated in the '50s, even though the issue where the word was found is actually from the 1970s.

Then there are problems with the scanning itself. I was researching the history of the word "cad," and got a result in the Transactions of the Philological Society from the late 19th century challenging the OED definition. But I can't read the first four pages of it because all four pages are bunched together and there's someone's thumb in the image. Now, no one is going to go back and rescan those pages -- it would cost more than scanning the whole shelf -- so that's it. As far as the digital collection is concerned, those pages are lost. I could find them by going to the Bodleian Library (in Oxford, England) and asking them to pull that out of whatever deep storage they have it in, but realistically, I'm not going to do that. It's too difficult to get to.

It's not like the information is actually lost, however, and it's not like that information wasn't just as difficult to get to before Google Books came along.

You're absolutely right. People have accused me of looking a gift horse in the mouth. Let me be clear: I love Google Books. It's an amazing resource for scholars. I don't think they knew what they were getting into, though. Of course, if they hadn't been insensitive to the subtleties of the task, maybe they wouldn't have taken it on. A friend who's worked there told me that it's a culture that rewards innovation, even if it's something relatively useless, like a map function that shows you all the place-names mentioned in a book. You get less credit at Google for making sure that old things continue to work well.

Since my initial blog posting, however, Google has shown themselves to be aware of what they're dealing with. They want to see themselves in the right light and they don't want to be seen as criticizing librarians. My goal was really to get the librarians to talk to Google, because until recently they've been taking it for granted that Google Books will do it right.

Because if this really is the "last library," as I put it, and no one is going to go back and do all this scanning again, which I think we can all agree is probably the case, then it's really important that it be done right. And it's going to cost a lot of money to do it. A disproportionate percentage of the resources have to go to a relatively small percentage of users. That's what a research library is all about. That is the nature of scholarship.

Referenced in this article:

Geoffrey Nunberg's original post on the Language Log blog, with comments from Google representatives.

Geoffrey Nunberg's article about Google Books in the Chronicle of Higher Education.

from: Salon.com
