October 20, 2014

Unlocking HathiTrust: Inside the Librarians’ Digital Library

What every librarian should know about this huge “digital library by libraries for libraries”

I had a great lesson in misperception correction not long ago as I was writing an LJ post called “A Rising Tide: The Academic User and the Ebook Experience,” which examined academic ebooks from a (critical) user perspective. The inspiration for this article came while working with a faculty member at UC Berkeley on trademark research. We discovered a series of promising volumes in HathiTrust (an emerging digital library stemming from research library collaboration with Google Books and other initiatives), only to find that poor scanning and metadata made the volumes virtually unusable.

At the time, I took this as evidence of widespread and often discussed problems within large digitized collections. In reality, I had stumbled upon a rather unusual error that, once brought to their attention, was rapidly addressed by HathiTrust staff.

Even after this positive resolution, lingering assumptions still led me to unintentionally mischaracterize the digital library landscape as I wrote. Luckily, when I worked with HathiTrust on fact checking, I received a rude (albeit politely communicated) awakening. After being graciously schooled on a point or two and rewriting half of the piece the night before submission, I started to wonder why, as a public services librarian and end user, I had harbored so many misplaced beliefs about digital libraries in the first place.

From the Inside Out

As librarians and users, we constantly encounter digital discovery interfaces and collections, but we don’t necessarily interact with the individuals who make them happen. This contributes to a widespread lack of insight into what these operations actually require. Who are the people behind the products? How do they work? When the human element is missing, interfaces can seem inscrutable.

The more I communicated with HathiTrust, the more I realized that they are a surprisingly small group of library-minded folks doing a herculean job not only participating in mass digitization projects with Google and the Internet Archive, but building a new, large-scale digital library with its own features and services. Enter the inspiration for this follow-up interview: to correct my (and others’) misperceptions about this important and emerging librarians’ digital library.

If you don’t know what HathiTrust is yet, you should. I set out to ask Heather Christenson and Paul Fogel, two individuals on HathiTrust’s front lines, to offer a rare inside angle on this vast (and growing) collection.


Char Booth: What’s the HathiTrust elevator pitch?

Christenson: In a world where commercial forces have staked out wide swaths of digital territory, HathiTrust is a digital library by libraries for libraries, and on a huge scale. We continue and extend libraries’ traditional role (building, curating, and preserving useful collections and providing the best access possible to patrons) in the digital realm. HathiTrust is a collaborative enterprise, supported by funding and in-kind contributions from participating libraries.

And what’s with the name?

Christenson: Yes, that is a popular question! The name was chosen to express the values of our organization. Hathi (pronounced hah-tee) is the Hindi word for elephant, an animal noted for its memory, wisdom and strength – that’s the origin of the elephant on our logo. Trust is a core value of libraries.

How did HathiTrust come about?

Christenson: In 2008, major university libraries across the US came together to provide a common point to preserve and provide access to the millions of books they have digitized with Google and other partners. Several busy and collaborative years later, HathiTrust is a robust shared repository of books (and book-like volumes) that continues to grow at a rapid pace. The University of California is one of its founding partners, and the California Digital Library coordinates and contributes to activities across the UC campuses related to mass digitization (including HathiTrust).

What’s actually in HathiTrust? Hard numbers, if you please.

Fogel: As of the beginning of June 2011, there are:

8,793,479 total volumes
4,794,161 book titles
212,767 serial titles
3,077,717,650 pages
394 terabytes
104 miles of shelf equivalent
7,145 tons of print matter equivalent
2,398,461 volumes (~27% of total) in the public domain
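
A few derived figures help put these totals in perspective. The sketch below only does arithmetic on the numbers Fogel lists. The variable names are ours, as is the assumption that a terabyte means 10^12 bytes. The page total works out to exactly 350 pages per volume, which suggests (though the interview does not say so) that it is an estimate rather than a direct count.

```python
# Figures quoted by Fogel (as of the beginning of June 2011).
total_volumes = 8_793_479
public_domain_volumes = 2_398_461
total_pages = 3_077_717_650
total_terabytes = 394

# Share of all volumes that are in the public domain.
public_domain_share = public_domain_volumes / total_volumes

# Average number of pages per volume.
pages_per_volume = total_pages / total_volumes

# Average storage per volume in megabytes, ASSUMING 1 TB = 10**12 bytes.
megabytes_per_volume = total_terabytes * 10**12 / total_volumes / 10**6

print(f"Public domain share: {public_domain_share:.1%}")
print(f"Pages per volume:    {pages_per_volume:.0f}")
print(f"Storage per volume:  {megabytes_per_volume:.1f} MB")
```

Under those assumptions, the public domain share matches the roughly 27 percent quoted above, and the average volume takes up on the order of 45 MB.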

Christenson: What’s going in encompasses an incredible variety of subjects, languages, geographic origins – essentially anything and everything in the collections of research libraries. It includes books, journals, federal, state, and local government documents, musical scores, expedition reports, magazines, pamphlets, and more.

In other words, HathiTrust is huge. Who can use it, and for what?

Christenson: Anyone with access to the web can go to the HathiTrust site to discover a rich variety of books, serials, and other publications. Each item has a persistent URL, so links can be shared.

Fogel: There are lots of ways to use HathiTrust. You can read books. You can download public domain volumes, either a page at a time or the full book in some cases. You can build your own collections of books, share them with others, view others’ collections and search across the full text of any (public) collection. Libraries have tons of access points into Hathi: APIs, metadata feeds, data sets, widgets. These can be used to build functionality that is limited only by the developer’s creativity. Mash it up!

Christenson: We’ve been hearing about lots of wonderful use cases: scholarly research, finding answers to reference questions, in-depth learning about historical events and genealogy, and, of course, simply reading.

How did all of the content you mentioned actually get into HathiTrust?

Fogel: Lots of hard work and collaboration around workflows, policies and standardization, which were all used to build a system automating the collective decisions made. There are lots of people and lots of machines really plugging away to make the process as smooth and efficient as it can be.

Christenson: It’s been heartening to see content from so many partners flowing in this past year (we report each month by partner in the HathiTrust newsletter available on our website). Bibliographic metadata goes in first, followed by book files. Metadata serves as a manifest. Ingest is coordinated—content from multiple partners goes in each month. For actively digitizing partners such as the University of California, the content is loaded as it gets digitized.
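
The idea that metadata “serves as a manifest” can be illustrated with a toy example. This is only a sketch of the general pattern, not HathiTrust’s actual ingest code: the function name `plan_ingest`, the record and file shapes, and the volume identifiers are all hypothetical.

```python
def plan_ingest(manifest_records, delivered_files):
    """Sort volume ids by whether metadata and book files have both arrived.

    manifest_records: list of dicts with a "volume_id" key (hypothetical shape).
    delivered_files: dict mapping volume_id -> file name (hypothetical shape).
    Returns (ready, unexpected, missing) as sorted lists of volume ids.
    """
    expected = {record["volume_id"] for record in manifest_records}
    delivered = set(delivered_files)

    ready = sorted(expected & delivered)       # metadata and files both present
    unexpected = sorted(delivered - expected)  # files with no metadata yet: hold back
    missing = sorted(expected - delivered)     # metadata loaded, files still to come
    return ready, unexpected, missing


# Made-up identifiers, for illustration only.
manifest = [{"volume_id": "vol-001"}, {"volume_id": "vol-002"}, {"volume_id": "vol-003"}]
files = {"vol-001": "vol-001.zip", "vol-003": "vol-003.zip", "vol-999": "vol-999.zip"}

ready, unexpected, missing = plan_ingest(manifest, files)
print("ready:", ready)
print("unexpected:", unexpected)
print("missing:", missing)
```

Loading metadata first means every incoming file can be checked against a list of what is expected, so stray or incomplete deliveries are caught before anything is stored.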

Who coordinates and contributes?

Christenson: It’s an evolving process—HathiTrust’s infrastructure is hosted at the University of Michigan (with a backup at Indiana University), so our UM team has provided primary support and development. Other partners have contributed to development work; for example, California Digital Library (CDL) collaborated closely with Michigan to build the path to get Internet Archive-digitized volumes into HathiTrust, paving the way for other partners.

Fogel: Beyond the organizational structure Heather described, there are a variety of working groups and committees attending to specific issues, such as usability, user support, discovery, full text search, and storage. These collaborations are made up of partner members interested in that particular area. There are also special development projects that individual partners have undertaken, such as a new metadata management system, or that multiple institutions collaborate on, such as developing the workflow for adding a new content type. There are also grants that are underway, such as Paul Conway’s work at Michigan to validate digital objects for specific uses.

What does HathiTrust do that Google Books or Internet Archive doesn’t?

Christenson: One of the most important distinctions is that HathiTrust has a stated intention to preserve digital volumes over the long term. Our goal is for the researcher to be able to use these items in 20 years, 50 years, and onward. This is an important core value of HathiTrust, bolstered by our recent certification under CRL’s Trustworthy Repositories Audit & Certification process.

HathiTrust content is representative of research libraries and reflects decades, even centuries, of curation by librarians. Although much of the content in HathiTrust is the product of scanning partnerships with Google, HathiTrust has an increasing amount of content not found in Google Books. We have included content from other major digitization projects such as Internet Archive and Microsoft-funded work with Internet Archive and other digitization agents. Some HathiTrust partners such as Michigan have contributed locally digitized collections as well.

HathiTrust also has a number of services and initiatives to make our data open and available. For example, we provide a regularly updated file of the aggregate HathiTrust bibliographic data so that libraries can tap into it to provide links in local catalogs and discovery tools. The HathiTrust Research Center has just been launched in an agreement with Indiana University and the University of Illinois, where scholars will soon be able to utilize the entire full text of our public domain volumes for computational research.

Why do you think HathiTrust is important to libraries, writ large?

Fogel: There is a lot of talk right now about a national digital library, and to me HathiTrust is a key effort in that direction that is already a reality. There are clear advantages to aggregation at this scale: breadth of coverage; heightened visibility; savings for individual institutions for content storage and resource sharing; the development of community standards and technical expertise. Ultimately it means that more users’ needs are served. Also, digital preservation (of the cultural record, of digitally born data) is a big deal. Libraries have long focused on preserving print collections and it only follows that they should also preserve digital materials. But preservation should be in the service of access and use. It is early days for the digital library, but it just isn’t efficient that every library (school libraries, public libraries, academic libraries) has to build its own services.

Christenson: Beyond the obvious benefit of broad access to an enormous aggregation of digitized volumes, HathiTrust is an important venue for libraries to focus collective attention on solving shared problems. We benefit from an economy of scale, driving storage costs down for a digital repository and services that we all share. With pooled records, we can take a more holistic look at our metadata and solve problems specific to how metadata designed for print associates with digital content, and how that plays out in discovery services. This shared digital collection also gives us an opportunity to coordinate collection management strategies across libraries and reduce local costs.

What is HathiTrust doing to make works more open?

Christenson: HathiTrust is the locus of activity for copyright review to open up digital volumes that may be in the public domain for access. Since 2008, the University of Michigan has been working on an IMLS-funded distributed system called the Copyright Review Management System (CRMS), including a database, that facilitates this research and enables us to share the work. A number of HathiTrust partners are now contributing to this effort, and many more will follow as the work expands. Through this work, over 125,000 volumes have already been released into the public domain. [See LJ’s coverage of “HathiTrust’s Copyright Detectives.” -Ed.]

Fogel: There is a well-documented process and workflow to allow individual authors to open up their works as broadly as they are comfortable with. There are groups working on manual copyright review and the clearing of rights, and the organization has a strong voice in the dialog around orphan works and should help to keep pushing it forward.

[Figure: HathiTrust content growth]

What are some of the technical challenges you deal with?

Christenson: In some ways, technical matters are far easier than policy and organizational challenges. I would say that the HathiTrust approach to technical challenges so far has been “bring ’em on!” The HathiTrust full-text search is a case in point. University of Michigan developers built the full-text search, and as the HathiTrust partnership grew and content ballooned, they kept up, eventually hitting the boundaries of the indexing formatting technology we were using, and requiring an enhancement from the developers of this particular open source “search engine library.” We’re now indexing literally billions of pages. So here is a place where libraries are really stepping up to the plate to expand our technical horizons, dealing with digital content at a monumental scale.

Fogel: I personally enjoy the clash between the heterogeneity of content (including not just books, but images, multimedia, data sets) and the need to standardize and normalize objects for management & delivery. Also, how do you prioritize and select new content types? There is a definite tension in trying to develop services for new content types while still maintaining (and improving!) current or core services.

You mentioned “policy challenges”—such as?

Fogel: They are legion. Copyright and access rights. Fair use. Duplication within the repository between partners. Balancing the needs of all 50+ partner interests. Collection development. Quality of books and quality of services. All of these are complex issues, and there are many more.

What implications does HathiTrust have for print collections? Will contributing libraries start getting rid of their books?

Fogel: I hear that question a lot. Certainly some partner libraries may make that decision. Storing books is expensive. Given the state of library funding these days, everyone is trying to find ways to save money. But to my thinking, Hathi provides additional access points and new services such as full text search, data mining, etc. rather than an alternative to print collections. It is in the service of making all of the stuff in those books that much easier to discover.

Christenson: HathiTrust libraries are greatly interested in how our digital copies might be leveraged in the context of print management, given the high percentages of overlap between HathiTrust and academic research libraries reported in recent studies. Our Collections Committee is in the process of developing a proposal for collaborative print management activities. That said, I look forward to seeing more research from the library community on the relationship for users between print and digital versions. I think we have a lot more to learn about what user behavior surrounding the digital vs print dynamic really is. And of course, books that are in copyright are not viewable online, so that certainly must factor into decisions about print.

I described my own experience encountering a quality snag in HathiTrust, which turned out to originate with problems in the print copy that was scanned. Important question: what is your team doing to check the quality of items digitized by so many different sources?

Fogel: It’s undeniable that, working at this scale and with these materials, not every book will be perfect; actually, perfection is downright impossible. The most important thing is to make as much of this content accessible to as many people as we can right now. Improvement and cleanup tasks are going to occupy us for a long time, but we’re still doing everything we can to try to get it right now so that we can minimize that work.

When working with this many books, manual review is obviously a challenge, but there is a team that examines a percentage of all books to verify their condition and identify any possible problems in a copy. As you mention, these books have been scanned from library copies—books that have already lived a long life and have been well loved—and some enter the digitization process with flaws of their own. Aside from rigorous validation routines that occur during ingest, Hathi also tries to incorporate as many quality metrics as are available from digitizing agents. We’ve seen that the best and only way to really judge a book’s quality is to have a human look at it.

Where does the metadata in HathiTrust come from, and how is it being used?

Fogel: Metadata has its origins in partner catalogs. It is normalized, augmented and then “merged” to make it easier for users to find things in ways that make sense to them. There are a number of discovery interfaces, from standard ILS-type functions to a search across all of the nearly 9 million books. That’s really powerful and important. Hathi also takes pains to make it easy for other libraries to surface links to books in their own systems by supplying lots of different metadata feeds.
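
To make the “normalized, augmented and then merged” step a little more concrete, here is a deliberately simplified sketch. It is not how HathiTrust actually processes metadata: the matching rule (normalized title plus year), the record fields, and the library names are all hypothetical, and real bibliographic matching is far more involved.

```python
from collections import defaultdict


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in title.lower())
    return " ".join(cleaned.split())


def merge_records(partner_records):
    """Group records whose normalized title and year match, keeping every contributor."""
    merged = defaultdict(list)
    for record in partner_records:
        key = (normalize_title(record["title"]), record["year"])
        merged[key].append(record["partner"])
    return {key: sorted(partners) for key, partners in merged.items()}


# Made-up catalog records from two hypothetical partners.
records = [
    {"partner": "Library A", "title": "Expedition Report, Vol. 1", "year": 1899},
    {"partner": "Library B", "title": "expedition report vol 1", "year": 1899},
    {"partner": "Library A", "title": "Musical Scores", "year": 1902},
]

merged = merge_records(records)
for (title, year), partners in merged.items():
    print(title, year, partners)
```

The first two records differ only in capitalization and punctuation, so after normalization they collapse into a single entry that lists both contributing libraries.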

One characteristic of HathiTrust is that its work is spread among many geographically dispersed teams. What is it like to collaborate in this new kind of “network scale” organization?

Fogel: Collaboration at this level can be challenging, but it is very rewarding. Communications are probably the most difficult part, so strong project management and organization are vital. One of the best parts, though, is meeting, working, and spending time with cool people that you may have never met otherwise.

Christenson: It’s been very rewarding and, I think, empowering for libraries. Joining in on this collaborative enterprise has really forced us to be clear on our own organizational priorities, and has helped us consider which services are appropriate for us to provide at the local versus “network” level. And the people involved have been learning how to communicate across distance and organizational cultures, developing common vocabulary, coming up with practices that can be shared, and generally developing great relationships and community with colleagues across the country. These are all activities that I think will be important to libraries going forward.

Agreed. Thanks, both.


Author Information
Char Booth (charbooth@gmail.com) is Instruction Services Manager & E-Learning Librarian at the Claremont Colleges Library. She blogs at infomational.com, tweets @charbooth, and is the author of Reflective Teaching, Effective Learning and Informing Innovation.
Heather Christenson is Mass Digitization Project Manager and HathiTrust Project Manager at the California Digital Library (CDL). http://www.cdlib.org/contact/staff_directory/hchristenson.html
A restaurant industry and dot bomb survivor, Paul Fogel is currently the technical lead for mass digitization and the co-technical lead for the HathiTrust at the University of California’s California Digital Library. http://www.cdlib.org/contact/staff_directory/pfogel.html
