November 19, 2014

HathiTrust Doubles DPLA Collection with More Than Three Million Books

The HathiTrust Digital Library will become the single largest content hub of the Digital Public Library of America (DPLA), the two institutions announced on June 18.

The metadata records associated with some 3,384,638 volumes (and growing daily) held by the HathiTrust will be accessible on the web at dp.la, and through the DPLA application programming interface (API). (The digitized volumes themselves will continue to reside in HathiTrust.)

“HathiTrust’s joining the Digital Public Library of America more than doubles the size of our unified collection, and—as so many have asked for—fills it with millions of books. We couldn’t be more delighted. Over the last five years, HathiTrust has built an incredible digital infrastructure to store the scanned holdings of its many university and library partners, and we in turn look forward to providing a large general audience for these valuable works, and new pathways into them,” said Dan Cohen, DPLA’s executive director.

Lynne Raughley, Communications Director, University of Michigan Library, and a spokesperson for the HathiTrust, told LJ, “Staff from DPLA and HathiTrust have been collaborating on large-scale metadata specifications and transfer. DPLA has already done a test import of some HathiTrust data and it is currently in its development environment for evaluation. We anticipate making this content available through the DPLA website and API in the next four to six weeks.”

HathiTrust Executive Director John Wilkin (soon to be stepping down from that position) stressed that HathiTrust will continue to enhance its own discovery and access platform, but said that DPLA puts HathiTrust’s collection before a broader audience, alongside innovative search and use tools, including timelines, maps, and a growing number of apps.

DPLA will be working to add a special interface for books to supplement its map and timeline browsing interfaces, but the HathiTrust content will be available through the current site as soon as the data is loaded.

Of the existing apps, Cohen told LJ, “DPLA suspects that StackLife, an innovative tool from Harvard’s Library Innovation Lab for browsing, searching, and reading books made available via the DPLA, will take particular advantage of the new material from HathiTrust.”

Why Not Everything?

Raughley told LJ that the volumes (about a third of the trust’s 11 million total holdings) include “those works that are in the public domain worldwide, some that are in the public domain according to U.S. law (and thus to readers in the U.S.), and some that are [Open Access] (by designation of the rights holder).”

According to Raughley, “Many of these volumes (in all categories) were digitized by Google; some were digitized as part of other large-scale digitization efforts (e.g., Microsoft and [the Internet Archive]); some were digitized by partner libraries.” Google-digitized materials come with some restrictions on reuse of digital images and OCR. But because the vast majority of the content DPLA exposes is in the public domain, users may not be expecting to encounter such restrictions. It’s not yet clear how HathiTrust will adjust its practices regarding conveying these limitations to users.

The ongoing litigation between the HathiTrust and the Authors Guild poses no impediment to the current plan. But, Raughley said, “While the lawsuit will not restrict access to the works now known to be in the public domain, it could limit HathiTrust’s ability to make copyright determinations on the many other works in the nearly 11 million volume collection.”

Dancing Around the Copyright?

The Authors Guild lawsuit is not the only dispute that could have cast a pall over this union. As Karen Coyle wrote in February, there is a conflict between the DPLA metadata policy, which states “the DPLA believes that the vast majority of metadata is not subject to copyright,” and OCLC’s, which claims copyright in WorldCat as a compilation, though not in individual records.

The partnership went with DPLA’s fallback position, in which metadata is contributed under the terms of a Creative Commons “CC0” license, with OCLC’s support. Sandy Yee, Chair of the OCLC Board of Trustees, said in a statement that DPLA’s Data Use Best Practices, which request that users provide attribution to metadata providers, are in keeping with OCLC community data norms.

Coyle told LJ, “It appears that this was much negotiated with OCLC,” back when Harvard first released its metadata last year. OCLC normally recommends that records be released under the Open Data Commons Attribution License (ODC-BY), which requires attribution. Harvard went with the CC0 license instead, but specifically asked users to follow WorldCat community norms for WorldCat-derived data, in addition to Harvard’s own.

Coyle cited a post by Jim Michalko, vice president, OCLC Research Library Partnership, on the OCLC blog “Hanging Together” in April. Michalko said then, “We think this is a well-intentioned and executed compromise. It’s true we don’t think that public domain dedications for data derived from WorldCat are consistent with the OCLC cooperative’s norms… We also recognize that the WCRR [WorldCat Rights and Responsibilities] statement is not a legally binding document and that interpretations of these community norms within the cooperative may differ.”

Said Coyle, “From the point of view of data sharing, which is one of the primary goals of DPLA, only data that is unfettered by license conditions is suitable to be mixed into a large melting pot of metadata from a wide variety of sources. Technically, this really is the only acceptable solution.”

This article was featured in Library Journal's Academic Newswire enewsletter.

Meredith Schwartz (mschwartz@mediasourceinc.com) is Senior Editor, News and Features of Library Journal.
