May 28, 2017

HathiTrust: 80 Terabytes & Growing

The HathiTrust digital repository (see News, p. 13, for more background) is a massive undertaking: with content from just two of 23 participating universities added so far, the archive already includes nearly 80 terabytes of data, the digital equivalent of 25 miles of books on library shelves.

Though there are similarities to Google’s Book Search project in terms of scale, John Wilkin, associate university librarian of the University of Michigan and executive director of HathiTrust, says the two endeavors are quite distinct: “The primary difference will be in our commitment to long-term preservation of this information and Google’s commitment to access. That said, we will provide some minimal levels of access (public domain works, etc.), and we will work to identify specific scholarly needs that Google is less likely to serve. For example, data mining and large-scale linguistic computation is more likely to be in our bailiwick than Google’s.”

Of course, providing access to such an immense body of source material, even if only to works in the public domain, poses its own difficulties: “Full-text searching of a large number of documents will undoubtedly lead to the retrieval of a large number of results, and thus usability problems,” says the trust’s site. To combat this, the technology team supporting HathiTrust is exploring uses of faceted browsing to help users navigate search results, as well as other means of refining its “Large-Scale Search.”

About Josh Hadro

Josh Hadro (@hadro on Twitter) is the former Executive Editor of Library Journal.
