April 19, 2014

HathiTrust: 80 Terabytes & Growing

By Josh Hadro

The HathiTrust digital repository (see News, p. 13, for more background) is a massive undertaking: with content from just two of 23 participating universities added so far, the archive already includes nearly 80 terabytes of data, the digital equivalent of 25 miles of books on library shelves.

Though there are similarities to Google’s Book Search project in terms of scale, John Wilkin, associate university librarian of the University of Michigan and executive director of HathiTrust, says the two endeavors are quite distinct: “The primary difference will be in our commitment to long-term preservation of this information and Google’s commitment to access. That said, we will provide some minimal levels of access (public domain works, etc.), and we will work to identify specific scholarly needs that Google is less likely to serve. For example, data mining and large-scale linguistic computation is more likely to be in our bailiwick than Google’s.”

Of course, providing access to such an immense body of source materials, even just those in the public domain, poses its own difficulties: “Full-text searching of a large number of documents will undoubtedly lead to the retrieval of a large number of results, and thus usability problems,” says the trust’s site. To address this, the technology team supporting the HathiTrust is exploring innovative uses of faceted browsing to aid in navigation of search results, as well as other means of refining its “Large-Scale Search” for this prodigious endeavor.
