February 22, 2017

Project Gutenberg Meets WorldCat

Last week, I interviewed OCLC’s Bruce Washburn about an OCLC Research project called oclcBot, a program that takes book records in the Internet Archive’s Open Library and matches their ISBNs with the corresponding OCLC numbers. I recently found out how another massive book site makes use of OCLC records, though in a more low-tech way.

Project Gutenberg is home to some 33,000 public-domain ebooks, and has become a go-to destination for new e-reader owners looking for free reading materials. Librarians have often directed Kindle owners to Project Gutenberg to soften the blow of having no OverDrive ebooks available for the popular device (although Kindle OverDrive ebooks will finally become available later this year). The Colorado Library Consortium (CLiC), in collaboration with other Colorado libraries, created a set of MARC records last December for popular Project Gutenberg content (with direct links to the downloadable ebooks and audiobooks) that libraries could easily put into their own catalogs.

But some librarians may not be aware of the human factor behind all that digitized text. Project Gutenberg texts are often scanned by volunteers and run through optical character recognition (OCR) software. Human proofreaders are an integral part of the process, making countless small corrections to a text before it is posted.

One Project Gutenberg affiliate that does this work is Distributed Proofreaders, a group of hundreds of volunteers that has completed proofreading of more than 20,000 works in just over a decade.

On occasion, just one missing page or ink-smudged passage can become a stumbling block to making a public-domain work available at all. Such problems are crowdsourced on the Distributed Proofreaders wiki, where the “Missing Pages” page offers a fascinating glimpse of the huge amount of work that goes into Project Gutenberg’s corpus, as proofreaders post their requests to the community.

So how do these volunteers use OCLC records? The same way everyone else does: to find specific copies of books. The wiki provides WorldCat links to many “Missing Pages” books to help locate new copies to scan. There may even be a few at your library.

About David Rapp

David Rapp (drapp@mediasourceinc.com) was formerly Associate Editor, LJ.
