February 8, 2016

Project Gutenberg Meets WorldCat

Last week, I interviewed OCLC’s Bruce Washburn about an OCLC Research project called oclcBot, a program that takes book records in the Internet Archive’s Open Library and matches their ISBNs with the corresponding OCLC numbers. I recently found out how another massive book site makes use of OCLC records, though in a more low-tech way.

Project Gutenberg is home to some 33,000 public-domain ebooks, and has become a go-to destination for new e-reader owners looking for free reading materials. Librarians have often directed Kindle owners to Project Gutenberg to soften the blow of having no OverDrive ebooks available for the popular device (although Kindle OverDrive ebooks will finally become available later this year). The Colorado Library Consortium (CLiC), in collaboration with other Colorado libraries, created a set of MARC records last December for popular Project Gutenberg content (with direct links to the downloadable ebooks and audiobooks) that libraries could easily put into their own catalogs.

But some librarians may not be aware of the human factor behind all that digitized text. Project Gutenberg texts are often scanned by volunteers and run through optical character recognition (OCR) software. Human proofreaders are an integral part of the process, making countless small corrections to a text before it is posted.

One Project Gutenberg affiliate that does this work is Distributed Proofreaders, a group of hundreds of volunteers that has completed proofreading of more than 20,000 works in just over a decade.

On occasion, just one missing page or ink-smudged passage can become a stumbling block to making a public-domain work available at all. Such problems are crowdsourced on the Distributed Proofreaders wiki, where proofreaders post their requests to the community. The “Missing Pages” wiki page offers a fascinating glimpse of the huge amount of work that goes into Project Gutenberg’s corpus.

So how do these volunteers use OCLC records? The same way everyone else does: to find specific copies of books. The wiki provides WorldCat links to many “Missing Pages” books to help locate new copies to scan. There may even be a few at your library.

About David Rapp

David Rapp (drapp@mediasourceinc.com) was formerly Associate Editor, LJ.
