November 16, 2017

Wisdom of the Crowd | Digital Collections

In 2005, a team of marine biologists led by Glenn Jones of Texas A&M University used inflation-adjusted pricing data from the New York Public Library’s (NYPL) collection of 45,000 restaurant menus, among other sources, to confirm the commercial overharvesting of abalone stocks along the California coast beginning in the 1920s, a sharp decline in the East Coast’s inshore lobster population that began in the 1950s, and other information regarding historical stocks of swordfish, oysters, halibut, haddock, sole, and other popular seafood.

This goes to show how special collections can prove useful to researchers in varied and unexpected ways. Here, a novelist peruses NYPL’s unique menu collection to cook delectable dinner details into a historical mystery. There, a paleoceanographer expands his field’s understanding of how seafood consumption trends of the 19th and 20th centuries affected the sustainability of modern fisheries.

But the research conducted by Jones and his team was undoubtedly painstaking. A portion of the menus—which date from the 1840s to the present—had been digitized, but diverse typography and layouts made consistent transcription via modern optical character recognition (OCR) software impossible. Researchers looking to extract data from the collection simply had to sift through physical documents or online images and read them one by one.

Like many libraries in the past few years, NYPL has since asked the public for help with the collection. In April 2011, the library’s NYPL Labs team launched its “What’s on the Menu?” website and crowdsourcing project, funded partly by a $50,000 grant from the National Endowment for the Humanities (NEH) that was used to develop an online transcription tool. The project was an immediate hit with volunteers, who have since been transcribing menus as quickly as NYPL Labs can digitize them. To date, almost 430,000 dishes have been transcribed from almost 18,000 menus, making it possible to search the collection and extract useful data.

“What we’re trying to set up at NYPL Labs is the idea that digitization is the beginning of a whole life cycle, [followed by] transformations, further processing, and downstream uses, rather than an end point,” NYPL Labs director Ben Vershbow told LJ. Once upon a time, libraries “looked at collection stewardship as ending with digitization—now it’s online, we’ve done it! And [digitization] is, clearly, a huge step forward. But, then, how do you really develop [an online collection] and create public value out of it? A lot of that involves crowdsourcing…one of the important tools that we can bring to bear on that question of ‘we digitized it, now what?’ ”

The business end

The term crowdsourcing has a decidedly for-profit origin. Wired contributing editor Jeff Howe coined the term in his June 2006 feature “The Rise of Crowdsourcing.” The deck of the feature made an explicit reference to outsourcing, drolly stating, “Sending jobs to India and China is so 2003.” The article went on to examine how corporations were beginning to leverage the Internet to distribute tasks that require human brain power, thereby lowering personnel costs while completing these tasks more quickly. Examples included the Eli Lilly–funded InnoCentive site and other commercial crowdsourcing ventures such as Amazon’s Mechanical Turk.

Yet the concept of tackling giant projects by parceling out work on the web has deeper roots in initiatives such as Distributed Proofreaders (DP), which was established 15 years ago to facilitate the conversion of public domain books into ebooks by distributing OCR cleanup work to volunteers. DP became a Project Gutenberg site in 2002, and its volunteers have processed more than 30,000 complete ebooks as of 2015.

Even at large libraries that have staff dedicated to digitization projects, the additional effort needed to enable researchers to extract data from these collections—such as transcribing OCR-resistant text, or adding item-level tags to large collections of images—would be an untenable chore for a library to take on alone. So, in the past half decade, libraries have taken cues from long-running projects such as DP, using crowdsourcing as a way not only to outsource work that would be impossible for staff to attempt but also to engage volunteers.

“In some ways [crowdsourcing] is a new set of tools to do things that libraries have done for a long time,” says Trevor Owens, senior program officer, National Digital Platform, for the Institute of Museum and Library Services (IMLS). “Volunteers on the web, tagging or transcribing this stuff, is a continuation of a long-standing commitment and set of relationships between volunteers and libraries, archives, and museums.”

Perry Collins, senior program officer in NEH’s Office of Digital Humanities, adds that libraries should always consider public engagement with a collection as an “end goal” to any crowdsourcing effort. “The goal is not only to create hundreds of thousands of tags,” she says. “A major goal is also to engage people in the digital humanities and in library collections. While the quality of what they do matters a lot, I think the process of what they do matters a lot, too.”

IMAGINE THAT (Clockwise from l.): A historical menu from NYPL’s collections; an illustration of a sandgrouse from the Biodiversity Heritage Library’s Flickr stream; and a screen from Tiltfactor’s “Stupid Robot” tagging game

It’s official

Last August, IMLS announced a National Leadership Grant for Libraries award of $99,768 to the Tiltfactor Laboratory at Dartmouth College for the establishment of the Crowdsourcing Consortium for Libraries and Archives (CrowdConsortium).

“Given the increasing volume of digital content made available by libraries, archives, and museums, the practice of crowdsourcing holds great potential to address the challenge of providing item-level metadata needed to support online search and access,” Susan Hildreth, then IMLS director, said in a statement. “We anticipate that this project will lead to a community of practice that will not only help libraries and archives address their metadata needs, but also engage their virtual visitors in new and exciting ways.”

The NEH Office of Digital Humanities and the Alfred P. Sloan Foundation have provided additional funding support, and in May, CrowdConsortium presented “Engaging the Public: Best Practices for Crowdsourcing Across the Disciplines,” a three-day national meeting hosted by the University of Maryland.

With the aim of establishing best practices, shared toolsets, and strategies for crowdsourcing, CrowdConsortium represents an early milestone for the movement. Libraries need the assistance, and approaches for keeping volunteers engaged in making meaningful contributions to these projects are beginning to be standardized.

Original grant recipient Tiltfactor Labs is an “interdisciplinary innovation studio dedicated to designing and studying games for social impact.” Led by Mary Flanagan, Sherman Fairchild Distinguished Professor in Digital Humanities at Dartmouth and now project lead for CrowdConsortium, Tiltfactor hosts “The Metadata Games.” By playing simple online games such as “Zen Tag” and “Stupid Robot,” players help describe images and add tags to files from institutions including the British Library, Boston Public Library, Digital Public Library of America, and the American Antiquarian Society.

“Players get recognition for their knowledge, they get to have fun while exploring rarely seen artifacts, and they get satisfaction in contributing [to]—and improving—the accessibility and value of an institution’s collection,” Flanagan said in a 2013 interview with Owens on the Library of Congress’s digital preservation blog, The Signal. “The project provides a path toward a deeper experience with the collections and the institution.”

Degrees of difficulty

The motivational effectiveness of games and “gamification” features such as contributor leaderboards is still being studied by the library crowdsourcing community, notes Grace Costantino, outreach and communication manager for the Biodiversity Heritage Library (BHL), an international consortium of 20 natural history, botanical, and research libraries headquartered at the Smithsonian. But many projects are experimenting with games as a component. BHL, for example, is a strong believer in the potential of crowdsourcing and is currently working on several initiatives, including the “Purposeful Gaming and BHL” OCR correction project, which includes a game custom-designed by Tiltfactor.

“All of the people involved with [BHL] are very keen on embracing new activities or technologies that would allow us to do new things in a fast way,” says Martin Kalfatovic, associate director, Smithsonian Libraries, and BHL program director, explaining the library’s use of crowdsourcing.

BHL has not shied away from setting a higher bar for participation than many other crowdsourcing projects. The library has a substantial Flickr stream of more than 160,000 images, and one of its longest-running crowdsourcing efforts, launched in 2011, involves tagging content in that stream with “machine tags,” which take the format namespace:predicate=value.

Flickr was an early adopter and proponent of this tagging syntax a decade ago, and it remains a straightforward way to add machine-readable metadata to image files, enabling far more functionality and specificity than hashtags or basic descriptive tags when filtering, searching, or working with those files through an application programming interface (API). For example, when a user tags an image in the BHL Flickr stream with a taxonomic species name, such as “taxonomy:binomial=Syrrhaptes paradoxus,” the online Encyclopedia of Life (EOL) is set up to automatically flag and ingest the image into its “Species on Earth” collection, displaying it on a species page alongside related images, sound files, and videos of the Pallas’s sandgrouse.
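For readers curious how such tags are put to work, the short sketch below shows one way a downstream service could query Flickr’s public REST API for images carrying a particular machine tag. Flickr’s flickr.photos.search method does accept a machine_tags parameter; the API key, the helper name, and the exact tag query are placeholders, and real tag values may need Flickr’s quoting conventions. This is an illustration under those assumptions, not a description of EOL’s actual ingestion pipeline.

```python
# Minimal sketch: find Flickr images that carry a given machine tag, using
# the public REST endpoint and the flickr.photos.search method (which
# supports a machine_tags query parameter). API key and tag are placeholders.
import requests

FLICKR_REST = "https://api.flickr.com/services/rest/"
API_KEY = "YOUR_FLICKR_API_KEY"  # placeholder: obtain a key from Flickr


def find_tagged_images(machine_tag: str, per_page: int = 10):
    """Return photo records whose machine tags match the given query."""
    params = {
        "method": "flickr.photos.search",
        "api_key": API_KEY,
        # e.g. 'taxonomy:binomial=' matches any value in that namespace/predicate;
        # exact values containing spaces must be quoted per Flickr's syntax.
        "machine_tags": machine_tag,
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,
    }
    resp = requests.get(FLICKR_REST, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["photos"]["photo"]


if __name__ == "__main__":
    for photo in find_tagged_images("taxonomy:binomial="):
        print(photo["id"], photo["title"])
```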

Of course, by asking volunteers to use a specific tagging syntax and Latin binomial nomenclature as a naming convention, the trade-off is that BHL may also be limiting the pool of potential project participants.

“The struggle has been in the scaling,” Costantino says. BHL offers documentation for contributors and access to assistance via social media. But there is a learning curve, and it can help to have some level of scientific knowledge or search expertise when inscriptions on the images are difficult to read, for example.

Keeping it crowded

But if high standards have resulted in a smaller number of enthusiastic volunteers accounting for the majority of taxonomic tags in BHL’s Flickr stream, the project is in good company. Even projects that draw mainstream media coverage and would seem to have broad popular appeal, such as “What’s on the Menu?” at NYPL, tend to attract a core of power users who submit most of the work.

Understanding what motivates these major contributors is important for any project’s long-term sustainability. As Collins notes above, part of a library’s goal should be to ensure that these projects are meaningful and stimulating for all contributors.

“As crowdsourcing becomes more of a field, there are actually social scientists whose work is devoted to just that question: How do we get people motivated, why do they want to come to these [crowdsourcing projects], and even more so, why do they keep coming back when it’s not novel anymore?” Collins says. “Some of those questions are still hanging a bit.”

Obviously, participants find the work meaningful in some way and most likely discovered the project because it is in tune with an existing hobby or interest. So, Collins suggests, one key to motivation would be offering plenty of background on the project and explaining its goals or its potential scientific applications.

Costantino notes that participants in BHL’s Flickr tagging project can view, in part, the impact of their efforts when the images they tag are ingested by EOL’s website.

“If they see the results of their labor, that really helps generate renewed excitement,” Costantino says. She also suggests keeping lines of communication open for contributors, whether this is accomplished through social media channels or email. She adds that the variety of crowdsourcing efforts at BHL (which also include “Science Gossip,” an image description project involving Victorian-era citizen science journals, and “The Field Book Project,” involving the transcription of scientific field books from the Smithsonian) helps keep regular contributors from getting bored.

Ideally, a community of people with shared interests may emerge around long-running projects, generating a self-sustaining support network.

Vetting submissions

Experts interviewed for this story say that trolling or vandalism tends to be a nonissue for most library crowdsourcing projects, particularly those that require volunteers to register with a verified email address in order to participate. However, typos and other mistakes are inevitable, and additional checks for accuracy and consistency are needed for projects involving a substantial number of online volunteers.

In this regard, Owens describes crowdsourcing projects on a continuum. On one end, there are projects like Zooniverse’s Galaxy Zoo. There, volunteers are shown a black-and-white image of an unclassified galaxy and are asked a series of multiple-choice and yes-or-no questions accompanied by simple graphics, such as, “Is this galaxy smooth and rounded, with no sign of a disk?” (alongside graphics depicting a flat, circular galaxy and a spiral galaxy) and “Could this be a disk viewed edge-on: yes or no?”

The multiple-choice format ensures consistency across input fields. To ensure accuracy, Galaxy Zoo simply has multiple volunteers view and classify the same image, then compares their answers to arrive at a consensus. In aggregate, groups of volunteers have proven to be as accurate as individual astronomers, Owens says.
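The idea of consensus by redundancy can be pictured with a brief sketch. The vote counts, field names, and agreement threshold below are invented for illustration and do not reflect Zooniverse’s actual aggregation pipeline; they simply show how repeated classifications of the same image converge on a majority answer.

```python
# Illustrative sketch of consensus-by-redundancy: several volunteers answer
# the same multiple-choice question about one image, and the project keeps
# the majority answer once enough votes agree. Threshold values are arbitrary.
from collections import Counter


def consensus(votes, min_votes=5, min_agreement=0.6):
    """Return the winning answer if enough volunteers agree, else None."""
    if len(votes) < min_votes:
        return None  # not enough classifications collected yet
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count / len(votes) >= min_agreement else None


# Example: seven volunteers classify the same galaxy image.
votes = ["smooth", "smooth", "disk", "smooth", "smooth", "disk", "smooth"]
print(consensus(votes))  # -> "smooth" (5 of 7 agree, above the 60% threshold)
```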

At the other end of the continuum are projects like StoryCorps or the September 11 Digital Archive, organized by the American Social History Project at the City University of New York Graduate Center and the Roy Rosenzweig Center for History and New Media at George Mason University. There, contributors are asked to enter whatever they choose—such as a personal story—into a blank input field that asks, “How has your life changed because of what happened on September 11, 2001?”

“It is, from the very beginning, supposed to be about different perspectives,” Owens says. “In that case, the validity of those things is associated with [their being] personal experiences,” where consistency is not an issue.

Most library crowdsourcing efforts land somewhere in between, and many use a simple workflow in which contributors check one another’s work and/or flag mistakes.

At NYPL, for example, “We didn’t have massive technical resources to throw at [What’s on the Menu?],” Vershbow says. “We just have a basic honor system and a couple of safeguards built in to detect malicious behavior or hacking. Basically it’s just, ‘Here’s the menus, transcribe them; when you think it’s done, move it along, and it will go into a review category,’ ” until another volunteer confirms that it is correct.
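The workflow Vershbow describes is essentially a two-step state machine: one volunteer transcribes, and a different volunteer confirms. The sketch below is a minimal, hypothetical illustration of that honor-system flow; the class, state names, and sample data are invented and do not represent NYPL’s actual implementation.

```python
# Minimal sketch of a transcribe-then-review honor system: a menu page moves
# from "to_transcribe" to "under_review," and is marked done once a second
# volunteer confirms the transcription. Names and states are illustrative.
from dataclasses import dataclass


@dataclass
class MenuPage:
    page_id: str
    transcription: str = ""
    transcribed_by: str = ""
    status: str = "to_transcribe"  # -> "under_review" -> "done"

    def submit_transcription(self, volunteer: str, text: str) -> None:
        """A volunteer submits a transcription and moves the page to review."""
        self.transcription = text
        self.transcribed_by = volunteer
        self.status = "under_review"

    def review(self, reviewer: str, correct: bool) -> None:
        """A different volunteer confirms the work or sends it back."""
        if reviewer == self.transcribed_by:
            raise ValueError("reviewer must differ from the transcriber")
        self.status = "done" if correct else "to_transcribe"


page = MenuPage("menu-1912-0042")  # hypothetical identifier
page.submit_transcription("volunteer_a", "Oysters on the half shell, 25c")
page.review("volunteer_b", correct=True)
print(page.status)  # -> "done"
```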

This article was published in Library Journal.

About Matt Enis

Matt Enis (menis@mediasourceinc.com; @matthewenis on Twitter) is Senior Editor, Technology for Library Journal.
