To explain the utility of semantic search and linked data, Jeff Penka, director of channel and product development for information management solutions provider Zepheira, uses a simple exercise. Type “Chevy Chase” into Google’s search box, and in addition to a list of links, a panel appears on the right of the screen, displaying photos of the actor, a short bio, date of birth, height, full name, spouses and children, and a short list of movies and TV shows in which he has starred. Continue typing the letters “ma” into the search box, and the panel instantly changes, showing images, maps, current weather, and other basic information regarding the town of Chevy Chase, MD.
The panels are powered by Google’s Knowledge Graph, a massive knowledge base that launched in May 2012 with “more than 500 million [data] objects” drawn from sources including Freebase, Wikipedia, and the CIA World Factbook, “as well as more than 3.5 billion facts about and relationships between these different objects. And it’s based on what people search for, and what we find out on the web,” Amit Singhal, Google’s senior VP of engineering, wrote in an official blog post introducing the service. As the Knowledge Graph has grown, these panels have become an increasingly common feature that not only anticipates what users are searching for but also displays specific facts about people, places, historical events, and other topics in a convenient, quickly accessible format.
“As I talk to people about semantic technologies, and they ask whether it’s practical or whether anybody is using it, it’s like, no see this,” Penka says, noting that the technology is already in widespread use. As Penka explains, the World Wide Web began as a collection of web pages that were navigated with links. Now, and going forward, the web is increasingly about data and relationships among data objects.
“If you want to be in that web of data, semantic technology, linked data, BIBFRAME vocabularies, schema.org vocabularies, these are the ways you’ll speak to the web in a meaningful fashion,” Penka says.
Off the MARC
Librarians have long been aware of this shift and have even longer understood that Machine Readable Cataloging (MARC) standards, developed at the Library of Congress (LC) in the 1960s, were becoming an anachronism in an increasingly networked world.
“MARC has always been an arcane standard,” OCLC senior program officer and LJ blogger Roy Tennant wrote in a much-cited, bluntly titled LJ column “MARC Must Die” in 2002. “No other profession uses MARC or anything like it. When we shop around for software to handle such records, we are limited to the niche market of library vendors. For their part, vendors must design systems that can both take in and output records in MARC format.”
MARC has since been left further behind. In 2011, Google, Yahoo!, and Microsoft Bing jointly launched schema.org, a collection of structured data schemas that enable web designers to include information in their HTML pages that identifies entities and the relationships among them. For example, in a recent column, SearchEngineLand.com’s Barbara Starr explains how a musician’s official website could use structured data markup to specify that the musician would be performing at a specific venue, with a specific address, on a specific date. When Google crawls the web, it can read this markup, improving search results and enabling Google’s Knowledge Graph panels to display results.
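As a rough illustration of the musician example, the Python sketch below builds a schema.org MusicEvent description and wraps it in the JSON-LD script tag that a web page can embed. The performer, venue, address, and date are hypothetical placeholders, and the helper function name is an assumption made for this sketch; only the schema.org type and property names (MusicEvent, MusicGroup, Place, PostalAddress, startDate, and so on) come from the published vocabulary.

```python
import json


def music_event_jsonld(performer, venue, street, city, region, start_date):
    """Build a schema.org MusicEvent description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "MusicEvent",
        "name": f"{performer} at {venue}",
        "startDate": start_date,  # ISO 8601 date string
        # schema.org's MusicGroup type also covers solo musicians.
        "performer": {"@type": "MusicGroup", "name": performer},
        "location": {
            "@type": "Place",
            "name": venue,
            "address": {
                "@type": "PostalAddress",
                "streetAddress": street,
                "addressLocality": city,
                "addressRegion": region,
            },
        },
    }


# Hypothetical performer, venue, and date, for illustration only.
event = music_event_jsonld(
    "Example Band", "Example Hall", "123 Main St", "Denver", "CO", "2015-06-01"
)

# Markup of this shape can be placed in a page's HTML for crawlers to read.
markup = (
    '<script type="application/ld+json">\n'
    + json.dumps(event, indent=2)
    + "\n</script>"
)
print(markup)
```

A crawler that understands schema.org can read the performer, venue, address, and date from this block as separate, typed facts instead of having to infer them from the page's prose.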
In this search engine environment, archaically formatted MARC records will get short shrift. As Penka notes in another example, using Google to search by name for many of the largest libraries in North America, such as the New York Public Library (NYPL), will pull up a Knowledge Graph panel, but it essentially characterizes NYPL as a business—with a logo, a primary location, and current leadership—rather than a knowledge institution.
Enter NYPL’s name along with the title of a recent best seller, and Google will return prominently displayed, appropriate links from OverDrive and Bibliocommons, since both systems present data in a way that can be crawled by Google. While many libraries will benefit from these OverDrive links if an institution’s entire name is searched along with a book title, most users simply do not use Google this way. And even if they apply these atypically specific searches, patrons of most libraries would see a page full of links to official library Facebook posts about a title, or events—often past events—and other information unrelated to a library’s holdings.
Reframing with BIBFRAME
“What we need to do is not just talk amongst ourselves better, but we need to start communicating or formatting our data in such a way that we can be visible and seen by…other large organizations, such as the Facebooks and the Yahoo!s and the Bings and the Googles” of the world, Kevin Ford, then digital project coordinator in the Network Development and MARC Standards Office at the Library of Congress, said during a September 2014 segment of LC’s “Digital Future and You” webcast series that outlined the goals of LC’s Bibliographic Framework Initiative (BIBFRAME). “This is fundamental because so many of our users are beginning their search from these places.”
BIBFRAME had its genesis in 2011, following a test of the Resource Description and Access (RDA) descriptive cataloging standards conducted by the National Agricultural Library, the National Library of Medicine, LC, and two dozen other institutions. “Many of the libraries taking part in the test indicated that they had little confidence RDA changes [from Anglo-American Cataloguing Rules, Second Edition (AACR2)] would yield significant benefits without a change to the underlying MARC carrier,” Deanna Marcum, LC’s associate librarian for library services, wrote in “A Bibliographic Framework for the Digital Age,” an open letter explaining LC’s plans to develop a replacement for MARC. “Several of the test organizations were especially concerned that the MARC structure would hinder the separation of elements and the ability to use URLs in a linked data environment.”
Separately, the W3C Schema Bib Extend Community Group, which includes librarians, vendors, and organizations such as OCLC, has been working to expand schema.org to better represent library bibliographic information for search engines. As OCLC senior research scientist Carol Jean Godby acknowledges in the 2013 working paper “The Relationship Between BIBFRAME and OCLC’s Linked-Data Model of Bibliographic Description,” schema.org was not designed with libraries in mind.
“There is no representation of the FRBR [Functional Requirements for Bibliographic Records] Group I concepts Work, Expression, Manifestation and Item,” Godby writes. “There is no clear distinction between content and carrier. Very few relationships among creative works have been defined. There is no concept of collection or series. And there are no models of transactions involving library resources and the organizations that provide or receive them, such as libraries, universities, publishers, e-content aggregators, and data service providers.”
These are some of the reasons why LC opted to build BIBFRAME from the ground up. In May 2012, LC partnered with Zepheira to accelerate the development and launch of BIBFRAME, stating that “a major focus of the project is to translate the MARC 21 format to a Linked Data model while retaining as much as possible the robust and beneficial aspects of the historical format,” rather than, for example, augmenting schema.org to work with libraries.
Ultimately, LC plans for BIBFRAME to replace MARC and already offers a free tool to convert MARC records to BIBFRAME. But the standard’s vocabularies and the timeline are still in flux. In the September LC webcast, Beacher Wiggins, LC’s director for acquisitions and bibliographic access, said that the transition could be as few as two years or as many as five years away, but BIBFRAME “definitely is going to lead us into the linked data world.”
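To give a sense of what translating a MARC record into a linked data model involves, here is a minimal Python sketch. It is not LC's conversion tool and does not use the official BIBFRAME vocabulary: the property names (title, creator, instanceOf, isbn), the example.org namespace, and the sample control number and ISBN are all illustrative assumptions. The only MARC facts it relies on are that field 100 holds the main personal name, 245 the title statement, and 020 the ISBN.

```python
import json

# A drastically simplified MARC-like record: tag -> {subfield code: value}.
# Real MARC 21 records also carry indicators, repeated fields, and more tags.
marc_record = {
    "001": "rec0000001",  # control number (hypothetical)
    "100": {"a": "Austen, Jane,", "d": "1775-1817."},  # main entry, personal name
    "245": {"a": "Pride and prejudice /", "c": "Jane Austen."},  # title statement
    "020": {"a": "9780000000002"},  # ISBN (hypothetical placeholder)
}

BASE = "http://example.org/"  # placeholder namespace, not a real service


def strip_isbd(value):
    """Naively drop the trailing ISBD punctuation that MARC embeds in field text."""
    return value.rstrip(" /:,.;")


def marc_to_linked(record, base=BASE):
    """Split one flat record into linked Work and Instance descriptions."""
    rec_id = record["001"]
    work_uri = f"{base}work/{rec_id}"
    work = {
        "@id": work_uri,
        "@type": "Work",
        "title": strip_isbd(record["245"]["a"]),
        "creator": {"@type": "Person", "name": strip_isbd(record["100"]["a"])},
    }
    instance = {
        "@id": f"{base}instance/{rec_id}",
        "@type": "Instance",
        "instanceOf": {"@id": work_uri},  # a link by URI, not a copied string
        "isbn": record["020"]["a"],
    }
    return [work, instance]


graph = marc_to_linked(marc_record)
print(json.dumps(graph, indent=2))
```

The point of the sketch is the shape of the output: the abstract work and the published instance become separate resources with their own URIs, connected by a link, which is the kind of separation of elements that testers found difficult to achieve within the MARC structure.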
Similar to schema.org, what the BIBFRAME standard will do, Ford explained, is structure library records in a way that enables Google or other search engines to “extract meaningful information…and then do something with it. Google can tell the searcher who just got this hit where [a resource is] located, such as the D.C. Public Library. Possibly even tell you the special location, that it’s available in the Southeast branch, the Northeast branch, the MLK branch. Could possibly go even further and say, hey, here’s the call number so the searcher can walk right through the door and just find it right on the shelf.”
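The scenario Ford describes can be sketched from the consumer's side. The Python example below assumes a library page exposes a holdings record in schema.org terms, with each copy modeled as an Offer whose availableAtOrFrom names the branch and whose sku carries the call number. That modeling choice, the sample title, the call number, and the function name are assumptions made for illustration, not BIBFRAME itself or any search engine's actual processing.

```python
import json

# Hypothetical structured record that a library catalog page might expose.
record_json = """
{
  "@context": "https://schema.org",
  "@type": "Book",
  "name": "Example Title",
  "offers": [
    {"@type": "Offer",
     "availability": "https://schema.org/InStock",
     "availableAtOrFrom": {"@type": "Library", "name": "Southeast branch"},
     "sku": "FIC EXA"},
    {"@type": "Offer",
     "availability": "https://schema.org/OutOfStock",
     "availableAtOrFrom": {"@type": "Library", "name": "Northeast branch"},
     "sku": "FIC EXA"}
  ]
}
"""


def available_copies(jsonld_text):
    """Return (branch name, call number) pairs for copies marked as in stock."""
    record = json.loads(jsonld_text)
    copies = []
    for offer in record.get("offers", []):
        if offer.get("availability", "").endswith("InStock"):
            copies.append((offer["availableAtOrFrom"]["name"], offer["sku"]))
    return copies


for branch, call_number in available_copies(record_json):
    print(f"{branch}: {call_number}")
# prints "Southeast branch: FIC EXA"
```

Because the branch, availability, and call number are labeled fields, a search engine or any other consumer can answer "where can I get this?" without scraping the catalog's display pages.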
Restructuring the budget, too?
Jeremy Nelson, metadata and systems librarian for Colorado College in Colorado Springs, notes that a structured data approach that works well with commercial search engines has the potential not only to transform the way library content is discovered but also to transform the way libraries spend on data management.
A commercial software developer prior to becoming a librarian, Nelson has been researching linked data and its potential applications in libraries since 2008. He is currently working with the University of Denver, Colorado State University, and the Colorado Alliance of Research Libraries on the Redis Library Services Platform. The platform combines an open source, Redis-based BIBFRAME data store that he developed with Colorado College’s Aristotle Library Apps, a set of open source, lightweight HTML5 mobile apps built on Django for discovering, accessing, and managing library information. Colorado College’s Tutt Library eventually plans to replace its ILS (integrated library system) with this platform.
Linked data can produce better search results, “and that’s probably the most end user, practical reason that I think libraries should support linked data,” Nelson says. “But I’ve been really frustrated as a librarian at small libraries, or even at large libraries, with the complexity and cost of these expensive enterprise ILSes…. My own research is trying to see how we can use linked data to replace these, really to me, archaic and hard-to-use legacy systems.”
Libhub: a BIBFRAME test case
Zepheira, which LC contracted to develop BIBFRAME, last year announced the Libhub Initiative, a proof-of-concept project that aims to build a network of libraries using BIBFRAME standards to link data between institutions and other resources, in order to illustrate how these standards can increase the visibility of library resources on the open web. Modern structured data markup, and links between institutions regarding that data, will help search engines confirm the authoritative accuracy of references to books, people, places, events, and electronic resources by the libraries using these standards, thus elevating library rankings in general search results, and potentially featuring library resources in structured data–dependent Google Knowledge Graph panels.
“I find the opportunities around BIBFRAME exciting, because the promise goes well beyond simply replacing MARC behind the scenes…to empower[ing] libraries to speak in a unified voice that the web can understand,” Eric Miller, Zepheira cofounder and former leader of the World Wide Web Consortium’s (W3C) Semantic Web Initiative at the Massachusetts Institute of Technology, says in a webcast explaining Libhub. “The Libhub Initiative is focused on one goal: How can libraries, as a community, surface data in a meaningful, connected way that raises web visibility?”
Vailey Oehlke, director of libraries for Multnomah County, Portland, OR, which has shown early interest in BIBFRAME and Libhub, says that “this linked data format exposes [library resources] more broadly to the public.” Like Nelson, she adds that the standards could help ensure a library’s control over its own records, rather than entrusting that data to vendors.
“One of the things that informs our thinking, in large part, is the degree to which we’ll have control over our data,” she says. “And as is the case with most libraries, increasingly our data, access to it, and our ability to control it [are] more and more in the hands of vendors, whose bottom line, understandably, is their revenue. We simply have a different motivation, different needs. An opportunity, such as Libhub, that, on the face of it, appears to be not profit motivated but really about exposing that data…is very attractive to us.”
The Denver Public Library was the first public library to join the Libhub Initiative, and as collection services manager Rachel Fewell explains, “We want to be able to move forward and meet library customers where they are [on the open web]. We don’t want potential library customers to have to come to our website to understand what we are providing…. We work really hard on creating excellent catalog records, and we want to expose them to as many people as we can, so we saw this as the next logical step…. Why hide our hard work in MARC format that’s not available to the vast majority of the Internet?”
Are you ready?
Zepheira also recently launched a Linked Data Readiness Assessment product to help libraries analyze “the state of their current collections, systems, and staffing to determine what should be done to make them more visible to their communities,” as well as a “Practical Practitioner” online education platform to help library leaders learn about the new standard and understand concepts of linked data more broadly.
The five-week online curriculum includes webcasts, hands-on exercises, regular “office hours” during which participants can discuss coursework with Zepheira executives and experts, community forums where participants can chat about coursework among themselves, and, following completion of the coursework, permanent membership in an online “alumni community” offering continued access to these resources and contacts.
Individual list price for the Practical Practitioner course is about $1,400, although Zepheira also negotiates group discounts for institutions that wish to have multiple staff complete the program.