November 20, 2017

OCLC Works Toward Linked Data Environment | ALA Midwinter 2015

In the early days of the World Wide Web, many libraries blocked search engines from crawling their sites and used other means to insulate library records from the open web. “Some of the reasons were cultural, some of them were fears of commercial takeover of our resources…but we actively disengaged,” said Richard Wallis, technology evangelist for OCLC and chair of the W3C Schema Bib Extend Community Group, during the “OCLC Links and Entities: The Library Data Revolution” session at the American Library Association’s 2015 Midwinter conference in Chicago.

Since then, despite several projects that have started to share library data with the wider web, “a lot of that reticence to engage, historically, is still there,” Wallis added. Yet the library community is increasingly aware of the need to enhance the discoverability of library resources on the open web using Google and other popular commercial search engines.

In an effort that parallels and complements the Library of Congress’s Bibliographic Framework Initiative (BIBFRAME), OCLC is using schema.org and linked data to help facilitate this discoverability, explained Wallis and his colleague Ted Fons, executive director, Data Services and WorldCat Quality Management for OCLC.

The “Web of Data”

Using embedded metadata or microdata to optimize search engine rankings and drive relevant traffic to websites has become an increasingly common practice during the past decade. The better a search engine crawler or algorithm can verify what a web page is about, the better that page’s ranking within search results.

In 2011, Google, Yahoo!, and Microsoft Bing jointly announced the launch of schema.org, a structured data markup vocabulary that enables webmasters to nest metadata within HTML5 web pages, using microdata tags that all three search engines will recognize (see a summary of an example from the schema.org website below). The standardized system has accelerated the adoption of structured data markup, and has enabled new search engine features, such as Google’s “Knowledge Graph” panels, which display basic information about entities such as people, places, and events alongside search results.

Enter Entities

“Instead of describing a page or showing you a page of information as a [search] result, [structured data markup] is starting to describe a thing on the planet,” Wallis explained.

The transition from a World Wide Web of linked pages toward a “Web of Data” is well underway, but library data—in particular library catalog data—is being left behind.

“We have a web of data that is forming,” Wallis said, displaying a PowerPoint slide with a visual depiction of complex, linked relationships between various entities. “All of these Webs of Data pictures start to look a bit galactic to me, but to follow that analogy, we’re in a position at the moment where we’re in danger of a library-shaped black hole appearing in this Web of Data. We’re not very visible on the web at all.”

In this environment, the MARC (MAchine-Readable Cataloging) standards that libraries have used for forty years to format cataloging records for computer systems present two key problems in terms of discoverability. First, MARC does not structure metadata about library resources in a way that modern search engines—or any non-library software—will recognize.

Second, Wallis distinguished between MARC records and the way in which structured data markup defines “entities.” A MARC record will contain a lot of information about a specific resource, including the title, author, and publisher of a book, for example. Yet a single MARC record is essentially isolated. Like an old card catalog record, it points to a single resource. By contrast, using structured data to define an entity, such as an author, enables a computer system to link that author to other entities—other books he or she has written, other people he or she is associated with, places, etc.—via uniform resource identifiers (URIs) and web addresses where related structured data is present.
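The contrast between an isolated record and linked entities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the example.org URIs, the sample titles, and the works_by helper are all hypothetical and do not come from OCLC, MARC, or schema.org.

```python
# Hypothetical URIs under example.org, invented for this illustration.
AUTHOR = "http://example.org/person/1"
BOOK_A = "http://example.org/work/10"
BOOK_B = "http://example.org/work/11"

# Record-style description: the author is only a text string,
# so nothing connects this record to any other record.
record = {"title": "First Book", "author": "Doe, Jane", "publisher": "Example Press"}

# Entity-style descriptions: every thing has its own URI, and a
# property value can be the URI of another entity.
entities = {
    AUTHOR: {"type": "Person", "name": "Jane Doe"},
    BOOK_A: {"type": "Book", "name": "First Book", "author": AUTHOR},
    BOOK_B: {"type": "Book", "name": "Second Book", "author": AUTHOR},
}


def works_by(author_uri, graph):
    """Return the names of all entities whose 'author' property is author_uri."""
    return sorted(
        props["name"]
        for props in graph.values()
        if props.get("author") == author_uri
    )


print(works_by(AUTHOR, entities))
```

Because both books point at the same author URI rather than repeating a name string, software can follow the link in either direction, which is the kind of connection a standalone record does not offer.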

Schema.org, OCLC, and BIBFRAME

For library data, Schema.org does have shortcomings. As OCLC Senior Research Scientist Carol Jean Godby notes in “The Relationship between BIBFRAME and OCLC’s Linked-Data Model of Bibliographic Description: A Working Paper,” Schema.org has defined very few relationships among creative works; there is no way to describe collections or series; there is no way to represent the Functional Requirements for Bibliographic Records (FRBR) concepts of “work,” “expression,” “manifestation,” and “item”; and “there are no models of transactions involving library resources and the organizations that provide or receive them, such as libraries, universities, publishers, e-content aggregators, and data service providers.”

In terms of a cataloging standard, schema.org’s basic microdata tags offer a coarse level of granularity comparable to a Dublin Core record, Godby explains. Catalogers can record information such as “creative work, person, author, director, place, organization, publisher, copyright date, book, ISBN, and so on.”

The structured data efforts of the Library of Congress and OCLC diverge, if only somewhat. With BIBFRAME, the Library of Congress has opted to address the shortcomings of schema.org by starting from the ground up, building a model that can replicate the data contained in MARC records and expose it using linked data techniques. Meanwhile, with Google, Bing, and Yahoo! supporting schema.org, OCLC views the standard as vital if libraries want to see their content exposed on the open web. The organization has been working to expand Schema.org vocabularies to better accommodate library data. As Godby explains, Schema.org’s designers “never intended it to be a global ontology. Rather, Schema.org is proposed as a starting point, or a scaffolding to which a more detailed ontology designed by interested communities of practice would be attached.”

To that end, OCLC has been centrally involved with the World Wide Web Consortium’s (W3C) Schema Bib Extend Community Group, which aims to expand the schema.org vocabulary to include additional bibliographic information and other data used by libraries.

In “Common Ground: Exploring Compatibilities Between the Linked Data Models of the Library of Congress and OCLC,” a paper published in January, Godby collaborated with Ray Denenberg, an executive at the Library of Congress. The authors note that the BIBFRAME initiative and OCLC’s efforts with schema.org share several similarities, although OCLC is primarily focused on improved visibility on the web, while BIBFRAME aims to address the long-term curation needs of libraries.

The Library of Congress has “a bottom-up approach, starting off with…very detailed MARC data that’s there not only for discovery purposes, but for curation purposes, and how would we expose that using linked data techniques,” Wallis said. “Whereas, from a schema.org point of view, we have a general purpose vocabulary, which, with some lobbying, [OCLC has] made even more library friendly.” Wallis added that the Godby / Denenberg paper “is a very intriguing document, and demonstrates how compatible these approaches are with one another.”

Schema.org Simplified

The same entry with schema.org microdata tags would look like this:

The HTML example only tells a web browser how to display a few lines of human-readable information, with a hyperlink to another page on a server. But this markup does not give a search engine any machine-readable information about what the text means. Adding the schema.org “itemscope” attribute notifies a search engine that the information within a <div> tag container is about a specific entity. The “itemtype” attribute defines the type of entity being described, such as “movie” or “person.” And the “itemprop” attribute enables webmasters to include additional information regarding different properties of that entity, such as a person’s name or birthdate, or the genre of a movie.
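To make the roles of the three attributes concrete, here is a minimal Python sketch of how software might read them. It is not how any search engine actually processes pages. The sample markup is an assumption modeled on the kind of movie entry the paragraph describes rather than the example that originally accompanied this article, and the MicrodataParser class is a hypothetical helper built on Python's standard html.parser module. It handles only simple cases: text-valued properties and nested items, with no void elements.

```python
from html.parser import HTMLParser

# Illustrative markup (an assumption, not the article's original example).
SAMPLE = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span>
    (born <span itemprop="birthDate">August 16, 1954</span>)
  </div>
  <span itemprop="genre">Science fiction</span>
</div>
"""


class MicrodataParser(HTMLParser):
    """Collect itemscope/itemtype/itemprop data into nested dictionaries."""

    def __init__(self):
        super().__init__()
        self.top_level = []   # items that are not a property of another item
        self.item_stack = []  # items whose element is still open
        self.open_tags = []   # kind of each open element: item, prop, or other
        self.capture = None   # (property name, text chunks) while inside a prop

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if "itemscope" in attrs:
            # itemscope starts a new entity; itemtype says what kind it is.
            item = {"type": attrs.get("itemtype"), "props": {}}
            if prop and self.item_stack:
                self.item_stack[-1]["props"][prop] = item
            else:
                self.top_level.append(item)
            self.item_stack.append(item)
            self.open_tags.append("item")
        elif prop and self.item_stack:
            # itemprop alone: the element's text becomes the property value.
            self.capture = (prop, [])
            self.open_tags.append("prop")
        else:
            self.open_tags.append("other")

    def handle_data(self, data):
        if self.capture is not None:
            self.capture[1].append(data)

    def handle_endtag(self, tag):
        if not self.open_tags:
            return
        kind = self.open_tags.pop()
        if kind == "item":
            self.item_stack.pop()
        elif kind == "prop":
            prop, chunks = self.capture
            self.item_stack[-1]["props"][prop] = "".join(chunks).strip()
            self.capture = None


parser = MicrodataParser()
parser.feed(SAMPLE)
movie = parser.top_level[0]
print(movie["type"], movie["props"]["name"])
print(movie["props"]["director"]["props"])
```

The labels and punctuation outside the itemprop elements ("Director:", "born") are ignored by the parser, which mirrors the point above: the human-readable text stays the same, while the attributes carry the machine-readable meaning.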

About Matt Enis

Matt Enis (menis@mediasourceinc.com; @matthewenis on Twitter) is Senior Editor, Technology for Library Journal.
