June 28, 2017

Ending the Invisible Library | Linked Data

To explain the utility of semantic search and linked data, Jeff Penka, director of channel and product development for information management solutions provider Zepheira, uses a simple exercise. Type “Chevy Chase” into Google’s search box, and in addition to a list of links, a panel appears on the right of the screen, displaying photos of the actor, a short bio, date of birth, height, full name, spouses and children, and a short list of movies and TV shows in which he has starred. Continue typing the letters “ma” into the search box, and the panel instantly changes, showing images, maps, current weather, and other basic information regarding the town of Chevy Chase, MD.

The panels are powered by Google’s Knowledge Graph, a massive knowledgebase that launched in May 2012 with “more than 500 million [data] objects” drawn from sources including Freebase, Wikipedia, and the CIA World Factbook, “as well as more than 3.5 billion facts about and relationships between these different objects. And it’s based on what people search for, and what we find out on the web,” Amit Singhal, Google’s senior VP of engineering, wrote in an official blog post introducing the service. As the Knowledge Graph has grown, these panels have become an increasingly common feature that not only anticipates what users are searching for but also displays specific facts about people, places, historical events, and other topics in a convenient, quickly accessible format.

“As I talk to people about semantic technologies, and they ask whether it’s practical or whether anybody is using it, it’s like, no see this,” Penka says, noting that the technology is already in widespread use. As Penka explains, the World Wide Web began as a collection of web pages that were navigated with links. Now, and going forward, the web is increasingly about data and relationships among data objects.

“If you want to be in that web of data, semantic technology, linked data, BIBFRAME vocabularies, schema.org vocabularies, these are the ways you’ll speak to the web in a meaningful fashion,” Penka says.

Off the MARC

Librarians have long been aware of this shift and have even longer understood that Machine Readable Cataloging (MARC) standards, developed at the Library of Congress (LC) in the 1960s, were becoming an anachronism in an increasingly networked world.

“MARC has always been an arcane standard,” OCLC senior program officer and LJ blogger Roy Tennant wrote in a much-cited, bluntly titled LJ column “MARC Must Die” in 2002. “No other profession uses MARC or anything like it. When we shop around for software to handle such records, we are limited to the niche market of library vendors. For their part, vendors must design systems that can both take in and output records in MARC format.”

MARC has since been left further behind. In 2011, Google, Yahoo!, and Microsoft Bing announced a joint initiative called schema.org, built around a collection of structured data schemas that enable web designers to include information in their HTML pages that identifies entities and relationships among entities. For example, in a recent column, SearchEngineLand.com’s Barbara Starr explains how a musician’s official website could use structured data markup to specify that the musician would be performing at a specific venue, with a specific address, on a specific date. When Google crawls the web, it can read this markup, improving search results and enabling Google’s Knowledge Graph panels to display those details.
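To make the musician example concrete, here is a minimal sketch of what such markup might look like when generated in Python. It assumes the JSON-LD syntax, which is only one of several ways to embed schema.org data in a page, and every name, address, and date below is a made-up placeholder rather than anything taken from Starr’s column.

```python
import json


def music_event_jsonld(performer, venue, street, city, region, start_date):
    """Build a schema.org MusicEvent description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "MusicEvent",
        "name": f"{performer} at {venue}",
        "startDate": start_date,  # ISO 8601 date string
        "performer": {"@type": "MusicGroup", "name": performer},
        "location": {
            "@type": "MusicVenue",
            "name": venue,
            "address": {
                "@type": "PostalAddress",
                "streetAddress": street,
                "addressLocality": city,
                "addressRegion": region,
            },
        },
    }


def as_script_tag(data):
    """Wrap JSON-LD in the script element a web page would carry."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values, invented for illustration only.
event = music_event_jsonld(
    "The Example Band", "Example Hall", "123 Main St", "Denver", "CO", "2015-03-14"
)
tag = as_script_tag(event)
print(tag)
```

The point of the sketch is that the performer, the venue, the address, and the date are each labeled as distinct entities, so a crawler does not have to guess them from free text.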

In this search engine environment, archaically formatted MARC records will get short shrift. As Penka notes in another example, using Google to search by name for many of the largest libraries in North America, such as the New York Public Library (NYPL), will pull up a Knowledge Graph panel, but it essentially characterizes NYPL as a business—with a logo, a primary location, and current leadership—rather than a knowledge institution.

Enter NYPL’s name along with the title of a recent best seller, and Google will return prominently displayed, appropriate links from OverDrive and BiblioCommons, since both systems present data in a way that can be crawled by Google. While many libraries will benefit from these OverDrive links if an institution’s entire name is searched along with a book title, most users simply do not use Google this way. And even if they apply these atypically specific searches, patrons of most libraries would see a page full of links to official library Facebook posts about a title, or events (often past events), and other information unrelated to a library’s holdings.

Reframing with BIBFRAME

“What we need to do is not just talk amongst ourselves better, but we need to start communicating or formatting our data in such a way that we can be visible and seen by…other large organizations, such as the Facebooks and the Yahoo!s and the Bings and the Googles” of the world, Kevin Ford, then digital project coordinator in the Network Development and MARC Standards Office at the Library of Congress, said during a September 2014 segment of LC’s “Digital Future and You” webcast series that outlined the goals of LC’s Bibliographic Framework Initiative (BIBFRAME). “This is fundamental because so many of our users are beginning their search from these places.”

BIBFRAME had its genesis in 2011, following a test of the Resource Description and Access (RDA) descriptive cataloging standards conducted by the National Agricultural Library, the National Library of Medicine, LC, and two dozen other institutions. “Many of the libraries taking part in the test indicated that they had little confidence RDA changes [from Anglo-American Cataloguing Rules, Second Edition (AACR2)] would yield significant benefits without a change to the underlying MARC carrier,” Deanna Marcum, LC’s associate librarian for library services, wrote in “A Bibliographic Framework for the Digital Age,” an open letter explaining LC’s plans to develop a replacement for MARC. “Several of the test organizations were especially concerned that the MARC structure would hinder the separation of elements and the ability to use URLs in a linked data environment.”

Separately, the W3C Schema Bib Extend Community Group, which includes librarians, vendors, and organizations such as OCLC, has been working to expand schema.org to better represent library bibliographic information for search engines (see LJ’s recent coverage of OCLC’s efforts, and the comments from Schema BibEx members Richard Wallis and Dan Scott below). As OCLC senior research scientist Carol Jean Godby acknowledges in the 2013 working paper “The Relationship Between BIBFRAME and OCLC’s Linked-Data Model of Bibliographic Description,” schema.org was not designed with libraries in mind.

“There is no representation of the FRBR [Functional Requirements for Bibliographic Records] Group I concepts Work, Expression, Manifestation and Item,” Godby writes. “There is no clear distinction between content and carrier. Very few relationships among creative works have been defined. There is no concept of collection or series. And there are no models of transactions involving library resources and the organizations that provide or receive them, such as libraries, universities, publishers, e-content aggregators, and data service providers.”
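For readers unfamiliar with the four FRBR Group I concepts Godby names, the nesting can be sketched as a small data structure. This is only an illustration: the field names are hypothetical simplifications, not FRBR attribute definitions, and the sample values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """One concrete copy, such as a single volume on a shelf."""
    barcode: str
    location: str


@dataclass
class Manifestation:
    """One published embodiment (the carrier) of an expression."""
    publisher: str
    year: int
    carrier: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """One realization of a work, such as a translation (the content)."""
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract intellectual or artistic creation."""
    title: str
    creator: str
    expressions: List[Expression] = field(default_factory=list)


def count_items(work):
    """Count every physical or digital copy that descends from a work."""
    return sum(
        len(m.items) for e in work.expressions for m in e.manifestations
    )


# Placeholder publishers, years, and barcodes, invented for illustration.
work = Work(
    title="Moby-Dick",
    creator="Herman Melville",
    expressions=[
        Expression(
            language="English",
            manifestations=[
                Manifestation("Example Press", 2001, "print",
                              [Item("0001", "Main"), Item("0002", "Branch")]),
                Manifestation("Example Digital", 2010, "ebook",
                              [Item("0003", "Online")]),
            ],
        )
    ],
)
print(count_items(work))
```

The split between `Expression` and `Manifestation` is the content-versus-carrier distinction the quote refers to.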

These are some of the reasons why LC opted to build BIBFRAME from the ground up. In May 2012, LC partnered with Zepheira to accelerate the development and launch of BIBFRAME, stating that “a major focus of the project is to translate the MARC 21 format to a Linked Data model while retaining as much as possible the robust and beneficial aspects of the historical format,” rather than, for example, augmenting schema.org to work with libraries.
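As a rough illustration of what translating a MARC record into a linked data model involves, the sketch below turns a heavily simplified, hypothetical record into subject-predicate-object statements. It is not LC’s converter and does not use the actual BIBFRAME vocabulary: the `http://example.org/` namespace, the property names, and the sample values are all assumptions made for the example. Only the MARC tags themselves (001 control number, 100 main entry personal name, 245 title statement, 020 ISBN) are standard.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/vocab/"  # placeholder namespace, NOT the BIBFRAME vocabulary

# Hypothetical, simplified stand-in for a MARC record: tag -> subfields.
marc_record = {
    "001": "example0001",
    "100": {"a": "Melville, Herman"},
    "245": {"a": "Moby-Dick"},
    "020": {"a": "9780000000002"},  # made-up ISBN
}


def marc_to_triples(record, base_uri):
    """Split one flat record into a Work and an Instance joined by a link."""
    record_id = record["001"]
    work = f"{base_uri}work/{record_id}"
    instance = f"{base_uri}instance/{record_id}"
    triples = [
        (work, RDF_TYPE, EX + "Work"),
        (instance, RDF_TYPE, EX + "Instance"),
        (instance, EX + "instanceOf", work),
    ]
    # (MARC tag, subfield, subject the value describes, illustrative property)
    field_map = [
        ("245", "a", work, "title"),
        ("100", "a", work, "creatorName"),
        ("020", "a", instance, "isbn"),
    ]
    for tag, code, subject, prop in field_map:
        value = record.get(tag, {}).get(code)
        if value:
            triples.append((subject, EX + prop, value))
    return triples


triples = marc_to_triples(marc_record, "http://example.org/")
for triple in triples:
    print(triple)
```

The design point is that each resource gets its own URL, so other datasets can link to the work or the edition directly rather than to a whole record.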

Ultimately, LC plans for BIBFRAME to replace MARC and already offers a free tool to convert MARC records to BIBFRAME. But the standard’s vocabularies and the time line are still in flux. In the September LC webcast, Beacher Wiggins, LC’s director for acquisitions and bibliographic access, said that the transition could be as few as two years or as many as five years away, but BIBFRAME “definitely is going to lead us into the linked data world.”

Similar to schema.org, what the BIBFRAME standard will do, Ford explained, is structure library records in a way that enables Google or other search engines to “extract meaningful information…and then do something with it. Google can tell the searcher who just got this hit where [a resource is] located, such as the D.C. Public Library. Possibly even tell you the special location, that it’s available in the Southeast branch, the Northeast branch, the MLK branch. Could possibly go even further and say, hey, here’s the call number so the searcher can walk right through the door and just find it right on the shelf.”
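The kind of holdings data Ford describes, with a branch and a call number for each copy, could be expressed in several ways. The sketch below uses schema.org’s `Offer` type for each holding, and it assumes the convention of storing the call number in the `sku` property; BIBFRAME itself would model this with its own vocabulary. The book details and call number are placeholders, and the branch names simply echo the ones Ford mentions.

```python
import json

IN_STOCK = "https://schema.org/InStock"
OUT_OF_STOCK = "https://schema.org/OutOfStock"


def holding_offer(branch_name, call_number, available):
    """Describe one copy at one branch as a schema.org Offer."""
    return {
        "@type": "Offer",
        "availableAtOrFrom": {"@type": "Library", "name": branch_name},
        "sku": call_number,  # assumed convention: call number carried in sku
        "availability": IN_STOCK if available else OUT_OF_STOCK,
    }


def book_with_holdings(title, author, holdings):
    """Attach a list of (branch, call number, available) holdings to a Book."""
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "author": {"@type": "Person", "name": author},
        "offers": [holding_offer(*h) for h in holdings],
    }


def branches_with_copy(book):
    """Return the branches where a copy is currently on the shelf."""
    return [
        offer["availableAtOrFrom"]["name"]
        for offer in book["offers"]
        if offer["availability"] == IN_STOCK
    ]


# Placeholder title, author, and call number, invented for illustration.
book = book_with_holdings(
    "An Example Title",
    "Jane Example",
    [
        ("Southeast branch", "FIC EXAMPLE", True),
        ("Northeast branch", "FIC EXAMPLE", False),
        ("MLK branch", "FIC EXAMPLE", True),
    ],
)
print(json.dumps(book, indent=2))
print(branches_with_copy(book))
```

A search engine that reads data shaped like this has everything it needs to say which branch holds a copy and where on the shelf to find it.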

Restructuring the budget, too?

Jeremy Nelson, metadata and systems librarian for Colorado College in Colorado Springs, notes that a structured data approach that works well with commercial search engines has the potential not only to transform the way library content is discovered but also to transform the way libraries spend on data management.

A commercial software developer prior to becoming a librarian, Nelson has been researching linked data and its potential applications in libraries since 2008. He is currently working with the University of Denver, Colorado State University, and the Colorado Alliance of Research Libraries on the Redis Library Services Platform, which combines an open source, Redis-based BIBFRAME data store that he developed with Colorado College’s Aristotle Library Apps, a set of open source, lightweight HTML5 Django mobile apps for discovering, accessing, and managing library information. Colorado College’s Tutt Library eventually plans to replace its integrated library system (ILS) with this platform.

Linked data can produce better search results, “and that’s probably the most end user, practical reason that I think libraries should support linked data,” Nelson says. “But I’ve been really frustrated as a librarian at small libraries, or even at large libraries, with the complexity and cost of these expensive enterprise ILSes…. My own research is trying to see how we can use linked data to replace these, really to me, archaic and hard-to-use legacy systems.”

Libhub: a BIBFRAME test case

Zepheira, which LC contracted to develop BIBFRAME, last year announced the Libhub Initiative, a proof-of-concept project that aims to build a network of libraries using BIBFRAME standards to link data between institutions and other resources. The goal is to illustrate how these standards can increase the visibility of library resources on the open web. Modern structured data markup, and links between institutions regarding that data, will help search engines confirm the authoritative accuracy of references to books, people, places, events, and electronic resources by the libraries using these standards. That, in turn, should elevate library rankings in general search results and potentially feature library resources in Google Knowledge Graph panels, which depend on structured data.

“I find the opportunities around BIBFRAME exciting, because the promise goes well beyond simply replacing MARC behind the scenes…to empower[ing] libraries to speak in a unified voice that the web can understand,” Eric Miller, Zepheira cofounder and former leader of the World Wide Web Consortium’s (W3C) Semantic Web Initiative at the Massachusetts Institute of Technology, says in a webcast explaining Libhub. “The Libhub Initiative is focused on one goal: How can libraries, as a community, surface data in a meaningful, connected way that raises web visibility?”

Vailey Oehlke, director of libraries for Multnomah County, Portland, OR, which has shown early interest in BIBFRAME and Libhub, says that “this linked data format exposes [library resources] more broadly to the public.” Like Nelson, she adds that the standards could help ensure a library’s control over its own records, rather than entrusting that data to vendors.

“One of the things that informs our thinking, in large part, is the degree to which we’ll have control over our data,” she says. “And as is the case with most libraries, increasingly our data, access to it, and our ability to control it [are] more and more in the hands of vendors, whose bottom line, understandably, is their revenue. We simply have a different motivation, different needs. An opportunity, such as Libhub, that, on the face of it, appears to be not profit motivated but really about exposing that data…is very attractive to us.”

The Denver Public Library was the first public library to join the Libhub Initiative, and as collection services manager Rachel Fewell explains, “We want to be able to move forward and meet library customers where they are [on the open web]. We don’t want potential library customers to have to come to our website to understand what we are providing…. We work really hard on creating excellent catalog records, and we want to expose them to as many people as we can, so we saw this as the next logical step…. Why hide our hard work in MARC format that’s not available to the vast majority of the Internet?”

Are You Ready?

Zepheira also recently launched a Linked Data Readiness Assessment product to help libraries analyze “the state of their current collections, systems, and staffing to determine what should be done to make them more visible to their communities,” as well as a “Practical Practitioner” online education platform to help library leaders learn about the new standard and understand concepts of linked data more broadly.

The five-week online curriculum includes webcasts, hands-on exercises, regular “office hours” during which participants can discuss coursework with Zepheira executives and experts, community forums where participants can chat about coursework among themselves, and, following completion of the coursework, permanent membership in an online “alumni community” offering continued access to these resources and contacts.

Individual list price for the Practical Practitioner course is about $1,400, although Zepheira also negotiates group discounts for institutions that wish to have multiple staff complete the program.

This article was published in Library Journal.

About Matt Enis

Matt Enis (menis@mediasourceinc.com; @matthewenis on Twitter) is Associate Editor, Technology for Library Journal.

Comments

  1. Igor Goldkind says:

    Excellent article accenting the positive impact of Linked Data systems.

  2. If by “A contingent of librarians, led by OCLC, have advocated for the use of schema.org as a foundation for the replacement of MARC, since these schemas are already supported by Google and other search engines” you mean the W3 Schema.org Bibliographic Extension Community Group (often referred to as “schemabibex”), I would suggest that is a gross mischaracterization of the group’s intent.

    schemabibex was not interested in replacing MARC with schema.org; rather, we were/are interested in figuring out how to best represent bibliographic information based on whatever descriptive format (MARC, ONYX, DC, EAD, etc) in schema.org for the purposes of improving discovery in general search engines, and in developing and proposing additions to the schema.org vocabulary that would better enable it to represent bibliographic information.

    Towards that end, I built schema.org support into the Evergreen, Koha, and VuFind open source library systems as a proof of concept. We used the Product/Offer model that schema.org adopted from the GoodRelations vocabulary to represent holdings, and we addressed the lack of vocabulary around periodicals, volumes, and issues, as well as the need for contains / containedIn and workExample / exampleOfWork relationships that fulfill a large number of FRBR. Current versions of Evergreen now express linked data by linking to Library of Congress, WorldCat, and OCLC Work Entities using sameAs and exampleOfWork relationships.

    For those interested in more information on using schema.org to represent bibliographic information for discovery in general purpose search engines, my SWIB 2014 preconference contains a set of self-directed tutorials: http://stuff.coffeecode.net/2014/swib_preconference/ — and I addressed the broader problem of simply getting visibility for libraries themselves in the local search option of Google, Bing, and Yahoo last month in my presentation at the OLA SuperConference: http://stuff.coffeecode.net/2015/ola_white_hat_seo/

    • Hi Dan,

      Thank you for your comment. I gave a more detailed explanation to Richard Wallis below, but I apologize that my brief mention mischaracterized the efforts and the organization of schemabibex. Reworded the graf and anchored a link directly to your comment within the body of the feature.

      -Matt

  3. With all due respect to Mr. Wiggins, there are at least two quotes in this article that should be attributed to Kevin Ford when he spoke ‘during a September 2014 segment of LC’s “Digital Future and You” webcast series’ [1].

    Just sayin.

    [1] http://www.loc.gov/today/cyberlc/feature_wdesc.php?rec=6401

    • Hi Kevin,

      I apologize for the misattribution. Looks like it was correct in my notes and I messed up when putting the story together. I’ve corrected it online (leaving one comment attributed to Beacher) and we will issue a correction in our next print edition. Thank you for pointing that out, and again I apologize for the error.

      -Matt

    • not surprising

  4. As Chair of the W3C Schema Bib Extend Community Group[1], I concur with the previous comments made by Dan Scott. The characterisation of OCLC leading a contingent of libraries to advocate… is misleading.

    Although employed as Technology Evangelist for OCLC, I set up the bibliographic community wide Group as a broad group with the objective of “discuss[ing] and prepare proposal(s) for extending Schema.org schemas for the improved representation of bibliographic information markup and sharing”. The group grew globally to include over eighty people including representatives from ILS vendors, publishers, aggregators, national libraries, research & other academic institutions.

    The group had several significant proposals accepted and included in the Schema.org vocabulary, as Dan describes. The result being that the initial limitations in using Schema.org for describing bibliographic resources, you highlight in both this and your previous article, from Jean Godby’s two year old report, are now not issues.

    As you correctly imply in the theme of these two articles, Linked Data and other Semantic Web technologies, when applied to the library domain, have the potential to deliver many benefits – from data quality improvements, distributed yet interconnected authority management, cataloguing efficiencies, more meaningful analytics and statistics, etc. The most obvious and powerful for the recognition of libraries and their resources, however, is in the area of discovery in the web – becoming part of the ubiquitous Knowledge Graph you describe.

    Linked Data and the many Linked Data vocabularies used by library pioneers in this space are leading towards the realisation of these benefits. Utilising Schema.org as a vocabulary for library domain linked data, as demonstrated by Dan Scott’s work with open source library systems, and its use in OCLC’s WorldCat.org and the publication of WorldCat Work and other up coming entities[2], is key to being recognised, visible to and referenced by, the search engines.

    As recognised by many commercial and other sectors behind the millions of internet domains that include Schema.org markup on their sites, you need to share data using the vocabulary that the search engines are looking for to get noticed. I suspect there are few libraries that would not see better web visibility for them and their resources as a good thing.

    I believe in today’s web that the use of Schema.org by libraries is a necessity to gain that visibility. However it does not negate the use and combination of other bibliographically focused vocabularies in the management and exchange of library specific data. One of the joys of linked data technologies is that they enable the use of more than one vocabulary at a time to satisfy various needs. A different way of thinking for those used to using a single data format to do everything, liberating and worth it for the benefits that it could bring.

    [1] https://www.w3.org/community/schemabibex/
    [2] http://www.oclc.org/data

    • Hi Richard

      I enjoyed your presentation at midwinter. Thank you for the clarification. The comments from Godby’s white paper were intended to illustrate, for a general audience (i.e. non-catalogers), why the Library of Congress chose to create BIBFRAME. The very brief mention of the work that OCLC and others have been doing with schema.org was not intended to be dismissive, and I apologize if my oversimplification mischaracterized those efforts. I will adjust that graf in the online version here and point to the comments from you and from Dan Scott.

      We posted coverage of your MW panel on schema.org a week ahead of the online publication of this story partly to ensure that regular readers had a bit of a more in-depth look at those efforts, and to link to that coverage within this feature. I recognize that that story is far from comprehensive, but this is certainly an area where LJ plans to continue coverage. Please be in touch, and thanks again for your comments!

  5. HI,

    Thanks for this informative article. It’s important to end invisible library linked data.

  6. This article is very good