September 22, 2017

(Meta)search Like Google

The time has come for libraries, too, to negotiate for rights to index full text

The ability to search and receive results in more than one database through a single interface, known as metasearch, is something many of our users want. Google Scholar, the search engine for specifically scholarly content, and library metasearch products like Ex Libris’s MetaLib, Serials Solutions’ Central Search, WebFeat, and products based on MuseGlobal, used by both academic and public libraries, are all means of providing this functionality. At the university where I work, without very much local advertising, Google Scholar has become the largest single source of links to our link resolver product, illustrating how hungry users are for metasearch.

Some librarians are concerned that metasearch is dumbed down and less powerful than native searching. However, it seems perfectly reasonable that many users want to learn and use only one search interface and to search multiple vendor-supplied databases at once. Is it possible to provide this service without any dumbing down? What future direction should metasearch take in libraries?

Cumbersome cross-search

Current library metasearch typically relies on searching multiple source repositories at once, in parallel, at the point of request and then merging the results. This technique is called cross-search, or broadcast search. This is such a common method of accomplishing this metasearch task that sometimes the phrases metasearch and federated search are used to refer only to this technique, but it is not in fact the only technique available.
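To make the technique concrete, here is a minimal Python sketch of cross-search. Everything in it is an assumption made for illustration: the two "sources" are local stand-in functions rather than real vendor databases, and the names fast_source, slow_source, and cross_search are hypothetical. It shows the query being broadcast in parallel at the point of request, and a slow source being dropped when it misses the timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical stand-ins for remote vendor databases: each "source" is a
# function that takes a query and returns a list of result titles.
def fast_source(query):
    return [f"{query}: article from source A"]

def slow_source(query):
    time.sleep(1.0)  # simulates a remote database that responds too slowly
    return [f"{query}: article from source B"]

def cross_search(query, sources, timeout=0.25):
    """Broadcast the query to every source; merge whatever returns in time."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, not_done = wait(futures, timeout=timeout)
    merged = []
    for future in done:
        merged.extend(future.result())
    # Sources that missed the deadline are reported, not waited for.
    timed_out = sorted(futures[f] for f in not_done)
    pool.shutdown(wait=False)
    return merged, timed_out

results, missing = cross_search(
    "metasearch", {"A": fast_source, "B": slow_source}
)
print(results)
print("timed out:", missing)
```

Note that the merge step here can only concatenate whatever arrived; anything smarter has to happen while the user waits.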

Many of the problems and dumbed-down aspects of our current cross-searching metasearch implementations are in fact neither intentional nor inherent to metasearch but are instead consequences of the cross-search approach. At a 1999 digital libraries conference in Santa Fe that eventually led to the creation of the ‘Open Archives Initiative – Protocol for Metadata Harvesting’ (OAI-PMH), commonly used in digital library and digitized archives environments, the library history of cross-search via Z39.50 was considered, and certain fundamental inherent problems were identified. The criticisms include that our current cross-search metasearch tools are slow, can only provide limited results, and sometimes time out. The differences in query language syntax and search attribute variation mean that our current tools cannot do a very good job of allowing complex fielded searches and often resort to a lowest-common-denominator, keyword-like search. Our current tools also do a poor job of relevancy ranking, deduplicating, and other tasks that involve merging results on demand while a user is waiting for them. None of these problems are inherent to metasearch, but they are inherent challenges of the cross-search approach.

These problems can be addressed, and some recent library cross-search products have started to add advanced features, but the fact that problems recognized in 1999 still plague us today shows how big the challenge is. Roy Tennant, user services architect at the California Digital Library (CDL) and an LJ columnist, has long advocated local indexing and says of the cross-search approach, ‘In a nutshell, our inability to do any preprocessing of the data is a major hindrance.’

Lithe local indexing

The Santa Fe conference birthed a better alternative. ‘One solution would be to get all the metadata records together in one place. The [prototype] brought to the Santa Fe meeting demonstrated a cross-archive digital library providing services based on a collection of metadata harvested from multiple archives,’ the Santa Fe report noted. OAI-PMH calls this approach harvest because the metadata is automatically harvested from multiple sources. We will call it a local index, because in some cases the data could be delivered from vendors instead of collected via harvesting, but since the data still ends up together in one place in a single index, harvest is really just a species of the same approach.
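As a rough illustration of what harvesting looks like in practice, the Python sketch below builds an OAI-PMH ListRecords request and pulls identifiers and titles out of a response. The repository address and the sample response are invented for this example, and the helper names list_records_url and parse_records are hypothetical. The sketch parses a hard-coded response rather than contacting a server, and it ignores the resumption tokens that real repositories use to page through large result sets.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH responses and by Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request for one repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_records(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    return records

# Invented sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repository.example.org/oai"))
print(parse_records(SAMPLE_RESPONSE))
```

The harvested pairs would then be loaded into the single local index, alongside any records delivered directly by vendors.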

The local index approach was not only chosen for OAI-PMH but is also used by Google Scholar. Its index contains metadata, and often full text, from a wide variety of source databases, including many library-licensed content providers. When a user does a search, the Google Scholar software does not cross-search each source database. Google Scholar only has to look in its own local index, where it has already collected all of that metadata and text. This allows Google Scholar to avoid many of the pitfalls of our cross-search products. As our users know, Google Scholar is fast, and it never reports, ‘Sorry, content from database X cannot be included because the search timed out.’

Google Scholar software can analyze all of this harvested data and metadata at its leisure and store the results in advance of actual searches, instead of needing to perform on-the-fly processing while a user waits for results. This preprocessing can be applied to all the articles in the local index, not just the first 50 articles to be returned by each source in a cross-search. This makes more advanced relevancy ranking and deduplicating much more feasible, and Google Scholar performs these tasks, although far from perfectly, better than our cross-search products typically do. It’s not just because of Google’s deep pockets; it’s because the local index approach has certain inherent advantages over cross-search.
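The Python sketch below illustrates the general idea of preprocessing in a local index; it is not a description of how Google Scholar works internally. The record fields, the crude title-and-year matching rule, and the function names normalize, build_index, and search are all assumptions made for illustration. Deduplication and index building happen once, in advance, over every record; answering a query then touches only the local index.

```python
from collections import defaultdict

def normalize(text):
    """Crude matching key: lowercase, keep only letters, digits, and spaces."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def build_index(records):
    """Deduplicate all records and build an inverted index before any search."""
    unique = {}
    for rec in records:
        # Assumed rule: same normalized title and year means the same article.
        key = (normalize(rec["title"]), rec["year"])
        entry = unique.setdefault(key, {"title": rec["title"], "sources": []})
        entry["sources"].append(rec["source"])
    docs = list(unique.values())
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for word in normalize(doc["title"]).split():
            index[word].add(doc_id)
    return docs, index

def search(query, docs, index):
    """Answer a query from the local index alone, with no remote calls."""
    words = normalize(query).split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return [docs[i] for i in sorted(hits)]

# Invented sample data: the same article arrives from two vendors.
records = [
    {"title": "Metasearch in Libraries", "year": 2006, "source": "Vendor A"},
    {"title": "Metasearch in libraries.", "year": 2006, "source": "Vendor B"},
    {"title": "Local Indexing Basics", "year": 2005, "source": "Vendor A"},
]
docs, index = build_index(records)
print(search("metasearch libraries", docs, index))
```

Because the duplicate was merged at load time, the user sees one result that lists both vendors, and no merging work is done while the user waits.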

Why cross-search?

If it was clear even in 1999 that cross-searching was inferior to local indexing, why are we still doing it? Local indexing may take some technical expertise, but proper software should be able to ameliorate that. Off-the-shelf technology to index licensed content locally has been slower to appear than cross-searching software, but software products that support local indexing have started to become available.

The Endeca product North Carolina State University has used for its innovative faceted browsing library catalog is based on a harvester/local indexer, as are similar library-market-specific products from Ex Libris, including Primo, the MetaIndex component of MetaLib, and the harvester in Digitool.

One reason these products may have been slow to arrive and still aren’t often thought of for a Google Scholar style indexing of licensed scholarly output is that such indexing requires not just local indexing technology but generally some cooperation from content providers.

In order to index locally, you must acquire the metadata locally (and possibly license full text for searching). Publishers and vendors have not necessarily been eager to let libraries have the metadata, and libraries may not have had the infrastructure in place to give it. But content providers are starting to provide this information to Google and Google Scholar. Everyone wants to be found on Google, and this requires making sure Google has the ability to index your metadata and full text. EBSCOhost Connection and Gale AccessMyLibrary have both put their metadata on the public web for indexing by Google and other spiders.

If Google is doing it…

If Google Scholar were perfect (and we didn’t mind having only a single source of metasearch service), most libraries wouldn’t need to provide metasearch. The most obvious and frequently cited problem with Google and Google Scholar is the lack of exhaustivity: you can’t know precisely what the Google index includes and what it leaves out. There’s no guarantee that all of your library’s expensive licensed content, which you want to make sure users can find, is included in Google Scholar. As Tennant explains, ‘What you don’t search is as important as what you do. If you cast your search too broadly you increase all the noise and false drops in the results.’ That’s why CDL is creating metasearches tailored to specific disciplines, subjects, or audiences. Google’s approach to collecting information to index has been to spider the web, without coordination with the content providers. While this results in a large content set, it also results in an indexed collection with unclear boundaries and fairly unstructured data. Google Scholar has more coordination with providers (in some cases apparently indexing content that is not available to the general public), but it uses an approach that results in a similar index product. But this is not the only way to acquire data for local indexing.

If libraries licensed full text or metadata by cooperating with the content provider, they could know exactly what they have in their index and be assured of its completeness. The indexed data could have fielded metadata suitable for more powerful searching. When locally indexing data, the library can still cluster this data into subject- or disciplinary-specific meta-searches. There is no reason to provide only a general universal search like Google Scholar does. Indeed, when locally indexing instead of cross-searching, these clusters can divide content within and between vendor boundaries in ways that are impossible with cross-search.
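A small Python sketch can show why such clusters are easy to build once the data is local. The records, the field names vendor and subjects, and the cluster definitions below are all invented for illustration. Each cluster is defined by subject metadata, so it can draw on part of one vendor's content and part of another's.

```python
# Hypothetical locally indexed records with fielded metadata.
RECORDS = [
    {"title": "Protein folding review", "vendor": "Vendor A",
     "subjects": {"biology", "chemistry"}},
    {"title": "Medieval trade routes", "vendor": "Vendor A",
     "subjects": {"history"}},
    {"title": "Enzyme kinetics", "vendor": "Vendor B",
     "subjects": {"chemistry"}},
]

# A cluster is defined by subject, not by which vendor supplied the record.
CLUSTERS = {
    "life-sciences": {"biology", "chemistry"},
    "humanities": {"history"},
}

def records_in_cluster(cluster_name):
    """Return every record that shares at least one subject with the cluster."""
    wanted = CLUSTERS[cluster_name]
    return [rec for rec in RECORDS if rec["subjects"] & wanted]

def search_cluster(cluster_name, word):
    """Search titles within one cluster only."""
    return [
        rec["title"]
        for rec in records_in_cluster(cluster_name)
        if word.lower() in rec["title"].lower()
    ]

print([rec["vendor"] for rec in records_in_cluster("life-sciences")])
print(search_cluster("humanities", "trade"))
```

In this sample the life-sciences cluster spans both vendors, while Vendor A's content is split between two clusters.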

Local indexing won’t solve all problems with metasearch automatically. It paves the way to some solutions but also makes other problems more apparent. Using controlled vocabularies for retrieval will never work well across databases that use different vocabularies, and even complex fielded searches will never work as well as we’d like until there’s better metadata harmonization across databases. We also won’t always be able to get access to everything we might want to index locally. So for the foreseeable future, metasearch will not completely replace native interfaces. Local indexing is our path to making metasearch work as well as possible, not by dumbing it down but by ‘smartening it up.’

Can libraries get the content?

Google Scholar may well be the catalyst that makes it possible for libraries to get access to licensed full text and metadata. The Scholars Portal Project (see ‘Ontario Scholars Portal,’ below) has generally been able to get metadata from publishers and providers suitable for local indexed metasearch at little or no extra cost. Alan Darnell, manager of the project, suggests that the content providers realize that libraries providing their own locally indexed search services means less reliance on publisher-supplied search services, which ultimately saves publishers money. In some cases, the XML metadata feeds received by Scholars Portal even include full-text streams suitable for query matching and advanced presentation, although the user would still be directed to a complete digital copy to view or download the complete article.

The Scholars Portal Project pursues perpetual access rights for archival control of electronic text. It makes sense from a market perspective that this is going to cost more, since once you have perpetual access rights, you can continue to access content without continuing to pay a provider. The provider is going to charge you more up front to make that feasible for its business, and Kit Wilson, manager of the Lois Hole Campus Alberta Digital Library (ADL), confirms that ‘it’s much more expensive to contemplate that kind of collection.’ There are, of course, important reasons for libraries to pursue archival perpetual access, but there are necessarily financial implications.

Libraries may want to explore acquiring, at a more affordable price, the right to load and index full text locally without perpetual access, so that local indexing is permitted only for as long as the library continues to license the content. If not full text, even just locally loaded metadata can be effective. Wilson says the industry is still in the initial stages of routinely licensing any of these rights and that sometimes aggregators don’t themselves have the right to provide the licensing the ADL is interested in. Libraries need to start entering into these discussions with vendors in order to develop a market that will give us feasible licenses to provide the advanced services we want to give to our users.

Additional weapons

Exploring issues of metasearch shows that Google is neither libraries’ replacement, savior, nor enemy. Google provides additional tools, which have certain advantages and disadvantages like all tools. We can learn from it and use what we learn in trying to develop tools that deliver better service – whether through local development or demand in the software market. Since Google is more than just ‘another tool’ with regard to its market share and effect on the market, it has a positive effect on libraries to the extent that it has made metadata more freely available, unless only Google can acquire this free metadata.

All libraries are increasingly becoming digital libraries. Technologies we may be used to thinking of as belonging to the realm of digital libraries will increasingly be integral to the services of all libraries. OAI-PMH not only has lessons for licensed content metasearch, but it could be used for the orderly transfer of licensed content from providers to indexers.

Finally, local indexing is the likely future of metasearch in libraries – a metasearch that can be both more powerful and easier to use than what libraries currently have. Both the Ontario Scholars Portal Project and the Alberta Digital Library are large consortial projects that have the resources to start approaching these solutions at the beginning of an adoption curve. But all libraries should consider investigating licensing that will support locally indexed metasearch as soon as possible, so when the time is right to implement local index technology for metasearch, we will already have the licensing in place or available to support that move.


Jonathan Rochkind has an MLIS from the University of Washington Information School, Seattle, and works in the Systems Department of the Johns Hopkins University’s Sheridan Libraries, Baltimore.

LINK LIST
Access My Library
accessmylibrary.com
EBSCOhost Connection
connection.epnet.com
Ex Libris
www.exlibrisgroup.com
Google Scholar
google.com/scholar
Ontario Scholars Portal
www.scholarsportal.info
Open Archives Initiative
openarchives.org
WebFeat
webfeat.org

Ontario Scholars Portal

The Scholars Portal Project of the Ontario Council of University Libraries aims to bring together a number of key digital services for member libraries. Alan Darnell, manager of the project, says they decided that ‘single search’ was a key service to be provided and that for the reasons of performance, reliability, and flexibility, local-loading of article metadata was the best approach. Darnell says that with Google Scholar, ‘It’s not always clear who the publishers are, the scope of the content [or] the depth of the content…. Even if Google Scholar had the coverage, we’d still feel the responsibility to do the local loading. I don’t think there’s any reason we can’t do it better [than Google].’

Darnell says they have succeeded in creating a metasearch environment through local loading that works better and more powerfully than typical cross-search: ‘In a [cross-search] environment, after you’ve done that first search, you need to drop into the native interface to do anything else. In our [locally indexed] single search, [users] can stay within our environment longer, although eventually they might still need to drop into the native interface.’ The Scholars Portal single search also allows member libraries to create custom subject or disciplinary ‘clusters’ by selecting from available databases that have been locally indexed.

The Scholars Portal single search is only in its beginning stages, but Darnell says that initial evidence suggests success: ‘Our usage numbers show that if you make it easier to search, users get presented to the journal literature in greater numbers than they would if they only had the specialized search engine. There’s always time later on to teach them specialized searching if that’s what they want to do. I don’t see any disadvantages in lowering that learning curve.’ Just as importantly, with the local indexing technique, the power and features of a metasearch tool can be significantly increased, especially if you can get access to full text to load as an aid to the searching algorithms: ‘As we do get more and more full text, we will be able to get meaningful relevancy ranking, useful clustering, more advanced presentation of result sets,’ and other advanced features that will serve more needs without requiring the user to switch to a ‘native’ interface.
