For all its use to researchers, the Internet can be an awfully ephemeral thing. Websites change hands, services that were once free land behind paywalls, and servers go offline. Whatever the reason, the result is the same—all too often, a once-valid link no longer directs users to the information they need. For many of us, the familiar 404 message, indicating that a page can’t be found, is a common but inconsequential hassle of Internet use. For scholars and legal professionals, though, being unable to find a piece of information cited in a court case can be a costly and time-consuming hurdle. Now Perma.cc, a new service spearheaded by the Harvard Law School Library, is aiming to put a stop to disappearing links to citations in legal documents and court decisions by creating individual caches of content at the moment that authors and journal editors cite it.
The problem is growing more serious as information about court cases and legal decisions is increasingly stored and accessed online. It takes two forms: so-called “link rot,” when a link to a citation in a document or court case posted online no longer directs users to a valid page, and “reference rot,” when a citation links to a page that still exists but no longer hosts the information cited. Earlier this year, a report by the Georgetown University Law Library’s Chesapeake Digital Preservation Group (CDPG), which curates and archives legal decisions and government documents posted online, found that fewer than half (44 percent) of the URLs from its initial 2007 dataset remained intact six years later. That degree of link rot can cost lawyers and researchers a great deal of time.
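For readers who want the distinction between the two kinds of rot in concrete terms, here is a minimal Python sketch. It is an illustration only, not how the CDPG or Perma measures anything: the function names `fingerprint` and `classify_link` are hypothetical, and comparing an exact hash of the page text is a deliberate simplification, since real pages change in trivial ways (ads, timestamps) without losing the cited material.

```python
import hashlib
from typing import Optional


def fingerprint(body: str) -> str:
    """Return a SHA-256 hex digest of the page text, ignoring whitespace differences."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_link(status_code: Optional[int], body: Optional[str],
                  cited_fingerprint: str) -> str:
    """Classify a cited URL given the result of fetching it today.

    status_code is None when no response came back at all (hypothetical convention).
    cited_fingerprint is the fingerprint recorded when the page was first cited.
    """
    # No response, an HTTP error such as 404, or an empty result: the link is dead.
    if status_code is None or status_code >= 400 or body is None:
        return "link rot"
    # The page loads, but it no longer holds what the author cited.
    if fingerprint(body) != cited_fingerprint:
        return "reference rot"
    return "intact"
```

A dead URL is reported as link rot no matter what was once there; a live URL counts as reference rot only when its current text no longer matches the fingerprint taken at citation time.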
Harvard researchers working on the development of Perma found the problem to be even worse. A sample of Harvard’s own publications showed that more than 70 percent of citations linked online in the Harvard Law Review, Harvard Journal of Law and Technology, and Harvard Human Rights Journal led to dead ends that would not help readers learn more about the citation or fact-check the article. The problem wasn’t limited to Harvard, though: Perma team members also found that fully half of the URLs in Supreme Court opinions posted online no longer link back to the information cited.
“It’s a huge issue. From a legal and academic perspective, citation is the foundation on which we build the progression of thoughts. To be able to get to source material is crucial for that system to mean anything,” said Shailin Thomas, a research associate on the Perma team. “As the Internet becomes more central to scholarship, more and more citations are going to be linked, and if 70 percent of them don’t lead to source material, how can you judge the ideas being put forth?”
Of course, the expiration of information on the Internet is not a new problem, and there are already a number of organizations, like The Internet Archive, devoted to helping preserve the information presented online. But each has its limitations. The small team of legal experts, researchers, and software designers working on Perma, through a partnership between the Harvard Law School Library and Harvard’s Innovation Lab, is developing a lasting solution to link rot. When a link is plugged into the Perma.cc system (which is in beta testing right now), Perma makes a cache of the site at that time, along with metadata tags to identify when the Permalink was created. It returns that information as a Permalink that authors or editors can link to without worrying that it will disappear one day. Perma also captures and stores a screenshot of the site, to preserve the cited URL as it existed and ensure that people following citations can go back to the information they’re looking for, not a “file not found” page or different information that could muddy the waters about what a particular judge or justice meant to reference.
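The capture-then-link flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Perma’s actual code or API: the `Archive` and `Snapshot` names, the `archive.example` address, and the way the short link key is derived are all hypothetical, the store lives only in memory, and the screenshot step is left out.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class Snapshot:
    """A frozen copy of a cited page plus metadata about when it was captured."""
    url: str
    content: bytes
    captured_at: datetime


class Archive:
    """Hypothetical in-memory archive that hands out stable links to snapshots."""

    def __init__(self, base: str = "https://archive.example/") -> None:
        self.base = base
        self._snapshots: Dict[str, Snapshot] = {}

    def create_permalink(self, url: str, content: bytes,
                         captured_at: Optional[datetime] = None) -> str:
        """Store the page as it exists now and return a link to that copy."""
        captured_at = captured_at or datetime.now(timezone.utc)
        # Assumed key scheme: a short hash of the URL and the capture time.
        seed = f"{url}|{captured_at.isoformat()}".encode("utf-8")
        key = hashlib.sha256(seed).hexdigest()[:10]
        self._snapshots[key] = Snapshot(url, content, captured_at)
        return self.base + key

    def resolve(self, permalink: str) -> Snapshot:
        """Return the stored snapshot, regardless of what the live URL serves today."""
        key = permalink.rsplit("/", 1)[-1]
        return self._snapshots[key]
```

The point of the design is that `resolve` never touches the original URL: whatever later happens to the live page, the link returns the copy and capture time recorded when the citation was made.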
To ensure the links that the service generates stay valid, the organizers behind Perma envision using a distributed network to store the cached sites, with partner libraries around the country creating copies of the data to ensure redundancy. Partner libraries would also help to fund the operation of the service, which is planned to be free and open to the public when it leaves the beta testing stage.
“The project started here, but we don’t want it to end here,” said Thomas of Harvard’s role in the project. Perma already has partners in 26 academic libraries on campuses including Stanford, Boston College, and The University of Texas, and is also working with institutions like The Internet Archive and the Digital Public Library of America. “The goal is to have at least some of those libraries set up mirrors to the database so that if something here goes down, other versions of it will still be accessible.”
While Perma aims to make caches of everything exactly as it was when a researcher took a snapshot for citation, it won’t have access to material that’s behind a paywall or requires a subscription to view. If you try to make a Permalink of a Wall Street Journal article, for example, you’ll get an image of the advertisement asking you to buy a subscription, not the article you’re citing. “We’re building Perma with an eye towards respecting content providers,” said Thomas. “When Perma is released, it will give you a message telling you the content can’t be displayed, and ideally will have a place that you can contact the content owner or a library that can provide the information.”
Eventually, there may also be places for Perma to work alongside other programs working to tackle separate, but related, issues, like the Free Law Project at UC Berkeley, which is trying to bring the entirety of U.S. case law online and make it openly accessible. “Legal opinions are all public domain,” said Brian Carver, a professor at the UC Berkeley School of Information. “But we have very limited access to them unless we want to pay a fee.” Carver said the database that Perma is developing could be a valuable asset in the future, and he anticipates integrating Permalinks into the FLP’s growing law library.
As for when Perma will move out of beta and into the wild, the team is playing it by ear, but Thomas says the system will remain in testing at least until the distributed network is in place. The important thing, though, is making sure that the system doesn’t become overwhelmed on its release. “As we develop [Perma], we want to make sure it’s fully operational,” said Thomas. “We can’t afford to have it crash because we have a flood of activity and have journals unable to use it.”