It took Harvard Law School (HLS) nearly 200 years, since its founding in 1817, to amass its collection of United States case law reporters, one of the world’s largest collections of legal materials. It will take the HLS Library about three years to scan and digitize that collection and, in partnership with legal technology startup Ravel Law, make it freely available to the public online. If all goes according to plan, by early to mid-2017, the “Free the Law” project will have digitized the “official print versions of all historical U.S. court decisions,” according to the HLS Library blog.
This will encompass the contents of 40,000 books—approximately 40 million pages of law—with publication dates from the 1700s to the present day. The large-scale digitization project is gaining considerable attention in the legal information field, where most case law is organized in expensive commercial databases or, where it is free online, may suffer from considerable gaps in coverage. That the common law is not freely accessible online, according to an overview posted on the HLS Library blog, “impairs justice and equality and stifles innovation.”
Free the Law started with a grant-funded pilot program in 2013: over the course of approximately eight months, the Harvard Library Innovation Lab scanned just under a million pages, homing in on best practices for document processing. The partnership with Ravel, a California-based legal research and analytics company, followed, and at the end of July 2015 Ravel and HLS finalized an agreement setting forth their respective obligations and responsibilities.
PAPER TO PIXELS
A project team was assembled on-site at HLS in August 2015 and has been processing volumes ever since. Six employees work full-time during library operating hours, and many others have contributed to the work on a part-time or temporary basis.
The volumes being scanned are all stored at a facility 30 miles away from the HLS campus, and each Wednesday 600 volumes are delivered to the Innovation Lab. Once unloaded, each volume is analyzed and its details are entered into a web-based tracking tool developed at HLS for this project. Every volume receives a cover sheet with identifying data and a unique barcode, so that it can be tracked through each step of the workflow. This is also the point at which volumes are analyzed for conservation issues, like ripped or missing pages, and considered for rare book/special collections status, in which case the volume will be kept in its original binding and its pages photographed at Harvard’s Widener Library.
Analyzed volumes are then prepped for scanning. Damaged pages are repaired with tape or heat-set tissue, and pages are hand-sliced from their bindings and divided into three sections. Those pages are placed in a large power cutter that, exerting approximately 5,000 pounds per square inch of pressure, chops off the glued and sewn spines, leaving a clean edge.
The resulting loose pages are run through a high-speed scanner. The scanner model used at HLS, an ImageTrac Intelligent Scanner manufactured by the company IBML, is widely used in the financial services industry, but has never before been used for a cultural heritage project like this one. At its best, the scanner can scan about 10,000 pages per hour (even at its slowest it can manage 3,000 pages per hour), and it runs about 50 to 60 hours per week. Once scanned, the content of each volume is saved as TIFF and JP2 files and receives a quality control review—checking for missing or duplicate pages, cropping errors, or other problems. The loose pages are organized and cleaned, reunited with their binding and cover sheet, and vacuum sealed. The completed volumes will be transferred to underground storage vaults in Louisville, KY.
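The quality control review described above checks, among other things, for missing or duplicate pages. The following is only a minimal sketch of how such a check could work, assuming each scanned image has already been labeled with its printed page number. The function name `check_page_sequence` and the sample data are hypothetical and are not taken from the HLS workflow.

```python
from collections import Counter


def check_page_sequence(scanned_pages, first_page, last_page):
    """Return (missing, duplicates) for one volume's scanned page numbers.

    Hypothetical helper: assumes pages are numbered consecutively from
    first_page to last_page, which real reporters may not always follow.
    """
    counts = Counter(scanned_pages)
    # A page is missing if it never appears in the scan.
    missing = [p for p in range(first_page, last_page + 1) if counts[p] == 0]
    # A page is a duplicate if it appears more than once.
    duplicates = sorted(p for p, n in counts.items() if n > 1)
    return missing, duplicates


# Illustrative data only: page 4 was skipped and page 3 was scanned twice.
missing, duplicates = check_page_sequence([1, 2, 3, 3, 5, 6], 1, 6)
print("missing:", missing)
print("duplicates:", duplicates)
```

A check like this only catches sequence problems. Cropping errors and other image defects would need a separate review of the images themselves.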
For the time being, the Free the Law project shares space with the other, more traditional functions of the HLS library, and the team tries not to interfere with the students who use that space; for example, the noisy power cutter is typically not used during prime study hours. According to Steve Chapman, manager of digital strategy for collections at HLS, people are “tolerating and getting used to the library as a production center and not just a quiet study space.” Chapman also said that it is useful to conduct the scanning project on-site, where it is visible to community members, as it “showcases what a modern library does” and provides “a magnet where faculty and students see it, ask questions, and get involved.”
Turning paper to pixels is only one step in making this case law digital. HLS and Ravel are collaborating with an external vendor, Innodata, to transform the TIFF and JP2 images into machine-readable XML files. The Innodata processing will extract individual cases from scanned images of entire volumes, and will redact copyrighted publisher-created annotations and any other editorial content. Innodata also runs additional quality control checks on the digital objects. This step will make it possible for users to search the text of the scanned cases and make use of them.
As of late November 2015, the HLS scanning side of this endeavor was nearly one-third complete: more than 14,000 volumes and 13 million pages had been scanned and pushed through to Innodata for further processing. Innodata has completed its formatting of all of the California state cases and has returned them to HLS as text-searchable digital objects. Ravel intends to make case law available as it comes through from HLS, and the California law should be online by the end of 2015. New York cases are next; Delaware, Massachusetts, Illinois, and Texas are close behind.
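For a rough sense of scale, the figures reported in this article (about 40 million pages in total, 13 million already scanned, a scanner that handles 3,000 to 10,000 pages per hour and runs 50 to 60 hours per week) can be combined into a back-of-the-envelope estimate of the scanning time remaining. This is a sketch under simplifying assumptions, not a project forecast: it treats the scanner as running at a constant rate and ignores prep work, quality control, and downtime. The function name `weeks_to_scan` is hypothetical.

```python
# Figures reported in the article (approximate).
TOTAL_PAGES = 40_000_000
PAGES_SCANNED = 13_000_000


def weeks_to_scan(pages, pages_per_hour, hours_per_week):
    """Weeks needed to scan `pages` at a constant rate, rounded up."""
    pages_per_week = pages_per_hour * hours_per_week
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-pages // pages_per_week)


remaining = TOTAL_PAGES - PAGES_SCANNED

# Best case: fastest reported speed and the longest reported weekly run time.
best_case = weeks_to_scan(remaining, 10_000, 60)
# Worst case: slowest reported speed and the shortest reported weekly run time.
worst_case = weeks_to_scan(remaining, 3_000, 50)

print(f"Remaining pages: {remaining:,}")
print(f"Estimated weeks remaining: {best_case} to {worst_case}")
```

Under these assumptions the remaining scanning would take somewhere between 45 and 180 weeks, a range wide enough to show how much the schedule depends on sustained scanner speed.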
HARVARD’S PARTNERSHIP WITH RAVEL
According to Adam Ziegler, manager of special projects at the Innovation Lab, the collaboration between Harvard and Ravel came about because HLS had “the vision and the content”—in the form of printed case law—to make the law digitally available, but not the financial and technological means to make this a reality. Ravel is responsible for designing and managing the online output. The raw case law material digitized by HLS and Innodata will be added to the Ravel platform, which allows users to view case connections and use data visualization tools to pinpoint influential cases on a given issue. As Steve Chapman, Manager of Digital Strategy for Collections at Harvard Law School, described it, “anybody who chooses to access Ravel has a means to engage with the law.”
Under the Harvard-Ravel agreement, Ravel is paying all of the costs of digitizing case law. HLS owns the resulting data, and Ravel has an obligation to offer free public access to all of the digitized case law on its site and to provide non-profit developers with free ongoing API access (Ravel may charge for-profit developers). Ravel will have a temporary exclusive commercial license for a maximum of eight years.
For the duration of that commercial license, there will be a restriction on bulk download of the case law, with some notable exceptions. Harvard may provide bulk access to members of the Harvard community and to outside research scholars (so long as they accept contractual prohibitions on redistribution). In addition, if a jurisdiction has committed to publishing its future decisions online, then all case law from that jurisdiction will be available for bulk download—Illinois and Arkansas meet this condition, and it is hoped that other states will join their ranks. The Free the Law project is working to encourage courts to create content in a way that is compatible with digital preservation and free public accessibility.
The online Ravel platform offers tools that permit users to read, contextualize, and interpret case law. Ravel’s free, “open” plan—available to those who register with an e-mail address—permits users to search case law (with the body of searchable information expanding as Harvard provides digitized materials) and use basic case reading, annotation, and search visualization tools. Search visualization provides case “maps” showing the relationships between cases addressing a particular topic, permitting users to determine a given case’s authority and relevance. Cases in a map can be filtered by court or by relevance, and are sorted in chronological order. Users may also make their own highlights and annotations in the case text. There are other Ravel features available only to users who upgrade to “Advanced” or “Elite” plans. Case analytics offered under the Advanced plan, for example, place margin notes alongside the text of a case, explaining how often and in what manner each page of a case has been cited in subsequent case law.
CONTEXT AND CONNECTIONS
Currently, the main focus of the HLS team is on preserving case law in a digital format and providing widespread access to raw material. This is intended to lead to creativity and advances in legal research, scholarship, and technology, with the output from the Free the Law project morphing and improving as people engage with the material. As Ziegler told LJ, it is “really important to us that this becomes the foundation of an ecosystem for people to do really interesting things.” John Palfrey, former vice dean for library and information resources at HLS and author of BiblioTech: Why Libraries Matter More Than Ever in the Age of Google, echoed this sentiment, emphasizing that the digitized case law can serve as a “platform for innovation in the future” and that we “cannot predict what will actually happen [with this information] in the future—the unknown is part of the excitement.”
Making the eventual collection of case law accessible to users through searches and indexing is its own challenge, independent of HLS’s ambitious scanning operation. Jake Heller, CEO of legal research and analysis technology startup Casetext, observed that “making the law free is only step one” and “the text of the case is only the beginning…the interesting question is how you make the context around the law available.” Part of the benefit of expensive databases like LexisNexis and Westlaw is the availability of tools to find connections between cases and determine their relative authority and importance. HLS’s work, Heller said, is a first, “really important step” toward freeing the law, but it is not the end of the process.
Brian Sheppard, an Associate Professor at Seton Hall School of Law, cautioned against “people measuring the quality of the [Free the Law] database by its breadth.” While the vast historic span of the HLS collection is a “wonderful gift to legal historians,” he said, what matters for true “access to justice” is an easily searchable interface that presents information in a way that “narrows the difference between a lawyer and a non-lawyer.” Ravel’s platform of data visualization and search tools will be useful for making sense of the scanned cases, according to Sheppard, so long as helpful functions are made available along with the free case law (and not limited to paying users).