While archives and special collections are receiving a growing volume of born-digital content, the library field has not yet established a set of common tools and best practices for preserving email correspondence. Toward this goal, the Andrew W. Mellon Foundation and the U.K.-based nonprofit Digital Preservation Coalition (DPC) in November 2016 announced the formation of a Task Force on Technical Approaches for Email Archives led by cochairs Christopher Prom, assistant university archivist/Andrew S.G. Turyn Endowed Professor at the University of Illinois at Urbana-Champaign, and Kate Murray, IT specialist in the Technology Policy Directorate at the Library of Congress (LC).
“The technical issues around email preservation are compounded by the sheer scale of the collections,” Murray explained in an announcement regarding the task force. “Many of us have thousands, sometimes tens of thousands, of email messages in our mailboxes. Solutions need to move beyond the boutique and one-off to community supported, large-scale, and customizable options.”
Prom and Murray both became involved with the task force at the request of Donald Waters, senior program officer for Scholarly Communications for the Mellon Foundation. At LC, Murray researched email formats as a coauthor of the library’s “Sustainability of Digital Formats” site and was a coordinator of the 2015 Archiving Email Symposium. Prom had worked with DPC on an earlier project in 2010 and 2011, in which he researched “standards specifically related to email transmission, receipt, storage, [and] tools, and projects that were attempting to preserve correspondence in email form,” he told LJ.
One of the recommendations of “Preserving Email,” Prom’s report to DPC, was that cultural heritage agencies, grant funding agencies, libraries, archives, and other groups needed to work together to develop tools to enable more seamless workflows and management practices for email preservation.
“Email…really is a successor to the types of private correspondence that people sent on behalf of businesses or organizations or their own personal lives in the past,” Prom said. “I continue to feel very strongly that the historical record is incomplete if we don’t have access to some of these private correspondences, and preserve those in the same way that we have preserved letters and [typescript or handwritten] correspondence.”
“We’ve talked in our task force about how there’s no longer a storage locker in your attic with your grandfather’s letters home from the army,” Murray added. “All of that communication is via email now.”
Yet email poses several technical challenges, which Prom outlines in his report. An email account typically “contains records of disparate context, structure and content, documenting activities both mundane and extraordinary,” a muddle of letters to friends and family, notifications and private correspondence with businesses, newsletters, spam, and more. Individual messages often contain attachments of disparate format and content, the report notes, such as digital images, PDFs, and audio or video files. Messages also often contain links to external content, including web pages or videos.
“Simply capturing and preserving the bits that comprise a message is challenging enough, but further steps are required if the entirety of the message, including attachments, is to be accessible in the future,” Prom writes. “Since each email message includes a small amount of structured data (the header) along with a mass of unstructured data (the body and the attachments), preservation actions can entail a degree of complexity far beyond other typical digital preservation activities, such as migrating a homogenous set of documents, images or audio recordings.”
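To make the header/body/attachment distinction concrete, here is a minimal sketch using Python's standard `email` package. It is an illustration only, not a tool described in Prom's report or used by the task force; the sample message, the addresses, and the `summarize` helper are all hypothetical.

```python
import re
from email import message_from_string, policy

# Hypothetical sample message: a few structured headers, a plain-text body
# with an external link, and one small base64-encoded attachment.
RAW = """\
From: donor@example.org
To: archivist@example.org
Subject: Family letters
Date: Mon, 07 Nov 2016 10:00:00 -0000
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUND"

--BOUND
Content-Type: text/plain; charset="utf-8"

See the scan attached. More at http://example.org/letters
--BOUND
Content-Type: application/pdf
Content-Disposition: attachment; filename="letter.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQK
--BOUND--
"""


def summarize(raw):
    """Split one message into headers, body text, links, and attachments."""
    msg = message_from_string(raw, policy=policy.default)

    # The header block is the small, structured part of the message.
    headers = {name: str(msg[name]) for name in ("From", "To", "Subject", "Date")}

    # The body is unstructured text; prefer the plain-text part if present.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    # Links point at external content that is not stored in the message.
    links = re.findall(r"https?://\S+", body)

    # Attachments can be of any format; record name, type, and decoded bytes.
    attachments = [
        (part.get_filename(), part.get_content_type(), part.get_content())
        for part in msg.iter_attachments()
    ]
    return headers, body, links, attachments


headers, body, links, attachments = summarize(RAW)
```

Even in this toy case, a preservation workflow would have to track four different kinds of content from a single message: the headers, the body text, a pointer to a web page that may later disappear, and an embedded file in its own format.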
Other challenges include the scale of many email collections, the proprietary nature of email systems, and complications involved with accessing password-protected content, Murray noted.
Also, while the management of letters and personal papers has always involved issues of privacy, confidentiality, and access management, those issues are magnified with these large digital collections.
“There’s both sensitive information and personally identifiable information,” Murray explained. “So, when a donor donates their email to an institution, obviously they’re the party of interest, but they may be corresponding with people—either family members or other people of their acquaintance—who do not want to have their email archived permanently. The challenge of natural language processing to identify what those sensitive issues may be and then redact them…is certainly something we think about.”
Digital formats also can make donors more sensitive to concerns about privacy because “it seems much more reproducible,” Prom said. One of the primary goals of the project, he added, is to recommend practices and tools that will help archivists and curators develop trusted relationships with donors regarding this content.
“Using natural language processing software, or entity extraction software, or other advanced machine learning techniques not for their own end, but to allow archivists and curators to sit down with people, look through records, pull out parts that they might want to restrict, and give them some level of confidence that the repository is acting in a responsible fashion,” Prom said.
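As a rough illustration of that kind of review aid, the sketch below flags candidate sensitive strings so a person can decide what to restrict. It deliberately swaps in simple regular expressions for the natural language processing and entity extraction software Prom mentions, so it will miss names, topics, and context that real tools aim to catch. The pattern set and the `flag_sensitive` and `redact` helpers are hypothetical.

```python
import re

# Hypothetical, deliberately narrow patterns; a stand-in for real
# entity extraction, not a replacement for it.
PATTERNS = {
    "email_address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_like": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def flag_sensitive(text):
    """Return (label, matched_text, start, end) tuples in document order."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


def redact(text, hits):
    """Replace each approved hit with a bracketed label."""
    # Work from the end of the text so earlier offsets stay valid.
    for label, _, start, end in sorted(hits, key=lambda hit: hit[2], reverse=True):
        text = text[:start] + "[" + label.upper() + "]" + text[end:]
    return text


sample = (
    "Call me at 555-123-4567 or write to aunt.may@example.org. "
    "My SSN is 123-45-6789."
)
hits = flag_sensitive(sample)
redacted = redact(sample, hits)
```

The list of hits is meant as a starting point for the conversation between curator and donor; the decision about what to restrict stays with them, and `redact` applies only the hits it is given.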
Connecting the experts
In addition to Murray, Prom, and DPC, participants in the task force include experts from Google; Microsoft; Stanford, Harvard, Yale, and Columbia Universities; the University of Manchester, UK; the National Archives and Records Administration (NARA); the Rockefeller Archive Center; the Smithsonian Institution Archives; the University of Texas at Austin; the University of Arizona; the University of Michigan; and Artefactual Systems.
The task force will build on prior reports, projects, and other groundwork laid by these experts and their institutions, such as the NARA Capstone approach for preserving emails of leaders in government agencies, the Smithsonian Institution Archives’ DArcMail project, the MeMail project at the University of Michigan’s Bentley Historical Library, and the ePADD platform developed by Stanford University Libraries.
Prom also singled out recent work by archivists at Harvard, including a grid of tools that can be used for email preservation and reports on technical issues, implementation strategies, and policy recommendations, among them the “Email Archiving Systems Interoperability Report,” published in July 2016.
In the coming months, the task force will focus on articulating a technical framework for email preservation, examining how existing tools fit within that framework, and identifying any missing elements, according to the announcement. The group will then prepare a report “with recommendations for specific actions that archivists could take within five years to create, preserve, and provide access to records of electronic correspondence.”
Email preservation is becoming a pressing concern “because it’s arriving on our doorsteps,” Murray said. “We’ve all heard about the digital deluge for years and years, but now it’s actually here. And when curators and archivists are talking with donors about their collections, email is a rich source of data that collecting institutions would like to get. So, it’s arriving, but we have a gap in our toolsets to be able to ingest that material, make decisions about personally identifiable and sensitive information, and then try to make that data accessible.”