November 21, 2017

Open-Source Email Archiving Software Expands with IMLS Grant

The ePADD open-source email archiving and processing platform, developed by Stanford University Libraries, was awarded a $685,000 National Leadership Grant by the Institute of Museum and Library Services (IMLS) on August 31. The software “supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives,” according to the project site. “Email archives present a singular window into contemporary history; however, they are often inaccessible to researchers due to screening, processing, and access challenges, as well as the sheer volume of material.”

With funding from the National Historical Publications and Records Commission, and collaboration by Sudheendra Hangal, creator of Stanford’s Memories Using Email (MUSE) software, the team at Stanford Libraries’ Department of Special Collections and University Archives completed the proof-of-concept phase of ePADD’s development in July. This new grant from IMLS will fund the project for an additional three years, enabling the developers to enhance ePADD’s usability, scalability, and feature set, in partnership with the University of Illinois Urbana-Champaign, Harvard University, University of California, Irvine, and the Metropolitan New York Library Council (METRO).

“One of the reasons we created this software is that there were very few tools that would actually allow us to review and process a collection to make it discoverable and accessible,” Glynn Edwards, ePADD project director and head of technical services for Special Collections at Stanford University Libraries, told LJ. “For this software platform and our project that was our main goal.”

Optimized for Windows 7 (with Java 7 or higher) and Mac OS X, and designed to process email in both mbox and EML formats, ePADD includes separate modules for the appraisal, processing, discovery, and delivery of large email collections.

Confidentiality concerns

The appraisal module allows donors and archivists to review collections, including attached documents or photos, prior to transferring files to an archival repository. This includes tools that enable users to search for potentially sensitive content, such as credit card or Social Security information. If an institution or a donor has used specific formats for other types of sensitive information, such as a string of characters in a faculty identification card or any unique regular expressions that signal confidential communication within the collection, archivists can add those to ePADD’s automated processing forms. Email messages can then be flagged, annotated, or restricted individually or in bulk prior to processing.
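To make the idea of pattern-based screening concrete, the following is a minimal Python sketch of how messages might be flagged with regular expressions. It is not ePADD's code: the pattern set, the `FAC-` faculty ID format, and the function names `luhn_valid` and `flag_message` are all assumptions made for illustration, and the card check uses the standard Luhn checksum to cut down on false positives.

```python
import re

# Illustrative patterns only; these are assumptions, not ePADD's built-in rules.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    # Hypothetical institution-specific format: "FAC-" followed by six digits.
    "faculty_id": re.compile(r"\bFAC-\d{6}\b"),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def flag_message(text: str) -> set:
    """Return the names of the sensitive-data patterns found in `text`."""
    flags = set()
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Digit runs that fail the checksum are not treated as card numbers.
            if name == "credit_card" and not luhn_valid(match.group()):
                continue
            flags.add(name)
    return flags
```

An archivist could run a function like `flag_message` over every message body and then restrict or annotate the flagged messages in bulk, which is the kind of workflow the appraisal module is described as supporting.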

These features will help archivists make email collections as accessible as possible, while ensuring privacy and confidentiality for donors and third parties discussed in correspondence.

“Depending on the email archive…there may be issues with [the Family Educational Rights and Privacy Act] FERPA or [the Health Insurance Portability and Accountability Act’s privacy rule] HIPAA, or state or local statutes around privacy and confidentiality,” said Josh Schneider, ePADD community manager, and assistant university archivist for Stanford University’s special collections and archives.

“In archives, you need to work a lot with donor privacy as well as third-party privacy. So, we wanted to make a tool that had functionalities in it that let people easily search for potentially private or confidential information and take actions on messages containing that information, and do it in a way that really lends itself to bulk actions to deal with the volume [of email collections].”

During ingestion, the ePADD Processing module conducts several automated processes, Schneider said. For example, “it takes the various names and email addresses associated with a particular individual and it concatenates those, [resolving] the name of an individual,” he explained. “It also identifies and extracts named entities in the archive,” using OCLC FAST to search Library of Congress subject headings, as well as the LC Name Authority File, DBPedia, and the Virtual International Authority File. “Persons, organizations, or locations that are mentioned in the subject line or body of the mail message, ePADD identifies those and extracts those. And a lot of the advanced browsing functionality and search functionalities that ePADD does depends on that early activity of extracting those named entities.”
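The article does not describe how ePADD resolves correspondents internally, but the general idea of merging the names and addresses that belong to one person can be sketched in a few lines of Python. The heuristic below is an assumption chosen for illustration: it merges any two name and address pairs that share a normalized display name or a normalized address, using a small union-find structure. The function name `resolve_correspondents` is hypothetical.

```python
from collections import defaultdict
from email.utils import getaddresses


def resolve_correspondents(headers):
    """Group display names and addresses that appear to belong to one person.

    Naive illustrative heuristic: pairs are merged when they share a
    normalized display name or a normalized email address.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for name, addr in getaddresses(headers):
        addr_key = ("addr", addr.strip().lower())
        find(addr_key)  # register the address even if it has no display name
        if name.strip():
            name_key = ("name", " ".join(name.lower().split()))
            union(name_key, addr_key)

    groups = defaultdict(lambda: {"names": set(), "addresses": set()})
    for key in list(parent):
        kind, value = key
        field = "names" if kind == "name" else "addresses"
        groups[find(key)][field].add(value)
    return list(groups.values())
```

A heuristic this simple would wrongly merge two different people who share a display name, so a production tool would need additional signals or manual review by the archivist.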

Search and discovery

The discovery module runs on a web server, enabling remote users to search an archived email collection using a browser, with full-text access limited according to the donor’s wishes or an institution’s policies. Remote users must contact the host institution to request access to specific full-text messages or attachments. For example, Stanford’s own ePADD discovery module for the library’s Robert Creeley email archive enables browsing, searching, and graphing by named entities, but redacts all other text from each message.

“Because of policies at Stanford, we are only delivering the extracted entities: the persons, places, locations, and organizations,” Edwards said. “Within the body of the email message, you can see the extracted entities, but you won’t see the full text of the archive, nor do you see the domain for the correspondent’s [email].”

Searching can be limited to incoming or outgoing messages, and a bulk search query box enables users to search a block of text to match against a collection’s entity index. Graphing tools enable users to visualize how often specific people, organizations, and locations were mentioned within the archive, and when those entities were mentioned.

“It gives you really clean data, in which to see the top correspondence over time, or the top topics that have been discussed over time in the account,” Schneider said. By contrast, most email programs only facilitate discovery by searching. “You can’t go into, say, Gmail, and identify the top 10 people you corresponded with between 2005 and 2010, which locations were most discussed…those aren’t questions that most email programs can handle. But, ePADD, because it’s doing some indexing of the messages at ingest, is able to answer some questions like that.”
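The kind of question Schneider describes, such as finding the top correspondents within a date range, can be answered once message headers have been parsed and counted. The Python sketch below illustrates the idea using only the standard library. It is not how ePADD is implemented; the function name `top_correspondents` and its parameters are assumptions, and it works on raw message strings rather than on an ePADD index.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def top_correspondents(raw_messages, owner, start_year, end_year, n=10):
    """Rank the addresses the account owner exchanged mail with in a year range."""
    counts = Counter()
    owner = owner.lower()
    for raw in raw_messages:
        msg = message_from_string(raw)
        try:
            year = parsedate_to_datetime(msg["Date"]).year
        except (TypeError, ValueError):
            continue  # skip messages with a missing or unparseable date
        if not start_year <= year <= end_year:
            continue
        fields = (
            msg.get_all("From", [])
            + msg.get_all("To", [])
            + msg.get_all("Cc", [])
        )
        for _, addr in getaddresses(fields):
            addr = addr.lower()
            if addr and addr != owner:
                counts[addr] += 1
    return counts.most_common(n)
```

Calling `top_correspondents(messages, "owner@example.com", 2005, 2010)` would return up to 10 `(address, count)` pairs for that period, where `owner@example.com` is a placeholder for the account holder's address.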

On-site access

In contrast to the discovery module, the delivery module enables archivists to provide moderated full-text access to a processed email collection, typically in an on-site reading room. In addition to the searching and graphing functions of the discovery module, on-site users can generate complex tiered searches using a customizable lexicon, and explore images and other email attachments within the collection. Users can also request copies of messages or attachments using a “checkout cart”-type feature.

Users can download ePADD and a detailed user guide from the project website, library.stanford.edu/projects/epadd. The site also features a community resources page where new users can seek help, share expertise, or contribute a use case.

While installing the discovery module on a web server will likely require the help of an institution’s IT department, Edwards said that the software is otherwise flexible and scalable enough that interested users can download it to their personal computers to explore its features using their own email accounts.

Schneider encouraged archivists at other institutions to check out the free software, noting that “we’re doing our best to try to promote [ePADD] as a community resource. It’s open source, and we’re interested in getting use cases and really developing a community of practitioners around the software.”


Matt Enis (menis@mediasourceinc.com; @matthewenis on Twitter) is Senior Editor, Technology for Library Journal.
