November 21, 2017

Adam Matthew Enables Full-Text Search of Handwritten Manuscripts

Adam Matthew Digital last month announced the launch of Handwritten Text Recognition (HTR), an artificial intelligence (AI) technology that enables full-text searching of digitized, handwritten manuscript collections.

“It continues to return really remarkable results on even poor quality handwriting,” Glyn Porritt, head of Technical for Adam Matthew, an independent subsidiary of SAGE, told LJ. “We have undertaken research on samples of material and our estimates are an equivalence of 90 percent accuracy.”

Handwriting recognition technology has been used for more than two decades for purposes such as signature verification at banks and mail sorting at post offices. However, the handwritten address interpretation systems used to sort mail can narrow and validate results by relying on zip code directories and databases of known addresses. Modern notetaking apps on tablets and smartphones use machine learning to adapt to the handwriting of a device’s most frequent user. HTR is unique because it has neither of these advantages—the system is working with collections of source material written in a variety of scripts, with no pre-existing databases or transcriptions to cross-check.

Adam Matthew has been investigating the possibility of using HTR for handwritten primary source collections for a number of years, Porritt said. “However, it is in the last few years that they have seen significant progress in the development of AI technologies in this area.”

Standard Optical Character Recognition (OCR) software can fail to decipher text even in printed documents if they have uncommon typefaces, unusual spacing, stains or water damage, or fading. Even with manuscripts written in a legible, consistent hand, HTR is grappling with some combination of these issues.

“Over the years, we have certainly faced a variety of challenges regarding age and quality of typesetting having an impact on the quality of full text OCR search results,” Porritt said. “We have always invested in high quality scans, cleaned up text where necessary, and in recent years we have found software solutions to 18th century fonts and Gothic texts…. Handwriting takes this challenge to a whole new level. This is especially the case in our circumstance of working with very large volumes of manuscript material in multiple hands.”

Porritt said that Adam Matthew “had researched the prospect of providing additional [OCR] training for a certain style of handwriting or support from keyed transcriptions. However, this technology delivers search results without such additional requirements, and as a result has dramatically broken down barriers to deliver HTR for large primary source collections.”

HTR uses neural networks that train the software to recognize a wide variety of handwritten characters in their linguistic context, Porritt said. However, the system does not generate transcripts of these source documents. Instead, search results are supported by algorithms that assess the probability that the characters on a page match the words in a user’s search. Search results are displayed as snippets from the manuscript. Users then select a snippet and are directed to the page of the manuscript where the search result appears.
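The idea of searching on probabilities rather than on a fixed transcript can be illustrated with a minimal sketch. This is not Adam Matthew's implementation, which the company has not described in detail. It assumes a hypothetical recognizer that outputs, for each character position in a line, a set of candidate characters with probabilities. The sample data, the function names, and the 0.5 threshold are all invented for illustration.

```python
import math

# Hypothetical recognizer output for one manuscript line: one dict per
# character position, mapping candidate characters to probabilities.
# All values here are invented for illustration.
LINE = [
    {"t": 0.6, "f": 0.4},
    {"e": 0.7, "c": 0.3},
    {"a": 0.8, "o": 0.2},
    {" ": 1.0},
    {"t": 0.5, "l": 0.5},
    {"a": 0.9, "o": 0.1},
    {"x": 0.6, "n": 0.4},
]


def match_score(query, positions, start):
    """Geometric-mean probability that `query` is written at `start`."""
    log_sum = 0.0
    for offset, ch in enumerate(query):
        p = positions[start + offset].get(ch, 0.0)
        if p == 0.0:
            return 0.0  # the recognizer never proposed this character here
        log_sum += math.log(p)
    return math.exp(log_sum / len(query))


def search(query, positions, threshold=0.5):
    """Return (start, score) for every window that plausibly spells `query`."""
    if not query:
        return []
    hits = []
    for start in range(len(positions) - len(query) + 1):
        score = match_score(query, positions, start)
        if score >= threshold:
            hits.append((start, round(score, 3)))
    return hits


if __name__ == "__main__":
    for word in ("tea", "tax", "tan", "fox"):
        print(word, search(word, LINE))
```

In this toy data, the last three positions could be read as either "tax" or "tan". Because no single transcript is ever committed to, a search for either word finds the same spot on the page, with a lower score for the less likely reading, while "fox" returns nothing.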

Adam Matthew launched HTR last month with Colonial America, Module III: The American Revolution, which includes “intercepted letters between colonists, the military correspondence of the British commanders in the field, as well as two copies of the ‘Dunlap’ edition of the Declaration of Independence printed on the night of the 4th–5th July 1776,” Porritt said. The complete Colonial America collection, once all five modules are released, will consist of over 750,000 pages and 160 million words of original correspondence between the British government and the governments of the American colonies, 1606–1822 (CO 5 series from The National Archives, UK), making HTR a vital tool for navigating this content.

“Manuscript volumes rarely have indexes,” Porritt noted. “Keywords and metadata have traditionally brought the researcher towards the relevant document but they then have to find pertinent areas of that work themselves. With HTR technology, the user can be taken straight to a highlighted word or words.”

The team at Adam Matthew has also begun experimenting with automated keyword lookups to flag the frequency of different terms used in the collection. “We think this is just the start of opening up a range of data mining opportunities that will continue to increase in the future as we continue to develop the great potential of this technology,” Porritt said. “There is no doubt that it has an exciting future.”

In addition to Colonial America, Florence Nightingale correspondence in Adam Matthew’s Medical Services and Warfare collection is now HTR searchable, and the company is in the process of indexing over one million pages of content for its East India Company collection, which is scheduled for release in early 2018.

“Given the enthusiastic response to the HTR functionality in Colonial America, we are keen to follow this up and further enhance the research opportunities of additional manuscript content,” Porritt said. “We will be reviewing suitable content for future collections as a priority during the rest of 2018. The Mass Observation archive, for example, is one we have noted would see great benefits from HTR searchability.”

About Matt Enis

Matt Enis (menis@mediasourceinc.com; @matthewenis on Twitter) is Senior Editor, Technology for Library Journal.
