November 21, 2017

Developer Maps Library Photo Archives

Portions of the New York Public Library’s (NYPL) “Photographic Views of New York City, 1870’s–1970’s” collection have been available online for several years. But views of the collection have mushroomed during the past two weeks, thanks to the launch of OldNYC.org, a website that overlays photo locations on a Google Maps interface, enabling visitors to explore the collection by zooming, dragging, and clicking their way around an online map of the city. The new site was independently created by software engineer Dan Vanderkam using the Google Maps API, data provided by NYPL, and open source photo and text extraction programs that he wrote himself and has made available on GitHub.

“We love this project…. We played a supporting role, but this is totally the work of [Vanderkam], extending the spectrum of knowledge creation that the library has always supported,” said Ben Vershbow, director of NYPL Labs.

Vershbow added that the map interface “is immediately familiar to everyone. I love that, because that photo collection has been online for a decade. It’s not like it’s a new digitization project. It’s something that’s been up there online all along, and it’s been used, but now [usage] is going crazy.”

Within a week and a half of OldNYC’s May 20 launch, the site had been shared on Facebook more than 24,000 times and had drawn coverage from major news outlets including the New York Times, The Guardian, Newsweek, and Gizmodo. Vanderkam told LJ that millions of photographs have been viewed on the site.

“I think people are finding old photos of their apartment or their office, or someplace that they grew up, and they share it with friends of theirs, and it spreads,” he said regarding the site’s popularity.

Older San Francisco

OldNYC.org is the second site that Vanderkam has developed for navigating a library’s collection of digitized photographs. In December 2008, when he was living in San Francisco and working for Google, Vanderkam searched the San Francisco Public Library’s (SFPL) online San Francisco Historical Photograph Collection for old photographs of his neighborhood. Using nearby intersections as search terms, he managed to track down a photo that had been taken from his own roof 63 years earlier.

“It was really amazing to look at—to see the changes that had happened [to the neighborhood] over the years,” Vanderkam said. But he also noticed a problem—one of the streets that defined the photo’s location and made it searchable by keyword was actually mislabeled.

“It was one intersection over,” he said. “If the photo had been correctly labeled, it never would have come up in the search results for me. So it struck me at that point that using keyword search is not the right way to browse this [collection]. Really, the right way is a map, where you can look at your block, and then start looking around it for other areas that you’re familiar with.”

And so, during the next 18 months, Vanderkam built OldSF.org in his spare time, using Google’s Geocoding API to map about 13,000 photos from SFPL’s collection. By the time he felt the site was ready to go public in 2011, he was living in New York.

“It seemed like the natural thing to do, to build a New York version,” Vanderkam said.

With OldSF.org working smoothly, Vanderkam assumed that it would be a relatively simple matter for him to use the Geocoding API and Google Maps interface to create a similar New York City site for the NYPL’s collection. And things seemed to be going well when NYPL Labs emailed him a spreadsheet with cross street information and other metadata for all of the collection’s 30,000+ images that had been digitized.

He had a version of the new site up and running within two weeks, “but it was pretty bad,” he said.

Troubles mounting

Unlike the SFPL’s collection, the original photographs in the NYPL’s collection are all preserved on uniform 15×11 inch manila mounts or acid-free archival boards, typically with the address or location inscribed on the front of the mount, and the context, date, creator, and other information captioned on the back. This was ideal for interfiling when the photos were used as a physical resource, and the uniform size of the mounts later simplified digital capture of the collection, according to NYPL. However, while Vanderkam was able to map the images and display them, the images themselves—particularly in thumbnail form—tended to be overwhelmed by a huge manila background.

The collection’s photos are mounted on backing paper. Image: Irma and Paul Milstein Division of United States History, Local History and Genealogy, The New York Public Library. (1933). Celebrations - Parades - Municipal events - N. R. A. [National Recovery Administration] Day.

“Most of what you saw was just the brown backing paper,” he said. Finding a way to extract the photos would be a crucial next step, he decided. “I don’t think the site would be very compelling without it. That was the first challenge, and the first thing that made it clear to me that OldNYC would be different than OldSF.”

After considering the use of a complex computer vision algorithm with edge detection to pick out the photos, Vanderkam decided that a simpler approach would be to write his own algorithm that defines the mounting paper background in each image file, blacks that out, and then records what remains as an image. The approach proved effective, even for image files that included multiple photographs on a single mount. Vanderkam describes the process in a 2013 blog post, and he has posted the source code on his GitHub page for other projects that may find it useful.
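The general idea can be illustrated with a minimal Python sketch. It is not Vanderkam's code: it assumes the mount color can be estimated from the border pixels of the scan, treats any pixel within a hypothetical `tolerance` of that color as background, and reports a bounding box for each connected region that remains. The function names, the thresholds, and the nested-list image format are all illustrative choices.

```python
from collections import deque

def mount_color(img):
    """Estimate the mount color as the per-channel median of the border pixels."""
    h, w = len(img), len(img[0])
    border = [img[y][x] for y in range(h) for x in range(w)
              if y in (0, h - 1) or x in (0, w - 1)]
    mid = len(border) // 2
    return tuple(sorted(p[c] for p in border)[mid] for c in range(3))

def find_photos(img, tolerance=40, min_area=4):
    """Return (left, top, right, bottom) boxes for regions that are not mount.

    img is a list of rows, each row a list of (r, g, b) tuples.
    tolerance and min_area are assumed values, not taken from the article.
    """
    h, w = len(img), len(img[0])
    bg = mount_color(img)
    # True where a pixel differs enough from the mount to count as photo.
    fg = [[max(abs(img[y][x][c] - bg[c]) for c in range(3)) > tolerance
           for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not fg[y][x] or seen[y][x]:
                continue
            # Flood-fill one connected region and track its extent.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right, area = y, x, y, x, 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and fg[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:  # ignore specks such as dust or stains
                boxes.append((left, top, right, bottom))
    return boxes
```

Because each connected region is handled separately, a scan with several photos on one mount yields one box per photo, which matches the multiple-photo case the article mentions.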

Later, Vanderkam decided to try using optical character recognition on the captions on the back of the mounts, which are saved in separate image files. Upon discovering that the open source Ocropus OCR program worked best when he manually cropped out the large, blank background surrounding the captions, he wrote a separate program that applies the Canny edge detector and a rank filter to these images to automate the cropping process. This is described in a more recent blog post, and the resulting source code is also available on GitHub.
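A rough sketch of the cropping step follows, again in Python and again only an illustration. It assumes an edge map has already been produced by an edge detector such as Canny (edge detection itself is not implemented here), applies a 3×3 rank filter to drop isolated specks, and crops to the box around what remains. How Vanderkam's program actually uses its rank filter is not described in the article, so the filter size, the rank, and the `pad` parameter are assumptions.

```python
def rank_filter_3x3(edges, rank=4):
    """Replace each pixel with the rank-th smallest value in its 3x3 window.

    edges is a list of rows of 0/1 values. Rank 4 (of 0..8) is the median.
    Neighbors that fall outside the image count as 0.
    """
    h, w = len(edges), len(edges[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = sorted(
                edges[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0
                for ny in (y - 1, y, y + 1)
                for nx in (x - 1, x, x + 1))
            out[y][x] = window[rank]
    return out

def crop_box(edges, pad=1):
    """Return (left, top, right, bottom) around the filtered edges, or None."""
    h, w = len(edges), len(edges[0])
    cleaned = rank_filter_3x3(edges)
    ys = [y for y in range(h) if any(cleaned[y])]
    xs = [x for x in range(w) if any(cleaned[y][x] for y in range(h))]
    if not ys:
        return None  # nothing but background survived the filter
    # Pad the box slightly so characters at the edge are not clipped.
    return (max(xs[0] - pad, 0), max(ys[0] - pad, 0),
            min(xs[-1] + pad, w - 1), min(ys[-1] + pad, h - 1))
```

The filter keeps dense clusters of edges, as text tends to produce, while a lone edge pixel in the blank margin is discarded and no longer stretches the crop.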

Memories of New York

Separate from the complexity of the photo and caption extraction processes, another key difference between OldNYC and OldSF is the addition of comment widgets to the New York site, Vanderkam said.

“With OldSF, the site is read-only. You go there and look at photos, and that’s it. With OldNYC I wanted to involve users a little bit more.… If you’re looking at a photo, and you know something about a building—if you used to work there, or you know that the building was torn down or is still there but used in some other capacity, you can leave that as a comment, and everyone will be able to see it. That’s something I really want to encourage everyone to do. It makes the site better for everyone if that knowledge gets shared.”

There is an “Error? Send Feedback” link that gives visitors the opportunity to help fix any problems that may have occurred during an automated process. And for photos with OCR’d captions, a “Typos? Help Fix Them” link directs visitors to the image file of the caption with a simple input box that allows them to submit corrections. Vanderkam said he’s been “thrilled” with how many people have helped already, with more than 8,000 OCR corrections submitted since the site’s launch less than two weeks ago.

“I was assuming that there would be a small enough volume that I would be able to look over them one-by-one,” he said. People don’t seem to be abusing the corrections feature, though (and all users have access to an image of the original caption).

While the ceaseless tearing down and rebuilding of New York is certainly part of OldNYC’s appeal, showing users glimpses of the past where modest shops and walkups have been replaced by modern skyscrapers, the city’s ever-changing landscape has also posed some problems for Google Maps that Vanderkam is still working to address. For example, in the 1940s, intersections between 1st Avenue and Avenue C, and between 14th Street and 23rd Street, were eliminated with the construction of Stuyvesant Town and Peter Cooper Village, a massive development of 110 residential buildings and park space.

“I’m sure those cross streets are covered in the collection, but because intersections like 18th and Avenue A don’t exist anymore, the geocoder doesn’t know what to do with them,” Vanderkam said. “Thinking about future directions for the site, that is one thing that I would like to fix—figure out how to geocode addresses that no longer exist.”

He added that it would be “fantastic” if someone with expertise in mobile apps were interested in optimizing OldNYC.org for smartphones.


Matt Enis (menis@mediasourceinc.com; @matthewenis on Twitter) is Senior Editor, Technology for Library Journal.
