Walter E. Valero, Claudia A. Perry, & Thomas T. Surprenant describe a full-scale digital project using Greenstone open source digital library software
Tom Surprenant, son of a tugboat captain on the New York State Canal system, spent a great deal of time in his youth traveling by tug around the state. That interest led to a collection of approximately 700 postcards that provides colorful insight into the canal system, along with an abiding desire to share the collection with a wider audience.
Little did we know that the goal of creating an electronic visual resource entitled “Waterways of New York” would become an intimate collaborative adventure. Along the way we found that a small project can be accomplished with extremely limited money, equipment, and time.
We did it with Greenstone open source software, and we think our experiences will be useful for small to medium-sized libraries looking to digitize small collections.
Novice to expert
Each of us has been involved in teaching digitization classes, coordinating a digitization special interest group, participating in a Greenstone software workshop, and/or working on a small grant-funded digitization project. We are hardly experts. Nonetheless, we all wanted to attempt the practical application of what we had learned. We also wanted to create a database that could serve as a visual resource for one of the major historical themes of New York State.
As a student, Walter Valero conducted an in-depth content-analysis of a subset of the postcard collection for a research project requirement at the Queens College Graduate School of Library and Information Studies. He quickly became attached to the cards and sought to apply his theoretical skills to the creation of a digital database.
Claudia Perry wished to gain experience with open source software and to develop further metadata skills through a real-life activity. It didn’t take long to realize that expanding upon what Valero had accomplished, pooling our skills, and collaborating would lead to the creation of a viable database and a useful teaching platform. We spent the spring 2006 semester developing a pilot program, and our experience was validated by a well-received presentation at the Digitization Special Interest Group of the Metropolitan New York Library Council (METRO) in June 2006.
Replies on a postcard
Postcards are information-rich and lend themselves easily to description through a wealth of metadata. Valero chose just over 100 postcards and scanned, processed, and inserted them into the database. Because the postcards chosen were out of copyright (before 1923), we avoided a thorny intellectual property issue. Whatever the content of your database, we suggest the following:
- Make sure that the collection you choose is as information-rich as possible
- Use only one two-dimensional medium; mixing media will seriously complicate the entire process
- Select materials that are out of copyright; this will greatly simplify intellectual property issues
- Establish a pilot project that will help you firm up your process, procedures, and metadata creation.
The case for Greenstone
As they say, “free isn’t half bad.” A key attraction of Greenstone—as with any open source software—is its ready availability and lack of direct cost [see “Using Greenstone,” p. 6].
Besides a standard PC or Mac computer, all the hardware you need is a scanner. The technology has reached the point where you can get an excellent flatbed scanner for under $150. Some of the newer scanners are thin and have a small footprint. For more complicated projects, other options (e.g., overhead scanner, dedicated slide scanner) need to be considered. We recommend the Canon CanoScan LiDE 600F or the equivalent. It is lightweight (5.7 lbs.) and portable, which is a real advantage if you are scanning from more than one location.
The software that came with the Canon was more than sufficient to scan both color and black-and-white postcards. It loaded easily on both the PC and Mac and is highly intuitive. We also recommend Adobe Photoshop CS2 (or other recent versions). There is no doubt that this is the premier photo manipulation software and that it is a highly complex program.
We are firm believers in “just in time” learning. For our project, all that we needed to learn was how to make basic contrast adjustments, set resolution and compression specifications, and save in .tif and .jpg formats. Photoshop CS2 automatically allows you to view a lot of technical metadata that is embedded in your images, some of which Greenstone also will automatically gather and display. Eventually, you can anticipate that much more metadata will be automatically captured and displayed in digital collection management software. Having access to accurate technical information is vital over the long term; ultimately, you can save a great deal of time using this program.
In reality, basic collection development questions need to be answered before you begin, e.g., “What is the goal of the database?” and “Who is the target audience?” The “So what?” question also has to be asked at every step of planning; it acts as a valuable reality check. We recommend:
- Select a collection that is information-rich and manageable
- Select a small segment of the collection for a pilot project that will be used to establish a production routine and validate the metadata
- Create a standard 8.3 file format naming convention (accession number—e.g., e0001ar1.tif) using all lower-case for maximum cross-platform compatibility
- Create and maintain up-to-date planning documentation and specifications based on existing best practices
- Use the Dublin Core 1.1 Metadata Schema (or latest update) in conjunction with documentation on best practices and the Library of Congress (LC) Thesaurus for Graphic Materials, or develop your own thesaurus
- Create/maintain a style manual that records any decisions, with rationale, that are exceptions to recommended practices
- Appoint a quality control supervisor who will be the final arbiter (in consultation with other key participants).
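The file naming recommendation above can be checked mechanically before files go into a collection. The following is a minimal sketch, not part of our original workflow: it assumes names of up to eight lower-case letters or digits followed by a .tif or .jpg extension, and the function names are hypothetical.

```python
import re

# Assumed pattern: 1 to 8 lower-case letters or digits, a dot, and a
# lower-case .tif or .jpg extension, e.g. e0001ar1.tif.
NAME_PATTERN = re.compile(r"[a-z0-9]{1,8}\.(tif|jpg)")


def is_valid_name(filename: str) -> bool:
    """Return True if filename follows the assumed 8.3 lower-case convention."""
    return NAME_PATTERN.fullmatch(filename) is not None


def find_bad_names(filenames):
    """Return the names that break the convention, in input order."""
    return [name for name in filenames if not is_valid_name(name)]


if __name__ == "__main__":
    # Illustrative file names only.
    batch = ["e0001ar1.tif", "E0002AR1.TIF", "postcard 12.jpeg"]
    print(find_bad_names(batch))
```

Running a check like this over each scanning batch catches upper-case or overlong names early, before they cause trouble on a different platform.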
Once you’ve decided to take the plunge, revisit your test database repeatedly in order to work out the kinks. We came in with a plan for the project, but the postcards had their own plan, which necessitated some changes as we went along. Working out the process is the hardest part; once it is established, it becomes a known routine, which is why a test database is so important. We recommend:
- Scan images at 600 dpi (dots per inch) and save as uncompressed TIFF files (these are your archival master files)
- Via Photoshop, create JPEG access derivatives that will be viewed on a CD or web site (this will vary with the size of your originals; for postcards, we suggest saving at 150 dpi, at medium compression, and 1000 pixels on the long side)
- Create backup files of both TIFFs and JPEGs
- Import JPEGs into the Greenstone database using the Greenstone Librarian Interface (Greenstone automatically creates thumbnail images for you)
- Add metadata; double-check for accuracy and consistency
- Have sample users review the database for ease of use
- Customize interface (optional)
- Edit/revise as necessary in an iterative cycle.
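The scanning figures above translate into simple arithmetic that is useful when planning storage. The sketch below assumes a postcard of 3.5 by 5.5 inches scanned in 24-bit color; both figures are assumptions for illustration, so measure your own originals. It only computes sizes. The actual resizing in our project was done in Photoshop.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a scan of the given physical size."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px, height_px, bytes_per_pixel=3):
    """Approximate size of the raw image data (24-bit RGB by default)."""
    return width_px * height_px * bytes_per_pixel


def derivative_size(width_px, height_px, long_side=1000):
    """Scale so the longer side equals long_side, keeping the aspect ratio."""
    scale = long_side / max(width_px, height_px)
    return round(width_px * scale), round(height_px * scale)


# Assumed postcard size: 5.5 x 3.5 inches, scanned at 600 dpi.
master = scan_pixels(5.5, 3.5, 600)      # (3300, 2100)
master_bytes = uncompressed_bytes(*master)
access = derivative_size(*master)        # (1000, 636)

if __name__ == "__main__":
    print(master, round(master_bytes / 1_000_000, 1), "MB", access)
```

Under these assumptions each archival TIFF holds roughly 20 MB of image data, so a few hundred masters plus their backups will occupy several gigabytes.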
Project time line
We took 110 hours to complete our project. The steps we took are easy to list, but the list does not indicate the relative time each part may take. The table at left shows how much time we estimate was required by each segment of our project. We probably underestimated the time spent.
We created our own thesaurus and customized the format of the Greenstone interface; in retrospect, use of an existing controlled vocabulary would have greatly streamlined our process. Regardless of your methods, you can expect that your metadata will occupy the largest chunk of your time. Limiting the amount of detail in your metadata will reduce the time spent, but it may also limit the utility and accessibility of the database as the project expands. This trade-off should be addressed directly during the evaluation of the pilot project.
Good documentation is paramount for the success of any digitization project. We recorded decisions regarding all aspects of the operation in detail, which saved a lot of time and allowed us to move quickly when important decisions needed to be made or revisited.
While the Dublin Core Metadata Best Practices (see Link List) contains important guidance, we deviated for practical reasons. For example, the version of Greenstone available at the time of the project used an earlier version of Dublin Core, and substantial customization would have been necessary to adhere to then-current standards. Each deviation was recorded along with its rationale. In applying the Dublin Core, we realized that although its rules should not be set aside lightly, the Core should not be viewed as absolute. The same held true for the thesaurus. A great deal of effort went into developing our own, but LC’s Thesaurus for Graphic Materials (TGM) loomed large in our decision process, and we are now reviewing our thesaurus to make it more consonant with TGM. Working from an existing subject set, adjusted to meet our needs, is important to the use and expandability of our database.
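For readers who have not seen a Dublin Core record, the sketch below builds a simple one with Python’s standard library. It is an illustration only: the field values are hypothetical, the wrapping element name is arbitrary, and this is not the format in which Greenstone stores metadata. The namespace is that of the Dublin Core element set, version 1.1.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Build a <record> element with one dc:* child per (element, value) pair."""
    record = ET.Element("record")  # wrapper name chosen for illustration
    for element, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return record


# Hypothetical values for illustration only.
fields = [
    ("title", "Sample canal postcard"),
    ("identifier", "e0001ar1"),
    ("type", "StillImage"),
    ("format", "image/jpeg"),
    ("subject", "Canals"),
]
record = build_record(fields)
xml_text = ET.tostring(record, encoding="unicode")

if __name__ == "__main__":
    print(xml_text)
```

Keeping the fields as a plain list of pairs allows repeated elements, such as several subject terms for one card.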
Our experience illustrated the importance of the four “Rs”: review, revisit, refine, and revise. Just when we were confident about the validity of what we had accomplished, our current students brought us up short. Their scrutiny made us realize that many eyes are better than just a few. In a small library setting, this means asking a user group or individuals not directly involved with the project to serve as a reality check. In our case, over 35 students challenged our decisions and identified problems that we had not observed.
Among other changes, we added terms, adjusted the thesaurus to align more closely with TGM, and developed sample templates to ensure consistency in the creation of descriptive metadata. The students’ active questioning of the entire procedure strengthened the database, and opening the process to such feedback reinforced the value of collaboration. We recommend:
- Create and maintain good documentation for every major decision, especially as it affects the Dublin Core standards, Best Practices, and/or existing subject thesauri
- Treat the Dublin Core and/or existing subject thesauri as strong guides but be willing to make changes that will strengthen your database
- If possible, use a standard thesaurus; this will save a great deal of time
- Constantly review, revisit, refine, and revise
- Be collaborative and flexible; the more people you have reviewing what you have done and the greater your willingness to adjust, the stronger your final product will be
- Make trade-offs that will balance an information-rich environment with the time necessary to complete a project. This is why an understanding of your projected audience is so critical.
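Part of the review cycle recommended above can be automated. The following minimal sketch flags subject terms that do not appear in a controlled vocabulary, so a reviewer can decide whether to correct the record or extend the thesaurus. The vocabulary, record identifiers, and terms shown are hypothetical, and the comparison ignores case and extra spaces by assumption.

```python
def normalize(term):
    """Collapse whitespace and case so near-duplicates compare equal."""
    return " ".join(term.split()).casefold()


def check_subjects(records, vocabulary):
    """Return {record_id: [subject terms not found in the vocabulary]}."""
    allowed = {normalize(term) for term in vocabulary}
    problems = {}
    for record_id, subjects in records.items():
        unknown = [s for s in subjects if normalize(s) not in allowed]
        if unknown:
            problems[record_id] = unknown
    return problems


# Hypothetical vocabulary and records for illustration only.
vocabulary = ["Canals", "Tugboats", "Bridges", "Aqueducts"]
records = {
    "e0001ar1": ["Canals", "tugboats"],
    "e0002ar1": ["Canal locks", "Bridges"],
}
problems = check_subjects(records, vocabulary)

if __name__ == "__main__":
    print(problems)
```

A report like this does not replace human review; it simply narrows the list of records the quality control supervisor needs to look at.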
Getting the word out
After all is said and done, it is likely you will have created a database that is dynamic, not static. We began working with Greenstone to assist us in the teaching/learning process. It quickly became apparent that we should make what we created available to the public. Queens College and the Graduate Library School agreed to host the online version, “Waterways of New York” (see Link List). Fall 2006 library students have processed an additional 140 Erie Canal postcards that eventually will be added to the original database. This segment of the collection also will be hosted on METRO’s server. METRO’s generosity will allow us to provide students with experience in using CONTENTdm digital collection management software. We plan to explore the differences in searching and using the same digital content mounted on two different software suites.
In addition, we are working with the Erie Canal Museum in Syracuse to fill in some of the geographical gaps in coverage and hope to incorporate its extensive collection (approximately 5000 cards) in the future. “Waterways of New York” covers more than just the Erie Canal.
In a cooperative project with the Ellenville Public Library and Museum, NY, we have scanned and processed 70 Delaware and Hudson Canal postcards as well. In the spring, we intend to add cards from the Champlain and Black River canals. We also are exploring collaboration with educators who will use the database as a source for teaching/learning modules at all levels. But it doesn’t end there. We hope that this database will expand with the help of postcard collectors, libraries, museums, and organizations dedicated to the waterways of New York. We expect this rich visual resource will be useful to teachers, researchers, and all those interested in the history of the Empire State.
Using Greenstone
Greenstone is a digital collection management software suite produced by the New Zealand Digital Library Project at the University of Waikato and developed and distributed in cooperation with UNESCO and the Human Info NGO. It has been widely used around the world, especially in developing countries. The Greenstone web site (see Link List) includes extensive documentation, software downloads, FAQs, and examples of digital library projects that use Greenstone. Downloading and setup on a PC is quite easy; for the Mac, it is more complicated but doable. Mounting it on a server is straightforward but requires the assistance of your system administrator. In addition to supporting a web presence, the software can be used to produce a searchable CD-ROM-based database. The number of downloads from the Greenstone site per month, at 4500, suggests that there is a substantial user base. However, getting answers to questions beyond the FAQs takes some dedication. But what else is new? Getting help with leased/purchased programs is no picnic either. We recommend you become well acquainted with the Greenstone web site:
- Take a good look at the example databases
- Download and install the program on a PC or Mac before moving to a server
- Be sure to use the documentation (From Paper to Collection and the Developer, Install and User Guides)
- Download and go through the tutorials
- Join the Greenstone users electronic discussion list.
Link List
Digital Imaging Best Practices
Dublin Core Metadata Best Practices Version 2.1
Handbook for Digital Projects: A Management Tool for Preservation and Access. NEDCC, 2000.
Introduction to Imaging, Revised Edition
Queens College GSLIS
Thesaurus for Graphic Materials I: Subject Terms
“Waterways of New York”