Most academic librarians stepping into a position can model their work on that of their predecessors. But not Thomas Padilla. Since his appointment in April as the first humanities data curator at the University of California, Santa Barbara (UCSB) Library (and the first in the entire University of California system), Padilla has had to draw on a number of different disciplines to shape his role: working with data throughout its life cycle, creating a support plan for digital humanities researchers, and providing research data consultation. Formerly the digital scholarship librarian at Michigan State University Libraries, Padilla is pioneering a new niche for academic librarians—and one that may become increasingly common.
LJ: What do you do as humanities data curator?
Thomas Padilla: I get that question a lot. To a certain extent it’s still a role that is evolving. I see it as a combination of a digital humanities and a data curation librarian. I emphasize the digital humanities component because I think that makes it easier to engage with researchers, faculty, and students. Having a disposition toward not only the curation but also the acquisition, analysis, and representation of the data is really helpful because it provides different opportunities for engaging people in the course of exploring or formulating questions.
In terms of what that looks like in a concrete way, it could be any number of things, from helping [researchers] acquire materials to evaluating the structure of the data to see whether it affords the possibility of asking certain types of questions. So if, for example, they wanted to do a mapping project, or a network analysis project, part of the role is helping them understand the affordances of the content they have in hand. And part of that is translating their individual disciplinary competencies so that [they have] purchase in a digital environment with materials they may not be accustomed to working with.
On the data curation side, it’s about helping researchers understand how to document the process of working with data to explore certain types of questions, so that other researchers can understand what they’ve done.
How did you come to work in digital humanities and data?
I was halfway through a graduate degree in history [at San Francisco State University] and I thought that I might be interested in archives or libraries as an additional career path. I was able to get an internship through the Hispanic Association of Colleges and Universities (HACU)…doing archival processing for a special collection at the National Archives in St. Louis, and I got bit by the archives/libraries bug.
Through the same organization I was able to get a series of internships, one at the Library of Congress (LC) in the Educational Outreach division, and then a couple more in a program [at LC] called Digital Preservation Outreach and Education, which eventually was converted to a full-time position. I think it was that nexus of my humanities background and…exposure to digital archives and digital preservation work that led me to library school. I went to the University of Illinois, and had a number of other opportunities there to do digital preservation research.
At Michigan State I started as digital humanities librarian, and the liaison to linguistics and philosophy. Eventually my role was broadened to digital scholarship writ large. And now I’m here.
How does the treatment of humanities data differ from science data?
It’s an interesting sort of tension, because in the sciences a lot of the conversation around documentation focuses on the notion of reproducibility in research. But that sort of paradigm doesn’t always map to the humanities. One of [the researchers at UCSB] brought up the notion of repeatability. It’s not necessarily that another researcher should be able to reproduce your exact results, but that whatever you’re describing, someone else should be able to do that [via their own path].
So it’s helping [researchers] understand how to document their process and transformations, communicate the choices they’ve made in manipulating the data or creating a certain type of resource so that it’s understandable by other people, and weigh some of the factors that might affect their argument.
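As a concrete sketch of what that documentation can look like, the hypothetical example below keeps a running provenance log alongside a small data-cleaning step, recording not just what was changed but why. The dataset, field names, and helper function are all illustrative, not any particular project's workflow.

```python
# Illustrative sketch: logging each transformation applied to a dataset,
# so another researcher can follow (and repeat) the process by their own path.
import csv
import datetime
import io

provenance = []  # a running log of the choices made while shaping the data

def log_step(description, rationale):
    """Record one transformation along with the reasoning behind it."""
    provenance.append({
        "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
        "step": description,
        "rationale": rationale,
    })

# A tiny hypothetical table of letters, with one uncertain date
raw = "author,date\nAusten,1811\nAusten,[1813?]\n"
rows = list(csv.DictReader(io.StringIO(raw)))

cleaned = []
for row in rows:
    date = row["date"].strip("[]?")  # strip editorial brackets from uncertain dates
    cleaned.append({"author": row["author"], "date": date})

log_step("Stripped editorial brackets from uncertain dates",
         "The mapping tool requires numeric years; uncertainty is noted here instead")

for entry in provenance:
    print(entry["step"], "->", entry["rationale"])
```

The point of the log is that the choice (and its cost: the uncertainty marker is gone from the data itself) travels with the dataset rather than living only in the researcher's memory.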
An additional component is considering what needs to be done to make that data continue to have value after a particular project. It’s a natural line from the documentation angle to discussion about metadata and file format sustainability, and selection of repositories that provide persistent access.
Working with data also involves ethical issues. What sorts of challenges have you come across in your work?
Trying to help researchers work through what might happen when they decide to clean data in a certain type of way—for example, in the push toward a certain type of visualization or representation, how they might maintain the diversity that’s represented in the data while still reaching their end goal.
I immerse myself in what those issues might be, and how they square with certain types of questions that a scholar might want to ask. Then I help them through the process of identifying those challenges.
The other component of the ethical argument [is] related to potentially capturing information that is questionable, whether or not you have the right to capture and share it at will—trying to explore those questions and bring them to the fore. Some of those are teaching opportunities [such as] Twitter content—an exploded view, a conversation about all the different data that’s represented in each tweet in addition to the 140 characters: geolocation, relationship networks among different actors, things like that.
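That "exploded view" can be made concrete with a simplified record like the one below. The field names are loosely modeled on what Twitter's API returns but are not an exact schema; the values are invented for illustration.

```python
# A hypothetical, simplified tweet record showing how much data travels
# with the visible text. Field names are loosely modeled on Twitter's API
# responses, not an exact schema; all values are invented.
tweet = {
    "text": "Excited for #DayOfDH!",
    "created_at": "2016-04-08T14:02:11Z",
    "geo": {"coordinates": [34.4140, -119.8489]},   # the author's location
    "user": {"screen_name": "example_user", "followers_count": 512},
    "entities": {
        "hashtags": ["DayOfDH"],
        "user_mentions": ["another_user"],           # edges in a relationship network
    },
}

# The 140 characters are only a fraction of what gets captured and shared.
hidden_fields = [k for k in tweet if k != "text"]
print(hidden_fields)
```

Walking through a record like this makes the ethical question tangible: consenting to publish 140 characters is not obviously the same as consenting to have one's location and social graph collected and redistributed.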
I’d like to make more immediately transparent and identifiable all the labor that goes into the creation of digital library collections, or datasets that are used by digital humanists in their research. That’s more of an ethics tied to labor—trying to recover that and acknowledge it. Because it’s a lot of work, and it’s a significant intellectual contribution that can go unnoticed.
You’ve mentioned Documenting the Now as a project with a lot of interesting components. Could you elaborate?
It’s such a vital contribution. When I was at LC one of the first things that really sparked my imagination was the notion of web archiving. In my historian mode, I was trying to consider what historians 30, 40 years from now would have to study if they wanted to know about contemporary life at the turn of the 21st century. What would their primary sources look like? I remember thinking that the various web archiving initiatives were hugely important—and still are—because they’re aiming to make sure those primary sources are there for future generations to study. Social media content like Twitter, for example, has become increasingly important and operates in this liminal space. So something like DocNow focusing on capturing that content, but also with an eye toward ethics and community engagement, is super important.
Another organization that I think is important, and operates in a similar space, is called WITNESS. A lot of [its] advocacy focuses on trying to empower community organizations and activists throughout the world, often those resisting state-sponsored suppression, to capture video documentation of abuse in such a way that it protects the identity of the people captured but still stands up in a court of law.
How do you think DH practices and the expanding role of data are going to change collection practices in academic libraries?
I would hope for increasing recognition of data produced by humanities faculty as potential acquisitions. To some extent I see a clear line of connection between existing [collection development policies], either in subject areas, special collections, or university archive collections [and such data acquisition]. We have a precedent for some of this work in faculty papers acquisition, where increasingly we get whole computers and lots of email.
That’s where we see the archives community really innovating in this space. Trying to figure out how to deal with email, for example: how do you process, arrange, describe, and provide meaningful access to thousands of emails? Peter Chan at Stanford has a project that leverages a number of different tools to do processing of email. Also Mary Elings’s group at the library at Berkeley has developed ArchExtract, which does something similar.
A second component is tied to reconsidering practices for how we document the digital library collections that we have so they can be more easily treated as data that can be studied at scale. What might need to change if you were to provide bulk access to a large collection versus item-by-item access? I’m informed by the work of Miriam Posner, Roopika Risam, and Victoria Stodden. Those three form a nice framework of consideration for how we might think about documenting and providing access to that content.
In a pragmatic way, it forces us to think about the formats we provide content in. For example, it’s relatively traditional to generate a bunch of PDF images of archival resources. But we could also generate a whole bunch of plain text documents, just the OCR derivatives, for people interested in doing text analysis. And the access components—is it possible to allow people to create subsets of the overall collection based on certain parameters and download all of them at once? Should we provision an API? Some libraries do this—the University of British Columbia has a beautiful digital collections interface with an API, and even an API query builder built into it, so there’s something in-browser that can teach a user how to interact with the collections, which is kind of neat.
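The kind of parameterized, bulk access described above might look something like the sketch below. The endpoint, parameters, and collection name are entirely hypothetical, standing in generically for digital collections APIs like the one UBC offers, not reproducing any real library's interface.

```python
# Hypothetical sketch of data-level access to a digital collection: a query
# URL that requests a subset of items as plain-text OCR derivatives rather
# than page images. The endpoint and parameter names are invented.
import urllib.parse

def build_query(collection, fmt="plaintext", date_range=None):
    """Build a query URL for downloading a subset of a collection in bulk."""
    params = {"collection": collection, "format": fmt}
    if date_range:
        params["date_from"], params["date_to"] = date_range
    return "https://library.example.edu/api/items?" + urllib.parse.urlencode(params)

# A researcher doing text analysis wants a decade of OCR text, not page images:
url = build_query("regional-newspapers", fmt="plaintext", date_range=("1900", "1910"))
print(url)
```

The design point is the contrast with item-by-item interfaces: a single request describes the whole subset, so the researcher gets a corpus rather than ten thousand individual downloads.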
A lot of our digital library design over the past 20 years has been focused on emulating analog item-by-item interaction. You have image zooming, you have page turners, and things like that. But it’s not really helpful for someone who wants to download 10,000 books or something larger.