February 17, 2018

The Library of Congress Twitter Archive #Hmmm #Skeptic | Peer to Peer Review

For years I’ve been struggling to resolve a couple of mutually incompatible truths:

  1. Libraries are all about preserving access to valuable information.
  2. Libraries (even in the aggregate) have no hope of preserving access to anything more than a vanishingly tiny sliver of the valuable information that is created in the world.

Now, I recognize that the second statement above will seem like an exaggeration. “Sure,” I can hear some readers thinking, “no single library can keep and take care of more than a tiny sliver of the world’s useful information, but there are hundreds of thousands of libraries in the world. Surely if we all just coordinated our efforts….”

For those who are thinking this way, I recommend a glance at a 2010 report titled “How Much Information?,” which was published by the University of California, San Diego, as the result of a collaborative project undertaken by a group of corporate and academic partners. Here’s a pull quote from the paper’s executive summary:

In 2008, the world’s servers processed 9.57 zettabytes of information, almost 10 to the 22nd power, or ten million million gigabytes. This was 12 gigabytes of information daily for the average worker, or about three terabytes of information per worker per year.

A couple of things to bear in mind while looking at those numbers: first, 2008 was five years ago, and the pace of information production has not slowed down; second, these calculations include only the subset of the world’s information that passes through enterprise servers—i.e., computer systems run by companies, offices, universities, etc.

Let’s say, for the sake of argument, that only one byte in every 100,000 of this information is actually valuable. That would mean (assuming I’ve sorted my decimal places correctly, and if I haven’t, I hope a commenter will correct me) that for libraries to preserve access to the world’s output of useful information, we would have to capture, organize, and preserve roughly 100 million gigabytes of information each year. Even if preservation on that scale were feasible as a distributed worldwide project, the prospect of managing and coordinating that project is overwhelming.
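
For anyone who wants to check the decimal places, here is a minimal sketch of that arithmetic in Python. It assumes decimal units (1 gigabyte = 10^9 bytes, 1 zettabyte = 10^21 bytes) and takes the report’s 9.57-zettabyte figure as the yearly total. The function name and the "one byte in N" parameter are hypothetical, introduced only for illustration. Because the result is very sensitive to the fraction chosen, the sketch prints two cases: one byte per gigabyte and one byte per 100,000.

```python
# Decimal (SI) units are an assumption; the report's summary does not say
# which convention it uses.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21

# 9.57 zettabytes, written with integers to avoid floating-point rounding.
TOTAL_BYTES_2008 = 957 * BYTES_PER_ZB // 100


def valuable_gigabytes(total_bytes: int, one_in: int) -> int:
    """Gigabytes worth keeping if one byte in every `one_in` bytes is valuable."""
    return total_bytes // one_in // BYTES_PER_GB


if __name__ == "__main__":
    # Two hypothetical fractions: one byte per gigabyte, one byte per 100,000.
    for one_in in (10**9, 10**5):
        gb = valuable_gigabytes(TOTAL_BYTES_2008, one_in)
        print(f"1 byte in {one_in:,}: about {gb:,} GB per year")
```

Under these assumptions, one byte per gigabyte works out to about 9,570 gigabytes a year, while one byte per 100,000 works out to about 95.7 million gigabytes, which is the order of magnitude of the 100-million-gigabyte figure.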

So here we are, back at the uncomfortable truth: we can’t handle more than a tiny, tiny sliver of the world’s information output. What does this imply?

One thing I believe it implies is that we might want to raise an eyebrow at the Library of Congress’s (LC) decision, a few years ago, to become the permanent archive of Twitter.

In a white paper released in January 2013, LC laid out its rationale for taking this step. Sort of. In the paper, LC explained “Why the Twitter Collection Is Important to the Nation’s Library” thus:

As society turns to social media as a primary method of communication and creative expression, social media is supplementing and in some cases supplanting letters, journals, serial publications and other sources routinely collected by research libraries.

Archiving and preserving outlets such as Twitter will enable future researchers’ access to a fuller picture of today’s cultural norms, dialogue, trends and events to inform scholarship, the legislative process, new works of authorship, education and other purposes.

Well, no argument there. The problem is that this statement answers an easy question (“Why is a Twitter archive worthwhile?”) but avoids the harder question, which is, “Why is a Twitter archive more worthwhile than the other projects that will not be undertaken because limited resources are being directed to the Twitter archive?”

When I ask myself why we should archive Twitter, I come up with several possible answers, none of which is very satisfying to me.

One possible answer is: because we can. (But think of what we can’t archive because we’re archiving Twitter.)

Here’s another: because we really don’t know what will and won’t be useful in the future. Why exclude such a rich trove of content from our national archive when much of what looks useless today could turn out to be tremendously useful tomorrow? (But we could also say that about all the non-Twitter outputs that we’re not archiving.)

And another: because, taken in the aggregate, the Twitter stream reveals interesting and useful information about how and when certain topics become important in public discourse—despite the apparent banality of its individual components. (But is Twitter really the best source of this information, and therefore the source most worthy of LC’s limited resources?)

And another: because what’s happening now with Twitter will have serious ramifications for what happens with the future of public discourse. (This may well be true, and if so it suggests that Twitter is an important topic of study. It does not, however, suggest that every tweet needs to be archived—or, more to the point, that it needs more urgently to be archived than the other things we could be archiving with those resources.)

Please understand: I’m one of the many people who think it’s remarkably cool that LC is creating a Twitter archive. But I would also think it was cool if the Library of Congress created a comprehensive clawhammer banjo collection or archived every commercially released reggae recording. The bottom line is that coolness isn’t the same thing as importance, and importance is a less relevant property than relative importance. In other words, the urgent question isn’t whether this project has value (of course it does) but rather “where on the priority list of valuable projects should this one fall?”

Of course, without knowing how much LC is spending in terms of labor, overhead, and capital equipment on this project and without knowing what other opportunities are waiting in the wings for the privilege currently being accorded the Twitter archive, it’s very hard to form a reasonable opinion about whether LC is using its limited resources wisely in this case. On August 7, I sent an email to LC’s director of communications, asking these two questions:

  1. Can you tell me how many full-time equivalent positions (or maybe how many person-hours) per week are dedicated to managing the Twitter archive?
  2. Can you give me an idea of the server capacity that is currently dedicated to managing this archive? What percentage (an estimate is fine) of LC’s total server capacity is absorbed by this content?

So far I’ve gotten no response. If and when I do, I’ll pass the answers along via the comments section. In the meantime, I’ll remain somewhat skeptical. Archiving Twitter is clearly a sexy and headline-grabbing move, as well as one that will probably offer real value to future scholars and researchers. I just wish I were more confident that it will offer more value than some of the other things LC could have done with the same resources.

About Rick Anderson

Rick Anderson (rick.anderson@utah.edu) is Associate Dean for Collections & Scholarly Communication at the University of Utah’s J. Willard Marriott Library. He serves on numerous editorial and advisory boards and is a regular contributor to the Scholarly Kitchen blog. He currently serves as president of the Society for Scholarly Publishing, and a collection of his essays titled Libraries, Leadership, and Scholarly Communication was published this year by ALA Editions.


  1. Matt Enis says:

    Wouldn’t you agree, though, that Twitter has been instrumental in organizing and documenting the Green Revolution in Iran a few years ago, the protests that led to the recent overthrow of the Egyptian government, and several other key events during the Arab Spring?

    I agree with your comment on “the apparent banality” of a lot of tweets that originate in stable western democracies. But during the past four years, it has already become a hugely disruptive communication tool for people living under governments that attempt to restrict the flow of information.

    • Rick Anderson says:

      Matt, I certainly would agree. Documenting the disruptive effect of Twitter on governments that attempt to restrict the flow of information is one of many valuable impacts of LC’s archive. The problem, though, is that you can’t (or at least shouldn’t) allocate scarce resources based only on value arguments. You also have to take into account costs. So the relevant question here is not “Does the Twitter archive provide a valuable window on [important topic of your choosing]?” The relevant question is “Is this the best way for LC to allocate its scarce resources, given the alternatives?” The answer to that question may well be yes. But it’s hard to say, without knowing how resources are being allocated to this project and what other projects are in the backlog.

      We in libraries have an unfortunate tendency always to argue from value, which is understandable–it’s always possible to come up with a reason why everything we do is valuable. If resources were unlimited, this would be the right way to think. But they aren’t, so we have to make hard choices, and we won’t make those choices well unless we take costs (including opportunity cost) into account as well.

  2. Rick Anderson says:

    I did hear back from Gayle Osterberg at LC, with answers to both of my questions. I repeat them below, each followed by her response:

    1. Can you tell me how many full-time equivalent positions (or maybe how many person-hours) per week are dedicated to managing the Twitter archive?

    There is not a budget or FTE figure associated with any one specific archive as our personnel and technology resources span all the Library’s collections. Building the archive itself has been set up to be almost entirely automated, so very little manpower is required to ingest the tweets. Where we are focusing effort now is determining what system can be developed to provide meaningful research access to the archive, at a cost that is reasonable for a public institution. There is a team of three Library staff who work on this project, among other work.

    2. Can you give me an idea of the server capacity that is currently dedicated to managing this archive? What percentage (an estimate is fine) of LC’s total server capacity is absorbed by this content?

    Less than 1 percent of the Library’s server capacity is utilized by the Twitter archive.

    As well, since I have had the benefit of reading your piece, I would add that the value of the Twitter archive to the Library goes beyond Twitter only. It is also a valuable pilot project in helping the Library evaluate big data collections and their potential use.

    (I particularly like Gayle’s point about the value of the Twitter archive as a test bed for future big data projects. That’s an upside to the initiative that I did not consider.)

  3. Tony Aponte says:

    Was wondering what other alternatives/things do you think LC could be doing with the resources devoted to the Twitter archive?

    • Rick Anderson says:

      That’s a fair question, Tony, and of course I can’t really answer it without knowing what kinds of materials and collections are in LC’s backlog awaiting digitization. It’s possible that there are no analog collections in their vaults that are more important than the Twitter archive. I kind of doubt it, but the fact that I don’t know for certain is the reason I label myself “skeptical” about this project rather than an opponent of it.