This is not your father’s history.

This post was written by guest blogger David McMillen, External Affairs Liaison in the Department of Strategy and Communications.

Conventional wisdom is that the appraisal process for electronic records is the same as for paper. Richard Pearce-Moses made that statement in his 2006 Presidential Address to the joint meeting of Archivists in DC. Randall Jimerson quoted him in Archives Power.

However, conventional wisdom isn’t always all we need to know.

Buried under Richard’s statement is the assumption that research with electronic documents will be much like research with paper, and we are beginning to see signs that suggest new research methods might change the way we think about what ought to be kept.

Last December, Jean-Baptiste Michel and his colleagues at Harvard published “Quantitative Analysis of Culture Using Millions of Digitized Books” in Science. Their work has been dubbed “culturomics.” You can also read about this research in a New York Times article and at www.culturomics.org.

In short, Michel and his colleagues have created a tool that allows you to examine the use of single words or phrases across 200 years of books digitized as part of the Google Books project. This can be as simple as looking at the use of terms like “the North” and “the South” during the nineteenth century:

the North vs. the South – From the corpus American English

Michel and his colleagues use the tool to look at the transformation of verbs across time such as burnt/burned:

burnt vs. burned – From the corpus American English

Or snuck/sneaked:

snuck vs. sneaked – From the corpus American English

To show that Jimmy Carter was more popular than Marilyn Monroe:

Jimmy Carter vs. Marilyn Monroe – from the corpus American English

And to show the effect of Nazi censorship on Marc Chagall:

Marc Chagall – from the corpus German

They have even provided a tool where you can run your own experiments: http://ngrams.googlelabs.com/
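Under the hood, the tool is doing frequency counting over the digitized corpus. The raw ngram datasets behind the viewer are distributed as tab-separated rows of the form ngram, year, match_count, volume_count, and a comparison like burnt vs. burned reduces to computing each word’s share of the combined count per year. A minimal sketch in Python, using made-up illustrative counts rather than real corpus data:

```python
from collections import defaultdict

# Rows follow the raw ngram data layout: (ngram, year, match_count, volume_count).
# These counts are invented for illustration, not real corpus figures.
rows = [
    ("burnt", 1850, 900, 120),
    ("burned", 1850, 600, 100),
    ("burnt", 1950, 300, 80),
    ("burned", 1950, 1200, 300),
]

def relative_share(rows, word):
    """Fraction of the combined yearly count claimed by `word`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ngram, year, match_count, _volumes in rows:
        totals[year] += match_count
        if ngram == word:
            hits[year] += match_count
    return {year: hits[year] / totals[year] for year in totals}

share = relative_share(rows, "burnt")
# In this toy sample, "burnt" dominates in 1850 (0.6) and fades by 1950 (0.2),
# mirroring the burnt/burned transition the authors chart.
```

The viewer adds smoothing and normalizes against the total words published each year, but the core idea is this simple tally.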

Interesting, you say, but what does this have to do with archives?

If the way historians are going to use our collections changes, we might want to reconsider what we keep. It may be that keeping all 200 million emails from the Bush White House was not such a bad idea. It will provide a rich database for an analysis of the terms and topics discussed over those eight years. How often was terrorism discussed before September 11, 2001?
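A question like that reduces to a frequency count of a term over dated messages. A minimal sketch, using a hypothetical handful of (date, text) pairs in place of the real corpus (a real analysis would stream millions of messages from the archive):

```python
from collections import Counter
from datetime import date

# Hypothetical sample messages standing in for an email archive.
emails = [
    (date(2001, 3, 14), "Briefing on budget priorities"),
    (date(2001, 9, 20), "Response to terrorism threats"),
    (date(2002, 1, 8), "Counter-terrorism task force minutes"),
]

def mentions_by_year(messages, term):
    """Count messages per year whose text mentions `term` (case-insensitive)."""
    counts = Counter()
    for when, text in messages:
        if term.lower() in text.lower():
            counts[when.year] += 1
    return dict(counts)

trend = mentions_by_year(emails, "terrorism")  # {2001: 1, 2002: 1}
```

With the full collection retained, the same tally could be run for any term a future historian thinks to ask about, which is precisely the argument for keeping the whole corpus.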

It has always been the responsibility of archivists to keep one eye sharply focused on the past and the other on the future. That job keeps getting harder every day, and more important too.

7 thoughts on “This is not your father’s history.”

  1. I respectfully disagree with the comment, “Buried under Richard’s statement is the assumption that research with electronic documents will be much like research with paper, and we are beginning to see signs that suggest new research methods might change the way we think about what ought to be kept.”

    One of the questions in appraising paper records is whether the information is in a form that will be usable. I was taught (many) years ago that observational data was often discarded if it was well summarized in a report. Part of the problem was that analysis of the original data was so labor intensive that few people would ever undertake it. (Clearly there were exceptions to this general rule.)

    Use (or potential use) has always been a factor in what should be saved. When speaking of electronic records, I have used the question of observational data as an example of a traditional appraisal question that has a different answer in the digital era. Now that the data is in electronic form, it’s much easier to analyze – hence there’s a stronger argument to save it.

    This is true even in my own research. It would have been much more difficult for me to compile the Glossary without access to rich text on the Internet, and I was able to use search results to get a sense of which of several variant forms of a term was prevalent.

    I continue to believe that – at an abstract level – what we do remains the same in the digital era. How we do it – and the outcomes – will certainly be different. At the New Skills colloquium, Catherine Stollar and Thomas Kiehne took some issue with my statement, putting a different spin on it. They argue that why archivists do what they do remains the same, but how they do it changes. I think we’re getting at the same point, but the rephrasing throws the essence into relief.

    I absolutely agree with Archivist Ferriero that this rise of digital humanities and digital scholarship offers many novel approaches for appraisal. Archivists should celebrate and welcome these new approaches, and they must become familiar with these new approaches so they can support the researchers.

    — RPM

  2. The past is easier to see than the future. We shouldn’t let our appraisal decisions be driven too strongly by the type of media on which a record is stored, or by how well the latest research technologies can exploit it. If we discount the value of a record because today’s methodologies and tools afford access to just a portion of the information it contains, we risk a great deal. We humans have a poor track record of foreseeing technology breakthroughs, and such breakthroughs can reveal even greater value in the records we keep. Limiting the historical value we assign based on the effectiveness of the latest research techniques and technology seems a bit short-sighted.

    Perhaps you are advocating that we be more generous in appraisal evaluations in the light of increasingly available and affordable tools capable of performing analyses on a massive scale? How will the next wave of research technology breakthroughs change what we see to be valuable?

    1. Hi Ricc. Thanks for your comment. I think you and I are on the same page, and I think we both would agree with Richard’s last paragraph.
      The great value in the Twitter archive at the Library of Congress will be when we have tools like ngrams to mine the social trends revealed in the lexicon of Tweets.

  3. Archivists and their colleagues in computer science and engineering are looking at similar tools to facilitate the appraisal, description, preservation, and access to electronic records. A number of NARA’s Research Partners are developing or testing such tools.

    Our Research Partners at UNC are developing tools to map the geospatial coverage of record collections. (http://ci-ber.blogspot.com/2011/02/visualization-of-gis-records-in-ci-ber.html)

    Our Research Partners at TACC have conducted experiments using data mining and visualization tools to describe and analyze large collections (millions of files) of electronic records to help with description, access and preservation. See for example:

    Assessing the Preservation Condition of Large and Heterogeneous Electronic Records Collections with Visualization (http://www.ijdc.net/index.php/ijdc/article/viewFile/162/230)

    Enabling Data-Intensive Research and Education at UT Austin via Cloud Computing – LIFT Progress Report (https://www.utexas.edu/cio/itgovernance/lift/articles/cloud-computing-update.php)

    A Window on the Archives of the Future:
    TACC partners with the National Archives to find solutions to the federal government’s digital records challenge (http://www.tacc.utexas.edu/news/feature-stories/2011/a-window-on-the-archives-of-the-future/)

    1. How has NARA leveraged all that research? Are there examples of direct, practical application of the research to NARA’s appraisal, description, preservation, and access programs?

      1. Hi JC,

        So far the research I referenced above is in its early days, but previous research has been used extensively at NARA. For example, the original requirements for the Electronic Records Archives (ERA) were derived from research NARA supported at DARPA and the San Diego Supercomputer Center (SDSC).

        Our Research Partners at GTRI have developed a number of tools that have been put to direct use at NARA. For example, they have developed tools for automated content summarization, document type identification, and redaction that NARA uses. (See http://perpos.gtri.gatech.edu/publications/TR%2009-05-Final%20Report.pdf and http://www.archivists.org/conference/sanfrancisco2008/docs/session505-Clement.pdf)

        GTRI’s work with automated file type identification is leading to improvements in DROID (PRONOM), one of the most widely used tools for file type identification. NARA uses this technology in the ERA system. (See http://blogs.archives.gov/online-public-access/?p=3737)

        These are just a few examples of how NARA has leveraged and continues to leverage our research. Watch our Tech Tuesdays blog for more examples over the coming weeks and months (http://blogs.archives.gov/online-public-access/?cat=52) or check out our Facebook page (https://www.facebook.com/NARACAST).

        Hope this helps.

        Mark

  4. The fact that I was involved in digital humanities back in the 1970s, and came to digital archives with a keep-the-corpus mentality, may have something to do with the fact that the TACC work for ERA is being guided by one of my first PhD students. I still have that mentality: consider the DEA case against Dr. Armando Angulo, just dismissed because the DEA couldn’t store the 2 terabytes of data in evidence against him. This argues for pushing a concern for big data to a point before accession, at appraisal and scheduling; and/or NARA could engage to be the cloud for agencies that can’t afford a few hundred dollars for storage.
