National Archives Digitization Tools Now on GitHub

As part of our open government initiatives, the National Archives has begun to share applications developed in-house on GitHub, a social coding platform. GitHub is a service that software developers use to share and collaborate on software projects, including many open source projects.

Over the last year and a half, our Digitization Services Branch has developed a number of software applications to facilitate digitization workflows. These applications have significantly increased our productivity and improved the accuracy and completeness of our digitization work.

We shared our experiences with these applications with colleagues at other institutions such as the Library of Congress and the Smithsonian Institution, and they expressed interest in trying these applications within their own digitization workflows. We have published two digitization applications, “File Analyzer and Metadata Harvester” and “Video Frame Analyzer,” on GitHub, where they are now available for use by other institutions and the public.

File Analyzer and Metadata Harvester

This application functions like a digitization Swiss Army knife. It lets a user analyze the contents of a file system or external drive and generates statistics about the directories it contains. It can generate checksum values to verify the bit-level integrity of files after they have been copied to a new device. After a collection of files has been converted from one digital format to another, it can verify that there is a one-to-one match between the before and after files. For the 1940 Census project, NARA’s Digitization Services scanned and indexed 3.9 million images that will be published online. This application was critical to ensuring that each original file was accounted for in the final set of files to be published.
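To illustrate the two checks described above, here is a minimal Python sketch. It is not the File Analyzer's own code (which is written in Java); the function names, the choice of SHA-256, and the rule of matching converted files by their relative path without the extension are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_checksum(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_copies(source_root, copy_root):
    """Report files that are missing, unexpected, or altered in the copy."""
    source = checksum_manifest(source_root)
    copy = checksum_manifest(copy_root)
    return {
        "missing": sorted(source.keys() - copy.keys()),
        "unexpected": sorted(copy.keys() - source.keys()),
        "altered": sorted(
            name for name in source.keys() & copy.keys()
            if source[name] != copy[name]
        ),
    }


def unmatched_conversions(before_root, after_root):
    """Compare two trees by relative path with the extension removed.

    Returns (names only in before, names only in after); both lists are
    empty when every original has exactly one converted counterpart name.
    Assumes names are unique once the extension is dropped.
    """
    def names(root):
        root = Path(root)
        return {
            path.relative_to(root).with_suffix("").as_posix()
            for path in root.rglob("*")
            if path.is_file()
        }

    before, after = names(before_root), names(after_root)
    return sorted(before - after), sorted(after - before)
```

Calling compare_copies("originals", "backup") on two hypothetical directories returns three lists; a clean copy yields three empty lists. Comparing checksums rather than file sizes or dates is what makes the check bit-level.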

The File Analyzer can also import data from an external spreadsheet or finding aid, so that its results can be matched and merged with that auxiliary data.
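As a rough illustration of matching and merging, the sketch below joins two CSV exports on a shared column. It assumes, purely for the example, that both the analyzer results and the spreadsheet are saved as CSV and share a column named "filename"; the function name and the added "matched" column are hypothetical and are not part of the File Analyzer.

```python
import csv
import io


def merge_on_key(results_csv, auxiliary_csv, key="filename"):
    """Attach auxiliary columns to each results row that shares the same key.

    Both arguments are open text files (or any iterable of CSV lines).
    Every results row is kept; "matched" records whether auxiliary data was found.
    """
    auxiliary = {row[key]: row for row in csv.DictReader(auxiliary_csv)}
    merged = []
    for row in csv.DictReader(results_csv):
        extra = auxiliary.get(row[key], {})
        combined = dict(row)
        for column, value in extra.items():
            if column != key:
                combined[column] = value
        combined["matched"] = bool(extra)
        merged.append(combined)
    return merged


# Usage with in-memory data standing in for two exported files.
results = io.StringIO("filename,size\na.tif,100\nb.tif,200\n")
finding_aid = io.StringIO("filename,title\na.tif,Roll 1\n")
for merged_row in merge_on_key(results, finding_aid):
    print(merged_row)
```

Rows without a counterpart in the spreadsheet are kept and flagged rather than dropped, so gaps in the finding aid stay visible.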

The GitHub repository for the “File Analyzer and Metadata Harvester” contains additional information about this application.

Video Frame Analyzer

This application analyzes the technical properties of individual frames of a video file in order to detect quality issues within digitized video files. The quality issues that might arise vary from collection to collection, so the application allows the user to configure the tests to be performed against a file and to tailor those settings to a specific collection. Staff in the AV Preservation Lab saw a 50% reduction in the time that it took to perform quality checks. The quality checks changed from purely subjective criteria to objective criteria plus a manual review of suspect files. NARA shared a prior version of this application with the Smithsonian Institution, and they saw similar results.

Courtney Egan from NARA’s Digitization Services is scheduled to give a presentation on the use of this application to the Association of Moving Image Archivists in November.

The GitHub repository for the “Video Frame Analyzer” contains additional information about this application.

Both the “File Analyzer and Metadata Harvester” and the “Video Frame Analyzer” were developed by Terry Brady, Information Technology Specialist for the National Archives, in consultation with staff from the Digitization Services Branch. Terry has recently left the National Archives, but we would like to thank him for his important work in developing these applications and making them available on GitHub. The National Archives hopes that these applications will not only be useful, but will also be enhanced by the larger community of cultural institutions.

7 thoughts on “National Archives Digitization Tools Now on GitHub”

Looking for documentation on how to deploy the software… where can I find that?

Jessie says:

October 21, 2011 at 11:34 am

Hi Hugh, thank you for bringing that up. Currently, the available documentation consists of the overview and the Javadoc API. We may add more in the future.

If you are looking for the finished product, the executable is located under the bin folder of each repository. You can simply download the .jar file and start playing with it.

If you are interested in the source code, please feel free to fork the repository (instructions are on the GitHub help pages) to obtain your own copy.

Hope this helped, and thank you for the interest!

Hi Hugh- If you’re just trying to download and run the software, here’s what you need to do:
On GitHub, go to the “bin” directory, click on the jar, then click “raw” to download the file.

Here’s a little more info:
The applications are stored in a jar file.

Java is the executable that runs. Java reads the jar file and then runs the code.

Hope that helps!

Also there’s this bit of info-
The basic File Analyzer application is deployed as a self-extracting jar file. The application requires Java SE 1.6 or higher to be present on the user’s workstation. The application can be launched by double clicking the jar file.

If additional runtime memory is needed when running the application, a simple windows bat file can be created to launch the application with a larger memory allocation.
java -mx1000m -jar fileAnalyzer.jar

I pulled that from the help documentation available on the following page-
https://github.com/usnationalarchives/File-Analyzer/blob/master/doc/NARA%20File%20Analyzer%20and%20Metadata%20Harvester.doc

Hi,

Here at EPA, we are also looking into using GitHub. Our security folks had a few questions, and I was wondering if we could schedule a time to talk and ask about your experience using GitHub.

Please call me anytime: 202-566-0522 (Tues-Thurs) and 410-595-6213 (Mon & Fri)

Thanks,

Sam Bronson,
Office of Environmental Information (OEI)

I just ran the file analyzer on a folder of files in which there are two JPG files that have TIF extensions. File Analyzer incorrectly identified these as TIF.
Based on this quick test it seems to me that the FA is only looking at the file extension and not otherwise analyzing the files – is this correct?
How are you handling this need for verifying that the files are indeed what they are labeled as? DROID or JHOVE?
Thanks for your response.

Hi Kari,

Yes – the File Analyzer tool is only looking at the file extension and not internal signatures or magic numbers. As for how we verify our digital products, it depends. We recently wrapped up work on improving our quality control across the Digitization Services Branch (see this NARAtions post: http://blogs.archives.gov/online-public-access/?p=7076). We have a number of tools that we can use depending on the file format, including JHOVE/JHOVE2 and DROID. We also have commercially produced tools like Interra Systems Baton for video and audio files. We are still developing our workflows and we plan to make that information available in the future on our Products and Services website (http://www.archives.gov/preservation/products).

– Kate
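For readers wondering what a signature check involves, the following Python sketch compares a file's extension with its leading "magic number" bytes. It is only an illustration and is not how the File Analyzer, DROID, or JHOVE work internally. It recognizes just the JPEG signature and the two classic TIFF byte-order signatures (not BigTIFF or any other format), and all names in it are hypothetical.

```python
from pathlib import Path

# Well-known leading bytes for the two formats in question.
JPEG_MAGIC = b"\xff\xd8\xff"
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian classic TIFF

EXTENSION_FORMATS = {
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".tif": "tiff",
    ".tiff": "tiff",
}


def sniff_format(header):
    """Guess the format from the first bytes of a file, or return None."""
    if header.startswith(JPEG_MAGIC):
        return "jpeg"
    if header.startswith(TIFF_MAGICS):
        return "tiff"
    return None


def extension_mismatch(path):
    """Return (claimed, detected) if the extension disagrees with the signature.

    Returns None when they agree or when either one is not recognized.
    """
    path = Path(path)
    claimed = EXTENSION_FORMATS.get(path.suffix.lower())
    with open(path, "rb") as handle:
        detected = sniff_format(handle.read(4))
    if claimed is not None and detected is not None and claimed != detected:
        return claimed, detected
    return None
```

A JPEG that has been renamed with a .tif extension, as in the question above, would be reported as ("tiff", "jpeg"). Dedicated identification tools go much further, with large signature registries and format validation.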
