Today’s post comes from Jon Fletcher, Archives Specialist (Data Standards) in the Digital Public Access Branch of the Office of Innovation.
A great deal of work goes into preparing digitized records for inclusion in the National Archives Catalog. Each step along the lifecycle of digitized records lays the foundation for the next. It is a multi-party process that entails creative collaboration across the National Archives (NARA), and it is an important component in making access happen.
The term lifecycle is often used to describe how records begin as unprocessed collections and later become rehoused, arranged, and described (cataloged). Once records are described, digitization becomes possible. The existence of digitized records provides the basis for the creation of digital objects, which pair the metadata generated during the description stage with digital image files. To prepare for their eventual import into the National Archives Catalog, staff at custodial units across NARA package digital objects in a manner that is compliant with specific technical and authority (controlled vocabulary) requirements. The resulting data set is then transferred to the Office of Innovation for post-processing work.
Post-processing begins with a number of compliance checks. Just as an airplane must go through a preflight checklist prior to takeoff, all incoming digital objects must be confirmed to be Catalog-ready. Compliance prerequisites include, but are not limited to, technical standards for image files, conformity to authority rules, and file-naming conventions that convey how the data is to be structured. Once compliance is confirmed, the remaining post-processing work can begin. Digital image files are copied from their source, uploaded to a cloud server, and paired to their corresponding descriptions. Pathways are generated pointing to the new cloud-based image file locations. At this point, the data set as a whole is ready to be transformed.
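To make the preflight idea concrete, here is a minimal sketch of an automated compliance check in Python. The allowed file types and the `<identifier>_<page>` naming pattern are assumptions invented for illustration, and `check_file` and `preflight` are hypothetical helper names. None of this reflects NARA's actual requirements or tools.

```python
import re
from pathlib import PurePath

# Hypothetical rules, for illustration only; not NARA's actual requirements.
ALLOWED_EXTENSIONS = {".jpg", ".tif"}
NAME_PATTERN = re.compile(r"^[A-Za-z0-9-]+_\d{3}$")  # e.g. "rg59-box12_001"


def check_file(filename: str) -> list[str]:
    """Return the problems found for one image filename (empty list means it passes)."""
    path = PurePath(filename)
    problems = []
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {path.suffix or '(none)'}")
    if not NAME_PATTERN.match(path.stem):
        problems.append("name does not follow the <identifier>_<page> pattern")
    return problems


def preflight(filenames: list[str]) -> dict[str, list[str]]:
    """Run every check on every file and report only the files that fail."""
    report = {}
    for name in filenames:
        problems = check_file(name)
        if problems:
            report[name] = problems
    return report
```

An empty report from `preflight` would mean every file passed these two checks. A real checklist would also cover points the sketch leaves out, such as the technical standards for the image files themselves and conformity to authority rules.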
What is transformation? This is a second-tier post-processing step that restructures data, converting it from human-comprehensible language to machine-comprehensible language based on instructions given by each custodial unit. To illustrate, instructions may be provided to structure a single collection into two distinct parts: one consisting of preservation originals and the other of public access copies. Another example of structuring data into logical sections would be the grouping of certain image files together. For example, file-naming conventions may give preliminary instructions that “this group of nine image files should be grouped together to form a nine-page document,” but ultimately this is accomplished by means of structuring data during the transformation process.
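The nine-page example can be sketched in a few lines of Python. This is only an illustration: it assumes a hypothetical `<document>_<page>.<extension>` file-naming convention, and the function names and output shape are invented here rather than taken from the Catalog's real import format.

```python
import json
from collections import defaultdict


def group_pages(filenames):
    """Group image files into documents using a hypothetical <document>_<page>.<ext> name."""
    documents = defaultdict(list)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]           # drop the file extension
        doc_id, page = stem.rsplit("_", 1)      # split off the page number
        documents[doc_id].append((int(page), name))
    # Sort pages numerically so that page 10 follows page 9, not page 1.
    return [
        {"document": doc_id, "pages": [name for _, name in sorted(pages)]}
        for doc_id, pages in sorted(documents.items())
    ]


def to_machine_readable(filenames):
    """Serialize the grouped structure as JSON, one possible machine-readable form."""
    return json.dumps(group_pages(filenames), indent=2)
```

Given nine files named `report_001.jpg` through `report_009.jpg`, `group_pages` returns a single entry whose `pages` list holds all nine files in order. The instruction implied by the file names becomes explicit structure in the data.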
The result of this transformation work is that instructions received from staff across NARA become “structured, machine-readable instructions.” It is both a human process and an automated process. The human component entails analyzing instructions provided by NARA staff and structuring data to accurately reflect them. The automated component entails reformatting the resulting machine language script so that it is compliant with the current needs of the National Archives Catalog. Any technical errors in the resulting structured data will be pointed out by backend tools for repair before or during the import process. The end product, a machine language script of structured data encompassing an entire set of digital objects, is then imported into the Catalog.
If transformation as a concept seems abstract, consider the difference between a standard paper printer and a 3D printer. While a standard paper printer is capable of conveying comprehensible concepts through “flat” image and language, a 3D printer can, with the proper instructions, transform concepts into physical objects that have carefully structured dimensions. Images and language can effectively convey concepts, but dimensional objects can potentially offer more concrete representations of those same concepts in certain applications. Both printer types rely on a set of instructions for output, but only the 3D printer’s output allows for tactile and visual examination of the object and its many dimensions and sub-components. This is possible because the 3D printer’s instructions account for the output’s dimensional structure.
Likewise, transformation takes a set of instructions initiated during the cataloging process, packaged as a set of flat instructions during the digital object preparation process, and further prepares it for “3D” output. The resulting output is structured data with “dimensions” that can be queried and examined as a whole, in part, granularly, or comparatively with additional queried results. While the output’s most immediate application is the National Archives Catalog platform, structured data also allows for data portability. Portability allows for the sharing of large swathes of data or individual digital objects across other electronic platforms. Another example of portability is the enabling of APIs (Application Programming Interfaces), which allow the National Archives Catalog’s robust bank of structured data to be queried and made use of externally, for which the potential applications are limitless.
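As a rough illustration of what it means to query structured data “as a whole, in part, or granularly,” the Python sketch below filters a small in-memory list of records. The records, field names, and the `query` helper are all made up for this example. It does not call the National Archives Catalog API or mirror its actual schema.

```python
# Made-up sample records; the field names are assumptions, not the Catalog's schema.
RECORDS = [
    {"id": 1, "series": "Maps", "copy": "public access", "pages": 1},
    {"id": 2, "series": "Letters", "copy": "public access", "pages": 9},
    {"id": 3, "series": "Letters", "copy": "preservation original", "pages": 9},
]


def query(records, **criteria):
    """Return the records whose fields match every keyword criterion given."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


everything = query(RECORDS)                         # the data set as a whole
letters = query(RECORDS, series="Letters")          # one part of it
access_letters = query(RECORDS, series="Letters",   # a granular slice
                       copy="public access")
```

Because every record shares the same fields, the same `query` function serves all three levels of detail. That consistency is also what would let an external application work with the data without knowing how it was originally prepared.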
Is this archive like the Wayback Machine? Does it take a snapshot of everything on the internet?
Thank you for your question. The U.S. National Archives does not take a snapshot of everything on the internet. We only collect federal records, which are records generated by the federal government and government agencies. The following paragraph from the “Our Holdings” section of the About the National Archives webpage provides some information on the magnitude of our holdings:
“NARA keeps only those Federal records that are judged to have continuing value—about 2 to 5 percent of those generated in any given year. By now, they add up to a formidable number, diverse in form as well as in content. There are approximately 10 billion pages of textual records; 12 million maps, charts, and architectural and engineering drawings; 25 million still photographs and graphics; 24 million aerial photographs; 300,000 reels of motion picture film; 400,000 video and sound recordings; and 133 terabytes of electronic data. All of these materials are preserved because they are important to the workings of Government, have long-term research worth, or provide information of value to citizens”