Archive Ingest and Handling Test (AIHT)
By both policy and design, Harvard Library's Digital Repository Service (DRS) is intended for highly "curated" digital assets; that is, those that are owned and submitted by known users, created according to well-known workflows and meeting well-known technical specifications, and in a small set of approved formats. The Library of Congress organized its Archive Ingest and Handling Test (AIHT) to investigate issues surrounding the import, export, and manipulation of a sizeable test corpus - approximately 57,000 files (12 GB) in more than 90 formats - of unknown provenance by institutions with very different preservation strategies and technological infrastructures. Harvard participated in this test along with Johns Hopkins University, Old Dominion University, and Stanford University. The results of the test highlighted the need for community agreement on descriptive and packaging standards, file transfer and validation tools, and best practices for repository operation and preservation planning.
- AIHT: Conceptual Issues from Practical Tests, D-Lib Magazine (December 2005)
- Harvard's Perspective on the Archive Ingest and Handling Test, D-Lib Magazine (December 2005)
Digital Repository Certification
The Research Libraries Group (RLG) and the National Archives and Records Administration (NARA) created a joint task force to develop criteria for identifying digital repositories that meet minimal standards for trustworthiness in the professional management and preservation of digital content. Harvard staff participated in the development of these criteria, which include both technical and organizational metrics, as well as in follow-on work by the Center for Research Libraries (CRL) to develop a specific audit methodology. Harvard staff use the audit guidelines developed by the RLG/NARA task force for self-assessment of the Digital Repository Service (DRS) and for planning its future functional and operational enhancements.
DRS Self-Assessment Project
The Self-Assessment Project began in September 2015, as part of an effort to evaluate and improve the DRS. The Assessment was undertaken by a resident from the National Digital Stewardship Residency, a program that places early-career preservation professionals in institutions across the US. Residents are assigned a specific project and are guided by a mentor.
During the project, the DRS was assessed against the benchmarks outlined in ISO 16363, an international standard for the audit and certification of trustworthy digital repositories. ISO 16363 comprises a set of metrics that address many aspects of a repository, such as financial planning and ingest processes. The resident determined whether the DRS met each metric by interviewing staff members and reviewing existing documentation. She also sought to match the documentation to the metrics, because the documentation provides the necessary evidence that the repository meets each metric. Finally, the resident identified areas needing improvement and gaps in documentation, as well as commonalities among these gaps.
DRS Storage Migration
In the spring and summer of 2017, the DRS storage was upgraded for the fourth time. A key benefit of this latest upgrade is that the DRS copies are now geographically farther apart from each other.
DuraCloud Pilot Project
During the summer of 2011, Harvard conducted a pilot project using DuraCloud, the cloud storage solution offered by DuraSpace. DuraCloud was tested as a potential additional preservation storage location for DRS content. The project ran from May to August 2011.
With funding from the Andrew W. Mellon Foundation, Harvard conducted an in-depth study of the licensing, economic, organizational, and technical issues involved in building a large-scale archive of electronic journals. Working with Blackwell Publishers, the University of Chicago Press, and John Wiley & Sons, Harvard analyzed E-journal content and technical formats, contractual arrangements under which an archive could be assembled and accessed, and who would benefit by and who should pay for archiving. Internally, a technical team studied what changes to the LDI infrastructure were required in order to archive content of this complexity and scale over extended time frames. The results of the Harvard study were widely reported and discussed at meetings of the Digital Library Federation (DLF), the American Library Association (ALA), the American Association for the Advancement of Science (AAAS), the Society for Scholarly Publishing (SSP), and the Coalition for Networked Information (CNI), among others. The April 2002 report is available from the DLF website:
As a follow-up to the 2002 Mellon Foundation-funded e-journal archiving project, Harvard's LDI program funded a collaborative project with the National Library of Medicine (NLM) to produce an open-source archival and interchange XML Document Type Definition (DTD). The DTD is designed to increase the ease of interchange between publishers and archives for article-level e-journal content. Without this DTD, the structure of e-journal content can vary widely, requiring costly human intervention and multiple parallel workflows within archival repositories. The DTD was designed after extensive document analysis in many subject domains to ensure that it does not reflect the bias of any particular academic discipline. Based on public standards, the DTD features a modular structure that allows customizing and that should be an easy target of transformation from existing XML- or SGML-encoded content. In addition to being used by NLM for the PubMed Central archive, this DTD is well positioned to become a standard format for the transfer and archival storage of scholarly literature.
Election 2012 Web Archive
As a member of the International Internet Preservation Consortium (IIPC), Harvard Library, including staff at the Harvard Kennedy School Library and at the Library Technology Services, collaborated with other academic libraries as well as non-profit and government organizations on a web archiving project. Through the Election 2012 Web Archive, we will collect and preserve web sites related to the 2012 election campaign in the United States. Our goal is to capture this important historical record to ensure long-term preservation and accessibility for teaching, research and the general public. Subject experts at the Harvard Kennedy School and at other academic institutions in areas such as political science and public policy are identifying relevant web sites for long-term preservation. For more information about the IIPC, see their web site. For more information about web archiving at Harvard, see About WAX. To search or browse Harvard’s web archive, see the WAX start page.
Global Digital Format Registry (GDFR)
Preservation activities depend upon extensive knowledge of the formats used to represent digital content. Since this same information is useful to all institutions interested in preserving their digital assets, great economies of scale can be achieved from a centralized repository for this format information. Harvard Library staff have been instrumental in articulating this concept within the digital library and preservation communities. With funding provided by the Andrew W. Mellon Foundation, Harvard and OCLC engaged in the development of a Global Digital Format Registry (GDFR), a peer-to-peer network of independent but cooperating format registries that used a common protocol to synchronize their holdings of important format documentation and technical information. Beyond the technical work of creating the software underlying the GDFR network, Harvard cooperated with the National Archives and Records Administration (NARA) in an investigation of the business and governance issues that would need to be addressed to ensure that the GDFR would remain viable over time as a core service to the preservation community.
This project is no longer active but the documentation will continue to be made publicly accessible here in recognition of the fact that this was an important project in the initiative to build a format registry for the digital preservation community.
Meetings and Presentations
- Minutes of GDFR Pilot Discussion Meeting, NARA, Washington DC, July 10, 2008. (PDF/A-1)
- "Global Digital Format Registry Progress", Presentation given by Andrea Goethals at the NDIIPP Digital Preservation Partners' Meeting, July 9, 2008 (PPT)
- "GDFR Pilot Discussion", Presentation given at the GDFR pilot discussion meeting at NARA in Washington D.C on July 10, 2008. (PPT)
- "The Global Digital Format Registry (GDFR) Project", Presentation given by Stephen Abrams and Andreas Stanescu at CNI, Washington, DC, December 10-11, 2007 (PPT)
- "Global Digital Format Registry (GDFR)", Presentation given by Stephen Abrams, University of Edinburgh, November 4, 2006 (PPT)
- "Global Digital Format Registry", Stephen Abrams and David Seaman, DLF, Berlin 2003. (PDF)
- Analysis Model (October 1, 2007) "Global Digital Format Registry Analysis Model Version 2.0". Stephen Abrams, Harvard University Library. (PDF/A-1)
- Data Model (May 22, 2008) "Global Digital Format Registry (GDFR) Data Model v.5.0.14". Stephen Abrams and Andrea Goethals, Harvard University Library. (PDF/A-1) Description of the faceted classification scheme for formats.
- Format Classification (November 9, 2007) "Classification v.1.05". Stephen Abrams, Harvard University Library. (PDF/A-1)
- Format Model and Relationships (October 23, 2007) "Format Model and Relationships v. 1.0.10". Stephen Abrams, Harvard University Library. (PDF)
- Technology Platform (January 4, 2007) "Global Digital Format Registry Technology Platform Version 0.2". (PDF/A-1)
JHOVE (JSTOR/Harvard Object Validation Environment)
An extensive technical description of the formal characteristics of a digital resource is a necessary precursor to preservation planning for or intervention on that resource. These characteristics are highly dependent upon the format used to represent the resource's abstract content. With funding from the Andrew W. Mellon Foundation, Harvard Library staff collaborated with the JSTOR Electronic-Archiving Initiative (now known as Portico) to produce an extensible tool, called JHOVE (the JSTOR/Harvard Object Validation Environment, pronounced "jove"), for automating format-specific identification, validation, and characterization of digital resources. Harvard and JSTOR have made this tool available to the wider community under an open source license, and it is widely deployed internationally. JHOVE has facilities to extract important technical characteristics of resources created in many commonly-used formats, such as AIFF and WAVE (audio); GIF, JPEG, JPEG 2000, and TIFF (still image); ASCII and UTF-8 (text); and PDF.
- JHOVE is now being maintained external to Harvard as an open source project.
LOCKSS (Lots of Copies Keep Stuff Safe)
The goal of LOCKSS is to preserve access to web-based content, primarily e-journals, by maintaining multiple copies at physically disparate locations and by conducting periodic comparisons among them to ensure that materials remain consistent, authentic, and accessible. Harvard Library staff participated in the alpha and beta development phases of the LOCKSS system.
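The idea of comparing copies held at different locations can be illustrated with a minimal Python sketch. This is not the LOCKSS protocol itself; it assumes a simple majority comparison of SHA-256 digests, and the function name `audit_copies` and the site labels are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict


def audit_copies(copies: Dict[str, bytes]) -> Dict[str, bool]:
    """Report, per location, whether its copy matches the majority digest.

    Assumption: agreement is decided by a simple majority of SHA-256
    digests, which is a simplification for illustration only.
    """
    # Compute one digest per stored copy.
    digests = {
        location: hashlib.sha256(content).hexdigest()
        for location, content in copies.items()
    }
    # The most common digest is treated as the reference value.
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    return {
        location: value == majority_digest
        for location, value in digests.items()
    }


# Hypothetical example: three sites hold the same article, one copy is damaged.
copies = {
    "site_a": b"article text",
    "site_b": b"article text",
    "site_c": b"article t3xt",
}
report = audit_copies(copies)
```

A location flagged `False` in the report would be a candidate for repair from one of the agreeing copies.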
METS Java Toolkit
This Java toolkit was created for the procedural construction, validation, and marshalling and unmarshalling for METS. METS, the Metadata Encoding & Transmission Standard, is intended to provide a standardized XML encoding for transmission of complex digital library objects between systems. While it provides standard containers and encoding mechanisms for descriptive and administrative metadata, it does not define the content or format of that metadata. However, the content and format of structural metadata is explicitly mandated within the METS specification.
The METS Java Toolkit was maintained by Harvard Library from 2004 to 2006 but has since been retired.
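To make the role of structural metadata concrete, the following Python sketch builds a very small METS document with a file section and a structural map. It uses only the Python standard library rather than the retired Java toolkit, it is not validated against the METS schema, and the function name `build_minimal_mets`, the object identifier, and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def build_minimal_mets(object_id, file_hrefs):
    """Build a METS element tree with one file group and one structural map."""
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)

    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})

    # fileSec lists the content files that make up the object.
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")

    # structMap describes how those files are organized (here: ordered pages).
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap")
    top_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": "object"})

    for index, href in enumerate(file_hrefs, start=1):
        file_id = f"FILE{index}"
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": file_id})
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href},
        )
        # Each page division points back to its file through a file pointer.
        page_div = ET.SubElement(
            top_div, f"{{{METS_NS}}}div", {"TYPE": "page", "ORDER": str(index)}
        )
        ET.SubElement(page_div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})

    return root


# Hypothetical two-page object.
mets_root = build_minimal_mets("demo-object", ["page1.tif", "page2.tif"])
mets_xml = ET.tostring(mets_root, encoding="unicode")
```

Descriptive and administrative metadata sections are omitted here because METS leaves their content and format to other schemas.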
Harvard Library, in partnership with MIT Libraries, was awarded a 2013 Laura Bush 21st-Century Librarian Program Grant from the Institute of Museum and Library Services (IMLS) to test the National Digital Stewardship Residency (NDSR) model in the Boston area.
The NDSR model was developed by the Library of Congress in partnership with the IMLS and was first piloted and continues to run in the DC area. The residency program was designed to develop the next generation of stewards to collect, manage, preserve, and make accessible our digital assets. This grant ran from June 3, 2013 - May 31, 2016. For more information about this project, see the project website.
NISO Z39.87, Data Dictionary – Technical Metadata for Digital Still Images
- ANSI/NISO Z39.87, Data Dictionary - Technical Metadata for Digital Still Images
- MIX: NISO Metadata for Images in XML Schema
Adobe's Portable Document Format (PDF) has rapidly become a de facto standard for the dissemination and presentation of electronic documents on the web. Unfortunately, the feature-rich nature of PDF permits tremendous variability in the internal structure of these documents. Further, it allows documents to be dynamically composed at the time of their display from separate external resources, which leads to significant difficulties in ensuring their long-term viability. In order to address these concerns, the International Organization for Standardization (ISO) convened a Joint Working Group to produce a constrained version of PDF suitable for archival preservation, known as PDF/A. Stephen Abrams, then Harvard Library's digital library program manager, was the project leader and document editor for the initial version of the PDF/A standard. The PDF/A standard defines the features that should be required, recommended, restricted, or prohibited in order to make electronic documents more amenable to long-term preservation.
- For more information visit the PDF/A Competence Center.
PREMIS (Preservation Metadata Implementation Strategies)
Registry of Digital Masters
To avoid unnecessary and expensive duplication of digital reformatting efforts, Harvard Library staff participated with the Digital Library Federation (DLF) in plans for a national digital registry of born-digital materials and digitally reformatted books and journals. By consulting the registry before digitization efforts are undertaken, a content owner can determine if an appropriate digital version already exists and is being preserved in a professional manner that obviates the need for local management.
Page Image Compression for Mass Digitization
In late 2006, Harvard Library, the California Digital Library, the Internet Archive, and the Bibliothèque nationale de France conducted a collaborative investigation of the use of lossy JP2 compression for mass digitization of texts. The findings are documented in the IS&T Archiving 2007 Conference Proceedings. Please consult the published paper or this preprint.
Unified Digital Format Registry (UDFR)
The UDFR is an initiative begun in April 2009 to build a single shared format registry. UDFR builds on years of work performed by a number of institutions internationally, including Harvard, on PRONOM, the Global Digital Format Registry (GDFR), and other format registry projects. The UDFR was developed at the University of California Curation Center (UC3) with funding from the Library of Congress.
Zone 1 Rescue Repository
Zone 1 is a project begun in June 2011 and funded by the Harvard Library Lab. The Zone 1 prototype rescue repository will serve as a proof of concept for a future production repository that would close the current gap in secure storage solutions at Harvard. The repository will have a low deposit barrier to ensure that valuable content at Harvard won't be lost. It will also serve as a conduit for review by the Harvard community of content for potential re-use or long-term stewardship, and it will facilitate transfer to other repositories. For more information see the Zone 1 web page.