Harvard Library collections include documents and other “office-like” material that will be preserved in the Library’s preservation and access repository, the Digital Repository Service (DRS). As a first step towards providing support for this material in the DRS, the Library contracted Paul Wheatley Consulting Ltd. in late 2014 to assist with the analysis. The goals of the analysis were to produce:

  • Recommended word processing formats to accept and prefer for the DRS
  • Recommended technical metadata schema(s) to use for files in word processing formats
  • DRS content models for these objects
  • Recommendations for enhancing Harvard Library’s FITS tool to better support these objects

The driving principles of this work were to:

  • Provide interoperability with the existing metadata schemas and workflows of the DRS
  • Provide sufficient metadata for long-term preservation of word processing objects
  • Adhere to existing standards where possible
  • Propose simpler models over more complex ones where possible

The analysis covered three areas: formats, metadata, and tools. After Paul Wheatley completed his analysis, Harvard Library carried out additional tool testing. The deliverables from this work are included on this page.

Format Analysis

Format matrix - Word Processing Files
This spreadsheet compares word processing formats according to preservation criteria.
The column headings are shaded according to the importance of each criterion to Harvard Library (red: very important, orange: somewhat important, yellow: somewhat unimportant, green: unimportant).
The cells are shaded to visually represent the value for the format (green is good, yellow is neutral, red is bad).
Paul Wheatley

Format profile - Apple iWork Pages v04.docx

Format profile - ePUB v04.docx

Format profile - Microsoft Office Binary Word Document v04.docx

Format profile - ODT v02.docx

Format profile - OOXML Document v05.docx

Format profile - RTF v03.docx

Format profile - Wordperfect v02.docx

These documents are brief profiles of the word processing formats that were under consideration for acceptance in the DRS.
Along with descriptive information about the format, they include a summary of risks and potential strategies for mitigating the risks. 
Paul Wheatley

Metadata Analysis

Metadata approach
This document explains the rationale for recommending particular metadata fields to be captured for word processing documents for Harvard Library's preservation use case.
Paul Wheatley

Metadata summary
For each recommended metadata field, this spreadsheet records whether it would be an addition to DocumentMD, the preservation rationale for capturing it, and Tika support for extracting it.
Paul Wheatley

Tool Analysis and Testing

Tool assessment
This document provides an overview of the tool assessment activities that were carried out, along with specific recommendations on how to proceed.
Paul Wheatley

Other Resources

Preservation approach
This document provides a summary of risks, a categorization of preservation formats, thoughts on preservation strategy, and recommendations for further research and testing.
Paul Wheatley