Why metadata standards
Harvard's Visual Information Access System (VIA) has been developed in response to a recommendation from the Visual Resources Task Group of the University Library Council. The report of the Task Group described wide diversity in local cataloging practices in Harvard repositories of visual materials. One of the major obstacles to providing a union public catalog of visual resources was the lack of a commonly agreed upon set of data elements and a way to communicate them which would support interoperability of metadata from different units.
Traditional library collections have benefitted from decades of national and international standards development. Visual resources collections have not taken this path. In many collections, the needs to be met by cataloging were deemed to be local ones, and the application of cataloging standards was viewed as an intrusion on local autonomy -- as a burden with no offsetting benefits. No infrastructure such as OCLC or RLIN grew up to support cross-institutional cooperation. By and large, each visual resource collection has stood alone.
Only now, as automation options proliferate and information delivery via the Web becomes a reality, has the larger visual resources community begun an effort to develop or adopt common cataloging standards. Descriptive standards (and metadata standards in general) take several forms. There are
- standards specific to topics or disciplines (such as biology or art)
- standards specific to kinds of materials (such as moving pictures or encoded texts)
- standards to support particular functions (such as discovery or rights management or presentation)
In any of these areas, metadata standards may govern
- what pieces of information are created (data dictionaries and semantics)
- how the information is formed (content standards and vocabularies)
- how the information is encoded for computer processing (syntax)
Virtually no standards govern all of these aspects of metadata.
The scope of VIA, at least initially, is limited to images and objects in the area of material culture. Scientific and medical images were excluded at this time because their metadata and access needs are likely to be different and should be evaluated separately. With the scope limited in this way, the VIA Access Task Group could focus on those standards which might be relevant to material culture. While VIA does support material culture objects, the bulk of its content will come from study collections of images -- that is, visual documents or surrogates of objects. Frequently, large numbers of surrogates exist for a single work. Think of the Mona Lisa itself and images of the Mona Lisa: a full image, her nose, her smile, an infrared image, an x-ray. Expressing this kind of relationship is a critical need for visual resource collections.
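The one-to-many relationship between a work and its surrogates, described above, can be pictured with a small data structure. The following Python sketch is illustrative only: the class names Work and Surrogate, their fields, and the sample media are hypothetical and are not taken from the VIA data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Surrogate:
    """A visual document depicting a work (hypothetical fields)."""
    view: str    # what the image shows, e.g. "full image" or "detail: smile"
    medium: str  # kind of visual document, e.g. "slide" or "x-ray"


@dataclass
class Work:
    """The object being depicted (hypothetical fields)."""
    title: str
    surrogates: List[Surrogate] = field(default_factory=list)

    def add_surrogate(self, view: str, medium: str) -> Surrogate:
        """Attach one more surrogate to this work and return it."""
        surrogate = Surrogate(view=view, medium=medium)
        self.surrogates.append(surrogate)
        return surrogate


# One work, many surrogates (sample values invented for illustration).
mona_lisa = Work(title="Mona Lisa")
mona_lisa.add_surrogate("full image", "slide")
mona_lisa.add_surrogate("detail: smile", "slide")
mona_lisa.add_surrogate("full image", "infrared photograph")
mona_lisa.add_surrogate("full image", "x-ray")

for s in mona_lisa.surrogates:
    print(f"{mona_lisa.title}: {s.view} ({s.medium})")
```

Keeping the description of the work in one place and attaching surrogates to it avoids repeating the work-level description in every image record.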
In Phase I of VIA, all participants have local collection management systems in place and already have descriptive metadata in machine readable form, so they have a significant investment in their existing metadata. The participants include art and architecture study collections, a special collection, art museums, an archive, and an archaeological/ethnographic museum. As you can imagine, the process of moving toward commonality in data elements, semantics, vocabularies, and descriptive codes across such diverse institutions will be sensitive, complex, time-consuming, and incremental.
Critical requirements in Phase I of VIA included these:
- to make the metadata from different repositories physically compatible,
- to represent the metadata accurately and effectively, and
- to provide an environment where, by seeing how their metadata interact, the participants could work toward greater agreement on the form and content of the metadata.
For these purposes, the VIA Access Group developed a data model, a list of common data elements and definitions, and a list of known vocabularies. OIS developed an XML Document Type Definition (DTD) as a transport syntax for the data model and elements defined by the VIA Access Group. The Access Group reviewed the few relevant standards which exist or are under development in this area, and incorporated them wherever possible. It is important to note that most of the relevant standards are in their infancy, and will inevitably grow, change, or fall by the wayside over the next few years.
By and large, syntax is the least difficult aspect of metadata. It is critical for machine processing, but since each system feeding VIA has its own syntax, the task was to establish a common syntax for communicating between those systems and VIA. There is no predominant syntax in use in the visual resources community.
MARC. The MARC Bibliographic Format is a superset data dictionary and syntax. While extremely rich and complex, as a data format MARC is showing its age. It is difficult to process, and the tools to create MARC from other forms of data require considerable technical expertise. The MARC Bibliographic Format does accommodate visual materials, but not in a way that fits the kind of description and access that visual resource collections provide. There are two main issues:
- As an essentially flat descriptive model, MARC does not represent well the kinds of relationships and hierarchy typical in visual resource description.
- In order to coexist with other kinds of MARC data, the visual materials format is complex in ways that visual resource collections find unnecessary and burdensome.
The MARC visual materials format tends to be used by "traditional" libraries for their videos and motion pictures, which are more analogous to print publications than slides or photographs are, and which are more integrated into the utility-based copy-cataloging processing model.
RDF. The Resource Description Framework (RDF) is an XML implementation which provides a generalized syntax for expressing metadata. The RDF specification has been under development for the past year and became a standards-track World Wide Web Consortium recommendation only last week. RDF can only express metadata defined by other standards; it does not itself define any metadata. Once RDF is stable, the VIA transport syntax can be switched to RDF.
CDWA. Categories for the Description of Works of Art is a data dictionary only (that is, it provides a list of data elements which could be relevant to the description of artworks, and corresponding definitions). It is quite a detailed and extensive list, but it applies only to artworks. There is no structure, no syntax, and no standards on what the content should be or how it should be formed.
VRA Core Categories. The Visual Resources Association Core Categories is a data dictionary explicitly developed to address the needs of slide and photograph collections. It draws on the CDWA, but adds a structural approach: there are two groups of data elements defined, one for characteristics of the work being depicted, another for characteristics of a visual document or surrogate. The VIA Data Element List draws heavily on the VRA Core, while allowing for one more layer of hierarchy: VIA allows the description of a group of works, in addition to VRA's work and surrogate. Like the CDWA, the VRA Core is not associated with any syntax, nor does it dictate how the content of the data elements should be formulated.
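The three levels just described (group, work, surrogate) nest naturally in XML. The Python sketch below shows one way such a hierarchy might be serialized. It is an assumption-laden illustration: the element names group, work, surrogate, title, and view are hypothetical and do not reproduce the VIA DTD, and the sample titles are invented.

```python
import xml.etree.ElementTree as ET


def build_record() -> ET.Element:
    """Build a group > work > surrogate hierarchy (hypothetical element names)."""
    group = ET.Element("group")
    ET.SubElement(group, "title").text = "Sample group of works"

    work = ET.SubElement(group, "work")
    ET.SubElement(work, "title").text = "Mona Lisa"

    # Each surrogate is nested under the work it depicts.
    for view in ("full image", "detail: smile", "x-ray"):
        surrogate = ET.SubElement(work, "surrogate")
        ET.SubElement(surrogate, "view").text = view

    return group


record = build_record()
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Nesting makes the relationships explicit in the record itself, which a flat list of fields cannot do without extra linking conventions.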
Both the Fine Arts Library and the Graduate School of Design participated in the recently completed testbed of the VRA Core (the VISION project). The VRA Data Standards Committee (with member and President-elect Ann Whiteside of the GSD) is evaluating the results and contemplating next steps. ALA's Association for Library Collections & Technical Services has just charged a Task Force on the VRA Core Categories (including Robin Wendler, OIS) to examine the relationship between VRA Core and AACR2 over the next two years.
Dublin Core. The Dublin Core (DC) is a short, simple list of data elements intended to facilitate discovery of electronic resources. It can be used for non-electronic resources as well, but its semantics are extremely broad (e.g. Coverage; Description). DC prescribes the form of content for a few data elements: those best processed by machine (such as Date) or around which system functionality is desired (such as a pick-list of Formats). DC is not tied to a particular syntax, and to date has mainly been expressed in HTML <META> tags. In the future, RDF is expected to be the primary syntax for DC. In a later phase, the VIA transport syntax can be changed to a qualified, extended version of DC expressed in RDF.
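As an illustration of the HTML expression mentioned above, the following Python sketch generates META tags using the common "DC." name prefix convention. The function name dc_meta_tags and the sample record are hypothetical, and the sketch ignores qualifiers and schemes.

```python
from html import escape
from typing import Dict


def dc_meta_tags(record: Dict[str, str]) -> str:
    """Render a flat element/value mapping as HTML META tags, one per line."""
    lines = []
    for element, value in record.items():
        # Escape quotes and angle brackets so the value is safe in an attribute.
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Sample values invented for illustration.
sample = {
    "Title": "Mona Lisa",
    "Creator": "Leonardo da Vinci",
    "Format": "image/jpeg",
}
print(dc_meta_tags(sample))
```

The flat name/content pairs show why this encoding suits simple discovery but cannot by itself carry the work and surrogate hierarchy that visual resource description needs.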
Additional standards development is going on in the museum community in the
- Consortium for the Computer Interchange of Museum Information
- Art Museum Image Consortium (to which Harvard subscribes)
- Museum Digital Library Collection (of which Harvard Art Museums and the Peabody Museum are members; Katherine Jones-Garmil of the Peabody Museum is Senior Project Manager)
From that list, it is clear that no single standard will emerge any time soon, but that Harvard units and staff are actively participating in and monitoring developments.
Cataloging codes such as AACR2 are not generally accepted in the visual resources community. Many, perhaps most, visual resources lack the "intrinsic metadata" on which mainstream cataloging codes rely: title pages or other "chief sources" of information, imprints, etc. In addition, the value and meaning, and consequently the description, of visual resources tend to be more contextual than they are for print-based resources.
More agreement is possible in the use of vocabularies, but it will take time. The VIA Access Working Group has identified over 20 vocabularies (international, national, and local) currently in use in Harvard visual resource collections. These include not only the main library thesauri from the Library of Congress, but also vocabularies from the Getty Information Institute such as the Union List of Artist Names and the Art and Architecture Thesaurus, as well as many locally developed term lists. Even discussing the order of personal names or the structure of place names can reveal fundamental conceptual differences in the way various repositories think about access to their materials.
Next steps for the Access Group include
- resolving discrepancies in data mapping across repositories revealed by Phase I of VIA
- encouraging the use of standard vocabularies
- investigating whether machine-readable versions of thesauri such as the Art and Architecture Thesaurus and the Union List of Artist Names could be used to improve consistency of metadata from different units, either through pre-processing during the loading of metadata into VIA or through some kind of online functionality
- encouraging discussion of conceptual approaches to visual resource cataloging and documenting best approaches and guidelines.
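The pre-processing idea in the list above, using a machine-readable vocabulary to reconcile variant forms during loading, might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the lookup table is a tiny invented stand-in for a real thesaurus, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical variant-to-preferred table. A real one would be derived from
# a machine-readable thesaurus rather than typed in by hand.
PREFERRED: Dict[str, str] = {
    "leonardo da vinci": "Leonardo da Vinci",
    "da vinci, leonardo": "Leonardo da Vinci",
    "vinci, leonardo da": "Leonardo da Vinci",
}


def normalize_key(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def preferred_form(name: str) -> Optional[str]:
    """Return the preferred form of a name, or None if it is not in the table."""
    return PREFERRED.get(normalize_key(name))


def preprocess(names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split incoming names into normalized matches and unmatched originals."""
    matched: List[str] = []
    unmatched: List[str] = []
    for name in names:
        preferred = preferred_form(name)
        if preferred is None:
            unmatched.append(name)  # left for human review
        else:
            matched.append(preferred)
    return matched, unmatched


print(preprocess(["Vinci, Leonardo da", "  DA VINCI,  Leonardo", "Unknown Artist"]))
```

Reporting unmatched names rather than guessing keeps the decision about local headings with the contributing repository.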
There is at this time no consensus in the visual resources community about the preferred form, content, or syntax of visual resource cataloging, nor is there any model in place by which this data can be shared. Both the Dublin Core and RDF are in development and subject to change, and implementing them now would force us to use our resources in adapting to outside changes in an unstable environment. However, VIA metadata can readily be mapped into a version of the Dublin Core using the RDF syntax when that becomes appropriate. Such a format could be used as a transport syntax either during loading into VIA or if at some future date VIA becomes a Z39.50 server delivering metadata to other systems.
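A mapping of this kind is essentially a crosswalk table. The Python sketch below shows the general shape under stated assumptions: the source element names on the left are hypothetical and are not the actual VIA data elements, while the targets on the right are Dublin Core element names. Expressing the result in RDF is left out.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical source element names mapped to Dublin Core element names.
CROSSWALK: Dict[str, str] = {
    "work_title": "Title",
    "creator_name": "Creator",
    "production_date": "Date",
    "topic": "Subject",
    "repository_number": "Identifier",
}


def to_dublin_core(record: Dict[str, str]) -> Dict[str, List[str]]:
    """Map a flat source record to Dublin Core, dropping unmapped elements."""
    dc: Dict[str, List[str]] = defaultdict(list)
    for name, value in record.items():
        target = CROSSWALK.get(name)
        if target is not None:
            # Dublin Core elements are repeatable, so collect values in lists.
            dc[target].append(value)
    return dict(dc)


# Sample values invented for illustration.
source = {
    "work_title": "Mona Lisa",
    "creator_name": "Leonardo da Vinci",
    "local_note": "not mapped",
}
print(to_dublin_core(source))
```

Because Dublin Core is much coarser than a local schema, a crosswalk like this loses detail in one direction, which is one reason to treat it as a transport format rather than as the stored form.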
Harvard is actively working in the community to encourage standards development for visual resources. Hopefully, Harvard's experiences in trying to achieve some internal consensus, occurring in parallel with efforts in the visual resources and museum communities, will contribute to the evolution of standards in this area.