Dublin Core Metadata and the Cataloging Rules: Draft Final Report

Committee on Cataloging: Description and Access
Task Force on Metadata and the Cataloging Rules

Contents:
Metadata and Cataloging [for the Introduction]

Dublin Core Metadata
Background
Dublin Core Metadata’s Support of the Four User Tasks
Dublin Core Elements as Sources of Cataloging Data
Rules in Chapter 9 of AACR2

Metadata and Cataloging

[for the Introduction to the Task Force Report]

In assessing the relationship between metadata and cataloging data, we propose to examine the suitability of each to fulfilling various user tasks. These tasks have been formulated as part of an entity-relationship model by an IFLA Working Group and published as part of the document Functional Requirements for Bibliographic Records. [Note: The final text of this document is not yet available, and the following analysis is based on the draft version circulated for world-wide review in May 1996.]

The IFLA model proposes four basic user tasks:

to find entities that correspond to the user’s stated search criteria (i.e., to locate an entity in a file or database as the result of a search using an attribute or relationship of the entity);

to identify an entity (i.e., to confirm that the entity described corresponds to the entity sought, or to distinguish between two or more entities with similar characteristics);

to select an entity that is appropriate to the user’s needs (i.e., to choose an entity that meets the user’s requirements with respect to content, physical format, etc., or to reject an entity as being inappropriate to the user’s needs);

to acquire or obtain access to the entity described (i.e., to acquire an entity through purchase, loan, etc., or to access an entity electronically through an on-line connection to a remote computer).

Both cataloging data (bibliographic records) and metadata support all of these user tasks to some extent. However, the Dublin Core in particular places emphasis on resource discovery (primarily the find task, although it has elements of the identify and select tasks) and retrieval (the obtain task). Its objective is explicitly not to create a complete surrogate for the entity, and therefore the ability to identify and select a particular manifestation through the Dublin Core metadata may be significantly limited.

Cataloging data seeks to support all four user tasks (although the obtain task is mostly supported by local information that is not part of international cataloging standards). Beyond that, cataloging data is optimized to support each task and seeks to maximize the user’s chances of success. Standard cataloging principles, rules, and practices have developed over the past century, and contemporary cataloging databases embody high standards of information quality. Considerable effort has been expended by catalogers throughout the world, working cooperatively to promote this quality. In particular:

FIND: Cataloging principles promote the user’s ability to find a desired object in several ways.
1. Rules and practices seek to identify those attributes of a bibliographic entity most likely to satisfy a user’s query. These practices are based on commonly recognized principles of authorship, organizational/corporate responsibility, and the roles of various persons and bodies in the creation of various kinds of bibliographic entities; on recognized conventions for naming persons, publications, and other relevant entities; and on long experience in assessing the significance of relationships among entities.
2. Rules and practices seek to optimize retrieval by ensuring (a) that every entity has a distinct name and (b) that only one name is conventionally used for each entity. Further practices seek to ensure that all variant forms of name are retrievable. This practice is commonly known as authority control and is the single most significant contribution of catalogers to the retrieval process.
3. Rules and practices base the conventional form of names, titles, etc., on literary warrant or the common usage within the universe of bibliographic entities.
IDENTIFY: Cataloging principles promote the user’s ability to identify a retrieved entity, i.e., to determine whether that entity indeed satisfies the user’s query and to distinguish among entities with similar attributes. A number of principles and practices promote this task:
1. The basic principle of cataloging description – going back to the work of Panizzi and Cutter (if not earlier) – is that bibliographic entities are most usefully identified by the information that is contained in the entities themselves and that the most formal statements tend to be the most useful. Therefore, a bibliographic description is constructed by transcribing information from prominent sources within the item, e.g., from the title page of a book. This principle applies particularly to such information as titles, the role of various creators/contributors, and publication information. However, it also applies to such information as technical requirements for using the material. Although not all significant information about an entity can always be obtained by transcribing data, it is the best place to start. The result of such transcription is a surrogate for the entity that provides the user with a wealth of identifying information prior to selecting and obtaining a copy of the entity.
2. Considerable attention is paid in cataloging rules and practices to distinguishing among entities with similar attributes, such as various manifestations of the same work (different versions of the text or different physical manifestations). There are specific technical categories, applicable in particular to printed works, that help to identify the extent of difference among manifestations (edition, issue, printing, etc.). When a user needs to obtain a particular manifestation of an item, such information is of vital importance. The world is still working out how to apply similar concepts to electronic entities, but the ease of duplication and modification of electronic entities makes this a very important issue.
3. Beyond these basic principles and practices, cataloging conventions embody a century of experience in determining what additional data elements are useful in identifying an item and in assessing its relevance to a user’s query. Examples of such data elements are technical details of various kinds, content analysis (not only a list of contents, but indications of the presence or absence of features such as illustrations, indexes, bibliographies, etc.), and relationships to other entities (other works, persons and organizations contributing in various ways).
SELECT: Users select entities for retrieval based on a variety of factors. Many of them have to do with the identification of the specific entity or with the presence of specific features; these have been addressed in the previous section. In addition, users select on the basis of factors such as currency of information, level of treatment (textbook survey vs. scholarly monograph), extent of treatment, authoritativeness (author’s or publisher’s credentials), subject relevance. Almost any data element included in a bibliographic record may be important in influencing a user’s selection of material.
OBTAIN: In order to obtain a copy of a selected entity, the user must rely on the services provided by the entity’s owner/custodian/provider. For most material physically held in a library or repository, this means knowing the library’s identity and the identifier(s) assigned to the item (such as a call number). This is the least standardized aspect of library cataloging practices, but the wide availability of databases of cataloging information is promoting similar systems for obtaining documents on a regional, national, even global basis. With electronic information, obtaining access is perhaps even simpler. All that is needed is an accurate identifier.

Dublin Core Metadata

Background

“The Dublin Core is a 15-element metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has also attracted the attention of formal resource description communities such as museums and libraries.” [Dublin Core Metadata home page – http://purl.org/metadata/dublin_core/]

The Dublin Core (DC) is designed for maximum simplicity and flexibility. It is expected that DC metadata will be provided by the creators or distributors of the resources themselves, perhaps by filling in a form in their authoring program. On the other hand, Dublin Core can be qualified and extended to meet the requirements of a variety of users. It is theoretically possible to encode most of a fully standard AACR2 description in DC metadata elements, and it is anticipated that some metadata creators will do exactly that. However, the Dublin Core is directed at a broader and less exacting set of resource producers, and a typical set of DC metadata is likely to be less full and less rigorous in its content.

The principles governing use of the Dublin Core elements are simple and straightforward. All elements are optional; all elements are repeatable; order of elements is optional; and all elements can be qualified by language (language of the metadata) or scheme (authority or standard for the content).
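
As a rough illustration of these four principles, the following Python sketch models a Dublin Core description as an unordered list of optional, repeatable elements, each of which may carry a language or scheme qualifier. The class and field names (DCElement, DCRecord, lang, scheme) are hypothetical and are not taken from any Dublin Core specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DCElement:
    """One Dublin Core statement (hypothetical model, not an official schema)."""
    name: str                      # e.g. "TITLE", "CREATOR"
    value: str
    lang: Optional[str] = None     # language of the metadata value
    scheme: Optional[str] = None   # authority or standard for the content


@dataclass
class DCRecord:
    """A set of elements: none required, any repeatable, order not significant."""
    elements: List[DCElement] = field(default_factory=list)

    def add(self, name, value, lang=None, scheme=None):
        self.elements.append(DCElement(name, value, lang, scheme))

    def values(self, name):
        """Return every value recorded for an element (possibly none)."""
        return [e.value for e in self.elements if e.name == name]


record = DCRecord()
record.add("SUBJECT", "Cataloging", scheme="LCSH")   # qualified by scheme
record.add("SUBJECT", "metadata")                    # repeated, unqualified
print(record.values("SUBJECT"))   # ['Cataloging', 'metadata']
print(record.values("TITLE"))     # [] (TITLE is optional and absent here)
```

Because nothing is required, any consumer of such a record has to be prepared for an empty result for every element, which is one reason the reliability of the data cannot be assumed.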

Our object is to evaluate Dublin Core metadata as a source of cataloging data for records based on the Anglo-American cataloguing rules. The Task Force recognizes that metadata in general and the Dublin Core in particular have applications other than AACR2-based cataloging records. Indeed, it is arguable that Dublin Core metadata might be applied most effectively in a system designed specifically to support its use, rather than in library cataloging databases. This is one of the questions that this report will explore.

On the other hand, if library cataloging databases are to contain records for Web resources (as it is certain that they will), Dublin Core metadata contains a wealth of information that can be used in those records. We will evaluate the kinds of information that each Dublin Core element may contain and indicate how that information can be used in preparing an AACR2-based cataloging record.

Finally, we will discuss the rules in Chapter 9 of AACR2 and make recommendations about the need for changes to those rules to support the use of metadata as a source of cataloging data.

Note: Much of the argument here will make use of a distinction between metadata as a source of information and metadata as a source of cataloging data. “Source of information” is a technical term in AACR2, referring to a source from which information is transcribed in various elements of the cataloging record. In order to contrast with this technical terminology, we have used “source of cataloging data” to refer to factual data on which various elements in the cataloging record may be based; it is thus a much broader concept which includes not only exact transcription or quotation, but summarization or reformulation of the factual information by a cataloger.

Dublin Core Metadata’s Support of the Four User Tasks

Dublin Core metadata supports the four user tasks set forth in Functional Requirements for Bibliographic Records to varying degrees, but its lack of established rules and procedures governing the content of data elements makes Dublin Core elements less reliable than cataloging data. The explicit simplicity of the element set and the fact that all elements are optional also undermine the reliability of Dublin Core metadata. The following discussion notes the relevance of Dublin Core elements for each of the user tasks.

FIND: Dublin Core metadata is designed primarily to support the discovery or finding of electronic resources. The elements are intended to be the most significant pieces of information by which a user might seek an electronic resource. The elements include the TITLE, the CREATOR (author, etc.), OTHER CONTRIBUTORS, SUBJECT – elements that are likely to be primary search categories. There are other elements that are likely to be secondary or restricting features of a search (LANGUAGE, COVERAGE, FORMAT).
Although there are only limited requirements about the content of these elements, the content may be optimized in the same manner as cataloging data. For example, the content of the metadata elements may be literally identical with the same information shown in eye-readable form on the resource. And controlled vocabularies and authority control practices may be applied to the content of name (CREATOR, CONTRIBUTOR) and SUBJECT elements. According to the Dublin Core element description, “To promote global interoperability, a number of the element descriptions suggest a controlled vocabulary for the respective element values.” However, this is not a requirement, and the original intent of the Dublin Core – to capture information supplied by the authors or distributors of electronic resources – will probably apply to some extent to most random collections of metadata. Unless the metadata is created as part of a project that is able to impose its own rules for content, only minimal assumptions about the reliability of the data can be made.
IDENTIFY: The ability to identify the particular resource retrieved and to distinguish similar resources is not one of the explicit objectives of the Dublin Core. However, there is a set of elements that are related to the “instantiation” of the resource: DATE, TYPE, FORMAT, and IDENTIFIER. Unfortunately, the DATE element only provides a limited ability to distinguish versions. This is one area in which Dublin Core metadata may be inadequate to support the user’s needs.
SELECT: Once again, Dublin Core is not intended to provide all the information necessary for a user to make a selection among multiple search results. A certain amount of information is provided. In particular, the resource DESCRIPTION may be of great utility in evaluating the relevance of various resources. The COVERAGE element, if sufficiently precise, may also be useful, as (in fact) may any of the elements, given that user selection can be based on any factor. In general, however, it seems to be the assumption behind the Dublin Core that the entire resource is available for the user’s examination during the selection process. Given networked resources and a reasonably-sized set of search results, it is probably feasible for a user to examine the resources themselves. The larger the result set, however, the less efficient this becomes and the more valuable the metadata surrogates will be. Dublin Core metadata, unlike cataloging data, is not intended as a complete surrogate for the resource, but to the extent that it does represent the resource, it can be used to support selection among resources.
OBTAIN: Dublin Core is intended to support discovery and retrieval. In a networked environment, obtaining a resource should be fully supported by the inclusion of an accurate address in the IDENTIFIER element. Most of the effort in this area has gone towards assuring that the identifiers assigned to electronic resources are – and remain over time – accurate.

Both cataloging data and Dublin Core metadata support the four user tasks, although Dublin Core is only designed to support the finding and obtaining of electronic resources. On the other hand, cataloging data is optimized to support all four tasks in ways that cannot be expected of metadata. In particular, the use of controlled vocabularies and the practice of authority control enhance the ability to find, and the principle of transcription and concepts of versioning enhance the ability to identify and select desired resources. Cataloging practices add considerable value to the raw data provided by the resources described in bibliographic records, and this added value is intended to support the ability to find, identify, select and obtain desired resources.

Our cataloging databases are high-quality tools for information retrieval, but they are only as good as the standards that apply. The integrity and consistency of these databases depend on applying more or less the same standards to all records in the database. If a significant portion of the database does not reflect the same level of consistency, the database becomes unreliable. It is therefore damaging to the quality of a cataloging database to include in it records based on Dublin Core metadata unless that metadata was formulated according to cataloging principles and practices. This may be possible for metadata coming from particular projects which have been able to adopt appropriate standards. However, it is not possible for a broad range of Internet resources containing metadata provided by authors or data producers. For such resources, it would be preferable to maintain the metadata-based records in a separate database. The metadata will provide a higher level of accessibility than the Internet itself, but its lack of consistency will not damage the even higher level of quality we have invested so much in providing in our cataloging databases.

Dublin Core Elements as Sources of Cataloging Data

The official definitions of the Dublin Core metadata element set are found in “Description of Dublin Core Elements” [http://purl.org/metadata/dublin_core_elements/]. A mapping of DC elements to the USMARC fields is contained in “Dublin Core/MARC/GILS Crosswalk,” prepared by the Network Development and MARC Standards Office at the Library of Congress [http://lcweb.loc.gov/marc/dccross.html]. The following discussion is based on these sources and discusses the use of DC information in AACR2 cataloging records.

TITLE
The TITLE element corresponds to the Title Proper (AACR2 9.1B1, USMARC 245$a). The source of information for the Title Proper is the title screen or other eye-readable information. Only when there is no eye-readable information can a title be transcribed from other internal evidence such as metadata in the file header. Therefore, the metadata TITLE will usually need to be compared with the eye-readable title before it can be accepted as the Title Proper. If it is different from the eye-readable title, the metadata TITLE would be recorded as a Variant Title (USMARC 246, with a caption “$iTitle from metadata:”).
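
The comparison described above can be sketched in Python as follows. This is only an illustration: the function name map_title and the tuple layout are hypothetical, the use of subfield $a in field 246 for the variant title is our assumption, and a plain string comparison stands in for the cataloger's judgment about whether two titles really differ.

```python
def map_title(metadata_title, eye_readable_title=None):
    """Return (field, subfields) pairs for a DC TITLE value.

    Hypothetical helper: exact string comparison is a stand-in for the
    cataloger's decision about whether the two titles are different.
    """
    if eye_readable_title is None:
        # No eye-readable title: the metadata may supply the Title Proper.
        return [("245", {"a": metadata_title})]

    fields = [("245", {"a": eye_readable_title})]
    if metadata_title != eye_readable_title:
        # Record the differing metadata title as a Variant Title.
        fields.append(
            ("246", {"i": "Title from metadata:", "a": metadata_title})
        )
    return fields


print(map_title("DC home page", "Dublin Core Metadata"))
```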
CREATOR
The CREATOR element corresponds to the main and/or added entries. The USMARC Crosswalk maps this element to field 720, field 700 (if a personal name is specified) or field 710 (if a corporate name is specified).
For an AACR2-based description, the rules in Chapter 21 would need to be applied, and a main entry determined. If the CREATOR (or one of the CREATORS) is determined to be the main entry, a 1XX field would be used. If no name type is specified, the cataloger would have to determine whether the name was a personal or corporate name.
The content of this element may or may not conform to the rules for form of name in Chapters 22-24, and the name may or may not be consistent with the official form in the national authority file. In order to conform to AACR2 practice, authority work would need to be done. Since the USMARC field by itself does not indicate whether the content of the field is an authorized heading, it is particularly important that authority control procedures be built into any use of CREATOR information in cataloging records.
It should also be noted that the CREATOR element corresponds to other AACR2 elements as well, such as the statement of responsibility (AACR2 9.1F, USMARC 245$c), credits note, etc. Although the DC element is not intended as a descriptive (as opposed to an access) element, the data given in the DC CREATOR element may be very useful in describing the responsibility for creation of the resource. Although it is most likely to be formulated as an access point (e.g., an inverted personal name), it may be transcribed in brackets in the Statement of Responsibility area or in a note.
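
A minimal sketch of the crosswalk's field choice for CREATOR might look like the following. The function name, the name_type labels ("personal", "corporate") and the needs_authority_check flag are hypothetical. The sketch makes no attempt at the Chapter 21 determination of main entry, which remains a cataloger's decision.

```python
def creator_field(name, name_type=None):
    """Pick the crosswalk field for a DC CREATOR value.

    name_type is a hypothetical qualifier: "personal", "corporate", or None.
    """
    # 700 for personal names, 710 for corporate names, 720 when unspecified.
    tag = {"personal": "700", "corporate": "710"}.get(name_type, "720")
    return {
        "tag": tag,
        "name": name,
        # The field alone does not say whether the heading is authorized,
        # so every mapped name is flagged for authority work.
        "needs_authority_check": True,
    }


print(creator_field("Smith, Jane", "personal")["tag"])   # 700
print(creator_field("J. Smith")["tag"])                  # 720
```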
SUBJECT
The SUBJECT element may contain various identifiers relating to the subject of the resource, such as keywords or classification notations. The default USMARC mapping is to field 653 (Uncontrolled subject access), although specific fields such as 650 for LC Subject Headings or 050 for LC Classification numbers may be used if the metadata include identification of such subject schemes.
This element does not involve descriptive cataloging covered by AACR2, but it should be noted that this is not a transcribed element. Therefore, it may be used without further modification. Its usefulness will be determined by the specificity of the scheme identification. In a catalog that uses controlled subject headings and classification, uncontrolled keywords will be less useful than controlled headings and classification.
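
The dependence on scheme identification can be illustrated with a short Python sketch. The scheme labels "LCSH" and "LCC" are hypothetical stand-ins for whatever scheme qualifier the metadata actually carries.

```python
# Scheme labels are assumptions; the field numbers are those named above.
SUBJECT_FIELDS = {
    "LCSH": "650",   # LC Subject Headings
    "LCC": "050",    # LC Classification numbers
}


def subject_field(value, scheme=None):
    """Return (field, value); fall back to 653 when no known scheme is given."""
    return (SUBJECT_FIELDS.get(scheme, "653"), value)


print(subject_field("Cataloging", "LCSH"))   # ('650', 'Cataloging')
print(subject_field("metadata"))             # ('653', 'metadata')
```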
DESCRIPTION
The DESCRIPTION element corresponds to a Summary note (AACR2 9.7B17, USMARC 520). As with the SUBJECT element, this is not transcribed data and therefore can be used without modification in a catalog record. The usefulness of the result will depend only on the quality of the abstract or summary.
PUBLISHER
The PUBLISHER element corresponds to the Name of publisher, distributor, etc. (AACR2 9.4D, USMARC 260$b). The prescribed source for this element in AACR2, like that for the title, gives precedence to eye-readable information over information in the HTML source code. Therefore, the content of the PUBLISHER element would need to be verified; if there is no eye-readable publisher information, the metadata can be used.
CONTRIBUTOR
The OTHER CONTRIBUTORS element corresponds to the added entries. All of the points made under the CREATOR element above apply here, including the need for authority work and the use of this information as the basis for statements of responsibility and credits notes.
DATE
According to the Dublin Core definition, the DATE element contains “a date associated with the creation or availability of the resource. The recommended best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI X3.30-1985.” This corresponds in content (but not in form) to the Date of publication, distribution, etc. (AACR2 9.4F, USMARC 260$c). As with the PUBLISHER element, the prescribed source of information for this element gives priority to eye-readable information. The date would have to be verified and formatted according to 9.4F. It should also be noted that the DC DATE element is not necessarily a date of publication; “creation or availability” can cover a multitude of sins, particularly when applied by non-catalogers.
Other dates may also be recorded in this DC element, such as the date of last update (which might need to be included in a “Description based on:” note). Since Dublin Core includes little information that explicitly addresses the distinction among versions of the same resource, this element may be the only source of such data, and the information may be decidedly inadequate for this purpose.
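
To illustrate the difference in form, the following Python sketch checks whether a DATE value follows the recommended YYYYMMDD pattern and, if so, extracts the year as a candidate for the date area. The function name parse_dc_date is hypothetical. The sketch does not verify the date against the resource or apply the formatting conventions of 9.4F, both of which remain the cataloger's responsibility.

```python
from datetime import datetime


def parse_dc_date(value):
    """Parse an 8-digit YYYYMMDD string; return None if it does not fit."""
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        return None
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        # Eight digits, but not a real calendar date (e.g. month 13).
        return None


parsed = parse_dc_date("19970315")
# The year is only a candidate: the DC DATE need not be a publication date.
print(parsed.year if parsed else "no usable date; consult the resource")
```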
RESOURCE TYPE
According to the Dublin Core definition, the TYPE element contains “the category of the resource, such as home page, novel, poem, working paper, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types.” The USMARC Crosswalk maps this element to a Form/Genre term (USMARC field 655).
It should be noted that this data is relevant to several AACR2 elements. It is similar to the Designation element in the File characteristics area (AACR2 9.3B1, USMARC 256). If the list of designations in 9.3B1 is expanded as a result of the ISBD(ER) harmonization, it will be important that the DC and AACR2 lists not be in conflict. RESOURCE TYPE data may also be relevant to the note on Nature and scope (AACR2 9.7B1a, USMARC 500).
FORMAT
According to the Dublin Core definition, the FORMAT element contains “the data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principal, formats can include physical media such as books, serials, or other non-electronic media.” The USMARC Crosswalk maps this element to a subfield in field 856 (Electronic location and access).
In terms of AACR2, this element may contain data that could be included in a note on Nature and scope (AACR2 9.7B1a, USMARC 516) or on System requirements (AACR2 9.7B1b, USMARC 538). The information in the metadata would have to be rephrased when used in a note.
RESOURCE IDENTIFIER
According to the Dublin Core definition, the RESOURCE IDENTIFIER element contains a “string or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.” The USMARC Crosswalk maps this element to the URL (856$u), although other elements can be used if the appropriate scheme is identified (e.g., ISBN in field 020). Although this is vital information about any Web resource, it is not governed by any AACR2 rules (except 9.8B for the ISBN or ISSN).
SOURCE
The SOURCE element contains information about “the work, either print or electronic, from which this resource is derived.” Although the USMARC Crosswalk maps this to field 786 (Data source), the data is covered by the note on Edition and history (AACR2 9.7B7, USMARC 500 or 533) and, in the case of serial publications, the note on Other formats (AACR2 12.7B16, USMARC 776). The content of the element may need to be modified to comply with the relevant rules and to assure that the related resource is correctly identified.
LANGUAGE
The LANGUAGE element corresponds to the Language note (AACR2 9.7B2, USMARC 546), as well as to the coded language element in USMARC. The default mapping is to field 546, on the grounds that free-text information is most likely, but, if the USMARC coded scheme is identified, it can be mapped to the coded element in field 008. The content in the Language note may have to be modified to conform to the rules.
RELATION
The RELATION element contains data about the relation of the resource to other resources. This is a more general version of the SOURCE relationship and, like SOURCE, corresponds to the Edition and history and the Relationships with other serials notes. Again, the content of the element may need to be modified to comply with the relevant rules and to assure that the related resource is correctly identified.
COVERAGE
The COVERAGE element contains data about the chronological or geographic coverage of the resource. The default USMARC mapping is to field 500, but the data may be appropriate in a variety of notes. It may also serve as the basis for subject descriptors. Certain data producers and archivists have developed fairly detailed standards for defining the coverage, particularly of geo-spatial data, and there are specific USMARC fields for such data. Only general coverage notes are specified in AACR2, under the rule for notes on Nature and scope.
RIGHTS MANAGEMENT
According to the Dublin Core definition, “the content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources.” The USMARC Crosswalk maps this element to field 540 (Terms governing use and reproduction note) for which there is no corresponding rule in AACR2.
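
The USMARC fields named in the discussion above can be gathered into a single lookup table, sketched here in Python. The short element names used as keys and the helper default_field are our own shorthand. Entries are set to None where the text above names no single USMARC field, and the alternative mappings that depend on a scheme or name type are deliberately left out.

```python
# Fields as named in the discussion above; None marks elements for which
# no single USMARC field is named in this text.
DEFAULT_CROSSWALK = {
    "TITLE": "245$a",
    "CREATOR": "720",
    "SUBJECT": "653",
    "DESCRIPTION": "520",
    "PUBLISHER": "260$b",
    "CONTRIBUTOR": None,
    "DATE": "260$c",
    "TYPE": "655",
    "FORMAT": "856",        # a subfield of 856; the subfield is not named
    "IDENTIFIER": "856$u",
    "SOURCE": "786",
    "LANGUAGE": "546",
    "RELATION": None,
    "COVERAGE": "500",
    "RIGHTS": "540",
}


def default_field(element):
    """Look up the field for an element name, ignoring case."""
    return DEFAULT_CROSSWALK.get(element.upper())


print(default_field("description"))   # 520
```

A table of this kind records only where the data would go. It says nothing about whether the content must first be verified against eye-readable sources or subjected to authority work, as discussed under the individual elements.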

Rules in Chapter 9 of AACR2

Rule 9.0B1. The rule currently reads:

9.0B1. Chief source of information. The chief source of information for computer files is the title screen(s).
      If there is no title screen, take the information from other formally presented internal evidence (e.g., main menus, program statements, first display of information, the header to the file including “Subject” lines, information at the end of the file). In case of variation in fullness of information found in these sources, prefer the source with the most complete information.
      If the computer file is unreadable without processing (e.g., compressed file, printer-formatted file), take the information from the file after it has been uncompressed, printed out, or otherwise processed for use.
      If the information required is not available from internal sources, take it from the following sources (in this order of preference)
            the physical carrier or its labels
            information issued by the publisher, creator, etc., with the file (sometimes called “documentation”)
            information printed on the container issued by the publisher, distributor, etc.
      If the item being described consists of two or more separate physical parts, treat a container or its permanently affixed label that is the unifying element as the chief source of information if it furnishes a collective title and the formally presented information in, or the labels on, the parts themselves do not.
      If the information required is not available from the chief source or the sources listed above, take it from the following sources (in this order of preference)
            other published descriptions of the file
            other sources

It is probably true that metadata falls under “other formally presented internal evidence.” This would mean that metadata could be used as a substitute for the title screen (eye-readable information).

Although it might be worth considering whether metadata might be given preference over the eye-readable information, it is probably unwise to compromise the principle of transcription and to prefer a hidden source to the public source provided by the eye-readable content of the file. Therefore, we do not recommend changing the first paragraph of 9.0B1.

On the other hand, the significance of metadata in the cataloging world probably warrants adding “metadata” to the list of “other formally presented internal evidence” in the second paragraph.
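
The order of preference in 9.0B1 can be read as a simple fall-through, sketched below in Python. The source labels and the function choose_source are hypothetical paraphrases of the rule. Treating metadata as part of the internal evidence reflects the suggestion made in this report, not the current text of the rule. The sketch also ignores two provisions: the preference for the most complete internal source and the special case of a unifying container for multi-part items.

```python
# Order of preference paraphrased from rule 9.0B1 as quoted in this report.
SOURCE_PREFERENCE = [
    "title screen",
    "other internal evidence",       # menus, file header, metadata (proposed)
    "physical carrier or labels",
    "accompanying documentation",
    "container",
    "other published descriptions",
    "other sources",
]


def choose_source(available):
    """Return (source, value) for the most preferred source that has data."""
    for source in SOURCE_PREFERENCE:
        if available.get(source):
            return source, available[source]
    return None


found = {
    "other internal evidence": "Title from metadata",
    "container": "Title on box",
}
print(choose_source(found))   # ('other internal evidence', 'Title from metadata')
```

Under this reading, a metadata title is used only when no title screen exists, which matches the recommendation not to prefer a hidden source over the eye-readable content of the file.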

Other rules: {We might want to propose adding some examples, particularly in 9.7B. Are there any other rules that we ought to look at?}
