Terry Vaughn
School of Information
The University of Texas at Austin
INF 385E - Turnbull
1. Introduction
2. Past Development
3. Current Applications
4. Metadata for the Masses
5. The Semantic Web
6. Conclusion
The Internet – particularly its diverse collection of resources stored on the World Wide Web – has been described as the world's digital library, but this analogy does not stand up to serious scrutiny. Every resource in a library has been classified, cataloged, and then carefully placed in a spatial organization system of shelves, racks, and bins. Such rigor has yet to be employed with the vast majority of the Web. This boundless sea of information contains not only books and papers, but raw scientific data, images, audio and video recordings, animations, and countless other manifestations of information that have yet to be described, much less organized.
At the moment, computer technology bears most of the responsibility for organizing information on the Web. Unable to cope with the volume of information, humans have created systems that automatically classify and index resources on the Web. Developed mainly by computer scientists with little knowledge of library science, today's systems lack the neat bibliographic and syndetic structure that guides people to the information they seek (Svenonius, 2001, p. 64). In order for the Web to advance as an information platform, practices taken from library science must be implemented to improve (or in some cases limit) access to its resources. Foremost, resources on the Web must be described like resources in a library's catalog. Once described, a resource is much easier to organize, control, and preserve.
Literally meaning data about data, the term metadata refers to the description of a resource. A metadata record consists of a set of attribute descriptions that stand in place of a resource or collection of resources. It serves as a surrogate record that enables people to find salient information about a resource without searching through countless irrelevant full texts (Taylor, 1999, p. 78). Metadata can describe three aspects of a resource: content, which describes intrinsic information such as subject matter, audience, and usage; context, which describes extrinsic information such as relationships to other resources (Gill, 2000); and composition, which describes physical properties of the resource such as format and file size. Metadata can be embedded in a resource itself or contained in a separate record. Most importantly, the metadata record must follow a standard syntax and structure because automated systems (such as Web agents) rely on standards to interpret data on the Web. Despite the public availability of metadata standards, the majority of Web resources lack the metadata necessary for systems to reliably extract the routine information that a human indexer might find through a cursory inspection: author, date of publication, length of text and subject matter. A Web agent might turn up a desired article about Dublin, Ohio, but it might also find thousands of other articles in which Dublin, Ireland is mentioned in the text or in a bibliographic reference.
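The Dublin example can be made concrete with a small sketch. The record layout, field names, and sample articles below are hypothetical and serve only to contrast a search over full text with a search over a described attribute; they do not represent any particular metadata standard.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical surrogate record: described attributes plus the full text."""
    title: str
    place: str       # subject place, as a human or tool has described it
    full_text: str


# Two invented articles; only the first is actually about Dublin, Ohio.
records = [
    Record("Zoning changes in Dublin", "Dublin, Ohio",
           "The Dublin city council approved new zoning rules."),
    Record("A history of Irish publishing", "Cork, Ireland",
           "Many early presses operated in Dublin before moving south."),
]


def full_text_search(term):
    """Match the term anywhere in the text, as a naive Web agent might."""
    return [r.title for r in records if term.lower() in r.full_text.lower()]


def metadata_search(place):
    """Match only the described subject place."""
    return [r.title for r in records if r.place == place]


print(full_text_search("Dublin"))       # both articles mention Dublin
print(metadata_search("Dublin, Ohio"))  # only the article described as such
```

The full-text search returns both articles because each mentions Dublin somewhere, while the metadata search returns only the article whose description names Dublin, Ohio as its subject.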
The future of the Web will rely upon many technologies, but metadata is at its core. Metadata will enable information to be shared and processed by automated systems as well as people (Berners-Lee & Miller, 2002). Moreover, it will allow us to "explain" to these systems semantic relationships that exist among different sets of data. In this Semantic Web, a Web agent will know the difference between Dublin, Ohio and Dublin, Ireland. Berners-Lee and Miller go further to explain:
This notion of being able to semantically link various resources (documents, images, people, concepts, etc) is an important one. With this we can begin to move from the current Web of simple hyperlinks to a more expressive semantically rich Web, a Web where we can incrementally add meaning and express a whole new set of relationships (hasLocation, worksFor, isAuthorOf, hasSubjectOf, dependsOn, etc) among resources, making explicit the particular contextual relationships that are implicit in the current Web. This will open new doors for effective information integration, management and automated services (¶ 5).
Metadata, as we think of it today, came about in the late 1960s at the confluence of library science, information science, and computer science (Taylor, 1999, p. 52). At the Library of Congress, Henriette Avram developed the MARC (MAchine Readable Cataloging) format, which enabled computers to read bibliographic records. The MARC format assigns numeric codes to the different attributes of a resource. For example, MARC code 100 identifies a resource's author. Like most computing technology of that era, MARC, with all its esoteric codes, worked well with computers but required specialized knowledge of its catalogers.
As a result, more people-friendly encoding schemes arose with the development of Standard Generalized Markup Language (SGML). SGML has been accepted as an international standard for document markup. It specifies rules for creating markup languages that describe the structure of a document so that documents may be interchanged across computer platforms. Instead of using codes, SGML applies tags to discrete pieces of information. For example, an author's name is preceded by the tag "<author>" and followed by the tag "</author>".
Since SGML is fairly complex, a subset of the language known as XML (eXtensible Markup Language) has become the de facto markup language for the Web. XML was created so that richly structured documents could be used over the Web. According to Walsh (1998), "XML specifies neither semantics nor a tag set. In fact, XML is really a meta-language for describing markup languages. In other words, XML provides a facility to define tags and the structural relationships between them" (¶ 6).
An advantage of XML is that even though it is machine readable, it is also human-legible. XML is clear and concise, and it is relatively straightforward to make an XML document (W3C). As for metadata, it can be encoded in the same XML document it describes.
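As a minimal sketch of metadata encoded in the same XML document it describes, the following Python example parses a small document with the standard library. The tag names (article, metadata, author, date, subject) and their values are invented for illustration; XML itself prescribes none of them.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document that carries its own metadata section.
doc = """<article>
  <metadata>
    <author>Jane Doe</author>
    <date>2003-10-21</date>
    <subject>Metadata</subject>
  </metadata>
  <body>
    <para>Text of the article goes here.</para>
  </body>
</article>"""

root = ET.fromstring(doc)

# Collect each child of <metadata> as an attribute/value pair.
metadata = {child.tag: child.text for child in root.find("metadata")}
print(metadata)
```

The same text that a person can read at a glance is also parsed by the program into attribute and value pairs, which is the dual legibility described above.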
Unlike XML, HTML (HyperText Markup Language), which is itself an application of SGML, predefines a tag set and each tag's corresponding semantics. HTML provides a means to embed metadata using its meta tag. The meta tag is used to embed document information not defined by other HTML elements. It can identify properties of a document (author, expiration date, a list of key words, etc.) and assign values to those properties. Unfortunately, Web publishers have abused the meta tag by falsifying its properties to skew search engine rankings, so most search engines no longer rely on meta tag information to index resources. Nevertheless, meta tags can still be a powerful tool within the controlled environments of intranets and content management systems.
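The way an automated system might read meta tags can be sketched with Python's standard HTML parser. The sample page and the MetaTagParser class are hypothetical; a tool running inside a trusted intranet could collect properties along these lines, although this is only one possible approach.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects property/value pairs from the meta tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # A property is named by either the name or the http-equiv attribute.
        key = attributes.get("name") or attributes.get("http-equiv")
        value = attributes.get("content")
        if key and value is not None:
            self.properties[key.lower()] = value


# An invented page with author, keyword, and expiration metadata.
page = """<html><head>
<title>Sample page</title>
<meta name="author" content="Jane Doe">
<meta name="keywords" content="metadata, Dublin Core, Semantic Web">
<meta http-equiv="expires" content="Fri, 31 Dec 2004 23:59:59 GMT">
</head><body><p>Body text.</p></body></html>"""

parser = MetaTagParser()
parser.feed(page)
for name, content in parser.properties.items():
    print(name, "=", content)
```

Nothing in this process verifies that the values are truthful, which is why such records are dependable only where publishers can be trusted.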
In addition to HTML documents, the Web holds a universe of diverse resources yet to be described: proprietary document formats, images, audio, video. It would be impossible for specialists to describe and catalog the Web, so a need was recognized in the early-to-mid 1990s for a simple standard that people could employ to describe the resources they publish on the Web. Out of that need the Dublin Core Metadata Initiative (DCMI) established a standard for describing a wide range of resources. The Dublin Core standard has two levels: Simple and Qualified. Simple Dublin Core consists of fifteen elements, while Qualified Dublin Core includes one additional element; among those, several have qualifiers that refine the semantics of the description (Hillmann, 2003).
Dublin Core can be implemented in any markup language a person may choose. HTML can be used to express Dublin Core, but RDF/XML is a more flexible schema for describing resources. According to Hillmann in the Dublin Core Usage Guide (2003):
RDF (Resource Description Framework) allows multiple metadata schemes to be read by humans as well as parsed by machines. It uses XML (EXtensible Markup Language) to express structure thereby allowing metadata communities to define the actual semantics. This decentralized approach recognizes that no one scheme is appropriate for all situations, and further that schemes need a linking mechanism independent of a central authority to aid description, identification, understanding, usability, and/or exchange (Sub-section 2.2, ¶ 1).
The ultimate goal of Dublin Core is to provide an application template or form that Web publishers fill out each time they upload a resource to the Web. To further improve resource discovery, DCMI also recommends using terms from a controlled vocabulary like the US Library of Congress Subject Headings (LCSH) and the US National Library of Medicine Medical Subject Headings (MeSH) in the content data of some elements.
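A filled-out description of this kind might be serialized as RDF/XML roughly as follows. This is a sketch under stated assumptions: the two namespace URIs are the published RDF and Dublin Core element namespaces, but the dublin_core_record function, the resource URI, and all field values are hypothetical, and the subject value merely stands in for a term drawn from a controlled vocabulary.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def dublin_core_record(resource_uri, fields):
    """Build an RDF/XML description from Simple Dublin Core element names."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_uri}
    )
    for element, value in fields.items():
        ET.SubElement(description, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(rdf, encoding="unicode")


# Hypothetical values a Web publisher might enter into a form.
record_xml = dublin_core_record(
    "http://example.org/papers/sample.html",
    {
        "title": "A sample paper on metadata",
        "creator": "Jane Doe",
        "subject": "Metadata",   # assumed to come from a controlled vocabulary
        "date": "2003-10-21",
        "format": "text/html",
    },
)
print(record_xml)
```

Because the element names come from a shared standard, a tool that has never seen this publisher's site can still recognize which value is the creator and which is the subject.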
So far, the Web has mostly been a medium of documents written by humans for humans, rather than of data that machines can process automatically (Berners-Lee, Hendler & Lassila, 2001, p. 2). Current Web pages mix content data with presentation data, so agents have to strip out formatting, pictures, and advertisements to get to the real content. This process is prone to error (Berners-Lee & Miller, 2002, ¶ 4). The original purpose of the Web and its underlying data structure was to convey the meaning of information, not its presentation (Nielsen, 2000, p. 36). The Semantic Web aims to make up for lost ground by assigning well-defined meaning to information so agents can carry out sophisticated tasks for people. That meaning will be enabled by metadata.
Three fundamental technologies for developing the Semantic Web already exist: the Uniform Resource Identifier (URI), XML, and RDF. URIs uniquely identify resources on the Web and make it possible to express relationships among them. XML provides a syntax for interoperability. RDF provides a model that leverages URIs and XML to express meaning (Berners-Lee et al., 2001, p. 3).
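The RDF model expresses meaning as statements of three parts: a subject, a relationship, and an object, each of which can be identified by a URI. The sketch below imitates that model with plain Python tuples rather than an RDF library. Every URI in it is hypothetical, and the relationship names hasLocation and isAuthorOf are borrowed from the examples Berners-Lee and Miller give.

```python
# Hypothetical URIs; each one identifies exactly one resource or relationship.
DUBLIN_OHIO = "http://example.org/places/dublin-ohio"
DUBLIN_IRELAND = "http://example.org/places/dublin-ireland"
HAS_LOCATION = "http://example.org/terms/hasLocation"
IS_AUTHOR_OF = "http://example.org/terms/isAuthorOf"

ZONING_ARTICLE = "http://example.org/articles/zoning"
PUBLISHING_ARTICLE = "http://example.org/articles/publishing"
JANE_DOE = "http://example.org/people/jane-doe"

# Each statement is a (subject, relationship, object) triple.
triples = {
    (ZONING_ARTICLE, HAS_LOCATION, DUBLIN_OHIO),
    (PUBLISHING_ARTICLE, HAS_LOCATION, DUBLIN_IRELAND),
    (JANE_DOE, IS_AUTHOR_OF, ZONING_ARTICLE),
}


def subjects(relationship, obj):
    """Return every subject that stands in the given relationship to obj."""
    return sorted(s for s, r, o in triples if r == relationship and o == obj)


print(subjects(HAS_LOCATION, DUBLIN_OHIO))
print(subjects(IS_AUTHOR_OF, ZONING_ARTICLE))
```

Since the two Dublins have different URIs, an agent querying for resources located in Dublin, Ohio cannot confuse them, however similar their names appear in running text.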
The real power of the Semantic Web will be realized when people create applications that collect Web content from diverse sources, process the information and exchange the results with other programs. The effectiveness of such software agents will increase exponentially as more machine-readable Web content and automated services (including other agents) become available. The Semantic Web promotes this synergy; even agents that were not expressly designed to work together can transfer data among themselves when the data come with semantics.
As we have discovered in this paper, the Web has a long way to go before it will reach its full potential (as we can conceive of it now). The Web and metadata both rely upon standards to advance. As standards are adopted, the Web will become a more efficient medium for information organization, access, and control.
Services like Google automatically index resources on the Web, but their vast scope and inability to grasp semantics leave much to be desired. Richer records, created by content experts, are necessary to improve search and retrieval. Formal standards (such as the TEI Header and MARC cataloging) provide the necessary richness, but such records are time-consuming to create and maintain, and hence may be created for only the most important resources. They are well beyond the realm of the lay Web publisher.
An alternative solution that promises to mediate these extremes involves the creation of a record that is more informative than an index entry but is less complete than a formal cataloging record. If only a small amount of human effort were required to create such records, more resources could be described, especially if the author of the resource could be encouraged to create the description. And if the description followed an established standard, only the creation of the record would require human intervention; automated tools could discover these descriptions and collect them.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The semantic web. Scientific American.com. Retrieved September 17, 2003, from http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2
Berners-Lee, T., & Miller, E. (2002, October). The semantic web lifts off. ERCIM News Online Edition, 51. Retrieved September 17, 2003, from http://www.ercim.org/publication/Ercim_News/enw51/berners-lee.html
Gill, T. (2000). Defining metadata. Retrieved October 21, 2003, from Getty Research Institute Web site: http://www.getty.edu/research/conducting_research/standards/intrometadata/
Hillmann, D. (2003). Using Dublin Core. Retrieved October 21, 2003, from DCMI Web site: http://dublincore.org/documents/usageguide/
Nielsen, J. (2000). Designing web usability. Indianapolis, IN: New Riders Publishing.
Svenonius, E. (2001). The intellectual foundation of information organization. Cambridge, MA: The MIT Press.
Taylor, A. G. (1999). The organization of information. Englewood, CO: Libraries Unlimited.
Walsh, N. (1998). What is XML? Retrieved October 21, 2003, from O’Reilly XML.com Web site: http://www.xml.com/pub/a/98/10/guide1.html