What Is XML?

As implied by its name, XML is a markup language. It shares many characteristics with its more familiar cousin, the HyperText Markup Language (HTML), which has become wildly popular as the core technology enabling the World Wide Web and web browsers. The languages have common origins in document markup, a technique that is as old as the printing and publishing business. When a complex document, such as this book or a newsletter or a magazine, is to be printed, it can be thought of as having two related logical parts. The content of the document, which usually consists of text and graphics, contains its meaning. The structure of the document (titles, subtitles, paragraphs, captions) and the accompanying formatting (fonts, indentations, page layouts) help to organize the contents and ensure that they are presented in a meaningful way. Since the earliest days of printing and publishing, editors have employed markup symbols and formatting marks, embedded within the contents of the document itself, to indicate the document’s structure and how it should be formatted for printing.

When computerized publishing systems arrived on the scene, markup commands embedded within the contents of a document became instructions for the publishing software programs. Each type of publishing software or equipment had its own proprietary markup commands, making it difficult to move from one system to another. The Standard Generalized Markup Language (SGML) was developed as a way to standardize markup languages, and eventually was adopted as an ISO standard. More precisely, SGML is a metalanguage for defining specific markup languages. Its inventors recognized that no single markup language could cover all of the possible markup requirements, but that all markup languages had common elements. By standardizing these common elements, a family of closely related markup languages could be created. HTML is one such markup language, focused especially on the use of hypertext to link documents together. XML is another such language, focused especially on strong typing and tight structuring of document contents. Their common roots in SGML make HTML and XML cousin languages, and account for their similarity.

Both HTML and XML are World Wide Web Consortium (W3C) recommendations, defined by specifications that are developed by, voted on, and then published by the W3C. The W3C is an independent, nonprofit consortium whose purpose is to develop and advocate the use of standards associated with the Internet and the World Wide Web. W3C recommendations have “officially adopted” status; the terminology means that the W3C organization advocates and recommends their use. Through this process, HTML and XML are vendor-independent industry standards.

HTML was the first SGML-based language to gain widespread popularity. The contents of web pages on the World Wide Web are expressed as HTML documents. Special markup elements within an HTML document, called tags, indicate graphical elements, such as buttons, that are to be displayed by a web browser. The tags also describe the hypertext links to other documents that the browser should follow when a button or link is clicked. Other tags identify graphical elements that are to be inserted into the HTML text when it is displayed.
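To make the role of tags concrete, the following minimal sketch uses Python's standard html.parser module to pick the hypertext links out of a small page. The sample page, the LinkCollector class, and the collect_links function are hypothetical names invented for this illustration; they do not come from any particular web site or product.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href target of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Hypothetical sample page: one heading, two links, one inserted image.
SAMPLE_PAGE = """
<html>
  <body>
    <h1>Product Catalog</h1>
    <p>See the <a href="orders.html">orders page</a> or the
       <a href="customers.html">customer list</a>.</p>
    <img src="logo.gif">
  </body>
</html>
"""


def collect_links(html_text):
    """Return the link targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


print(collect_links(SAMPLE_PAGE))  # ['orders.html', 'customers.html']
```

The a tags carry the links, while the img tag names a graphic to be inserted; a browser reads the same tags to decide what to display and where a click should lead.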

As the use of the World Wide Web exploded in the 1990s, HTML was rapidly adapted to display much richer content on highly formatted web pages. HTML tags were quickly invented to control the formatting of web pages, directing the display of boldface or italic text, centering and indents, and text location within the page. In some cases, these tags were even unique to a specific web browser, such as the Netscape browser or Microsoft’s Internet Explorer. Over time, a great deal of the markup within an HTML page became focused on formatting and presentation of information. This had the benefit that web page formatting was tightly specified, so pages tended to be displayed in the same way regardless of the browser or device on which they were displayed.

This focus on formatting had the disadvantage that the logical structure of web page content tended to get lost in the formatting and presentation details.

An important original goal of SGML was that a given logical element, such as a page title or a web page subsection, could be consistently identified across hundreds of documents (for example, across hundreds of pages on a web site). A simple directive to the browser, such as “display all subsection titles in blue, boldfaced, 16-point Times New Roman font,” would then ensure consistent presentation of all pages. Instead, web page authors tended to explicitly mark every element, such as those subsection titles, with its own detailed formatting instructions. These could easily become inconsistent, and worse, a change to the formatting instructions would require hundreds of individual page edits rather than being specified once for all pages.
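The idea of identifying a logical element once and formatting it everywhere from a single directive can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (page, subsection-title, para), the STYLE_RULES table, and the render function are all hypothetical, and the "rendering" simply labels each piece of text with the style it would receive.

```python
import xml.etree.ElementTree as ET

# Hypothetical page marked up by logical role, with no formatting details.
PAGE = """
<page>
  <title>Order Processing</title>
  <subsection-title>Entering Orders</subsection-title>
  <para>Orders are entered by salespeople.</para>
  <subsection-title>Shipping Orders</subsection-title>
  <para>Orders ship from the warehouse.</para>
</page>
"""

# One directive per logical element type, stated once for all pages.
STYLE_RULES = {
    "subsection-title": "blue, bold, 16-point Times New Roman",
}


def render(xml_text, rules):
    """Label each element's text with the style assigned to its tag."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if element is root:
            continue  # the enclosing <page> element has no text of its own
        style = rules.get(element.tag, "default")
        text = (element.text or "").strip()
        lines.append(f"[{style}] {text}")
    return lines


for line in render(PAGE, STYLE_RULES):
    print(line)
```

Changing the single entry in STYLE_RULES restyles every subsection title at once, which is exactly the edit that becomes hundreds of individual page changes when each title carries its own formatting instructions.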

One of the main driving forces behind the development of XML was to restore a more logical-level, rather than formatting-level, approach to markup. XML implements much more rigid rules about document structure than HTML. Most of its components and capabilities are squarely focused on representing logical document structure. Companion standards, such as XML Schema, which specifies types of documents, extend this focus of XML even further.
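One visible consequence of XML's more rigid rules is that every opening tag must have a matching closing tag, whereas HTML lets a paragraph tag go unclosed. The sketch below uses the XML parser in Python's standard library to check this; the is_well_formed function and the two sample strings are hypothetical names chosen for the illustration, and the check covers only basic well-formedness, not validation against an XML Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# HTML-style markup: the <p> tags are never closed.
LOOSE = "<page><p>First paragraph<p>Second paragraph</page>"

# The same content with every element explicitly closed.
STRICT = "<page><p>First paragraph</p><p>Second paragraph</p></page>"

print(is_well_formed(LOOSE))   # False
print(is_well_formed(STRICT))  # True
```

Because the structure is always explicit, a program reading an XML document never has to guess where one logical element ends and the next begins.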

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.

