Cave Survey Data in XML
Worldwide, cave surveys collect and store masses of data in a variety of electronic formats each year. Frequently the software product used to render the data dictates the format. This limits interoperability because data, the common denominator of all applications, cannot be read directly by one program after being written by another. As a result, the only ways to move data from one program to another are re-keying the survey data, writing new software to read a legacy format, or developing data translation utilities.
To stimulate discussion and cooperation within the community of people interested in the collection and electronic storage of cave data, this article gives a brief introduction to XML fundamentals. Building upon those fundamentals, it then introduces a simple XML document intended to support discussion and provoke thought toward the creation of an XML standard for the cave surveying community.
The goal of data entry should be the complete and accurate electronic representation of raw data for the purpose of communication, processing and archival. Placing controls upon the practice of data representation, as it concerns a logical class or group of data, is a fundamental tenet of data management. A cooperative effort to establish the norms of data representation is necessary if the quality of data as a whole is to improve.
The rapid spread of XML creates a new opportunity to store, share and represent cave survey data across existing and future software products. XML stands for eXtensible Markup Language and is a human readable tagging language much like HTML (of internet fame). Unlike HTML however, which focuses on the display of data and how it looks, XML was designed to structure, store and send data. XML focuses on what the data is, not how it appears. XML is about describing information and is not a replacement for HTML.
Storing data in XML creates information that can be read by many different types of applications. XML stores data in plain text files, and this simplicity makes it easy to exchange data between incompatible systems. XML supports Unicode, which makes the data internationally transportable. Since XML is independent of hardware and software, your data is made available to a wider audience of existing (and yet to exist) applications. Other clients and applications can access your XML files as data sources, as if they were accessing databases.
XML can also be used to store data in databases. Applications can be written to send and retrieve information from the store, and generic applications can be used to display the data. Finally, XML can be used to create new languages. For instance the Wireless Markup Language, WML, is used to mark up Internet applications for handheld devices like mobile phones. WML is written in XML. A whole host of possible acronyms come to mind for a caving markup language, the most obvious being CaML.
But XML does have its downside. It can be verbose; in some cases XML doubles or triples the size of the data file. Careful planning can minimize this aspect of XML, and the benefits of a common markup framework far outweigh this small drawback. With today's rapid pace of technology, hard disk space, memory and bandwidth are rarely a serious constraint for files of this kind. But for all the benefit XML delivers, it still doesn't solve the human issue of "cooperation". In order for an XML based markup language to serve a community of users well, the members of that community must develop their standard cooperatively.
An example of XML
The following XML document is a simple note to Jane from Jack, stored as XML:
Exhibit 1. A simple note expressed in XML
The note has a heading and a message body, indicated by the <heading> and <body> tags. It also has receiver and sender information, carried in the <to> and <from> tags. The XML file is just pure information wrapped in XML tags and incapable of doing anything on its own. Someone must write a piece of software to do anything with it, but because the data is stored in a common form, several software developers can write complementary applications based upon a common starting point.
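The exhibit itself is not reproduced in this text, so the following is only a sketch of what such a note might look like and how a small Python program could read it. The tag names <to>, <from>, <heading> and <body> come from the description above; the <note> root element, the message wording and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the note. Only the tag names <to>, <from>,
# <heading> and <body> come from the article; the <note> root element and
# the message text are assumptions.
NOTE_XML = """<?xml version="1.0"?>
<note>
  <to>Jane</to>
  <from>Jack</from>
  <heading>Reminder</heading>
  <body>Bring the survey notes to the next meeting.</body>
</note>"""


def summarize_note(xml_text):
    """Return a one-line summary built from the note's child elements."""
    note = ET.fromstring(xml_text)
    return "{} -> {}: {}".format(
        note.findtext("from"), note.findtext("to"), note.findtext("heading")
    )


print(summarize_note(NOTE_XML))
```

The XML carries the information; the few lines of Python are the "piece of software" that does something with it. A different program could read the same file for a different purpose without any change to the data.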
An example of how cave survey data could be represented in XML follows a similar form to the note above. Consider exhibit 2, an example of real cave survey data stored in XML, but reduced in scope for brevity. Notice that XML documents use a self-describing and simple syntax.
Exhibit 2. Cave survey data stored in XML
The first line in exhibit 2, the XML declaration, defines the XML version of the document. In this case the document conforms to the 1.0 specification of XML. The next line describes the root element of the document, <caveSurvey>. This is the equivalent of declaring this document to be a "cave survey".
The next six lines contain six child elements of the root (caveName, surveyName, surveyDate, surveyTeam, surveyComment and surveyData). They identify elements of data commonly collected during a cave survey.
Exhibit 3. XML elements describing typical cave survey data
The <surveyData> element, a child of the root element <caveSurvey>, also has children of its own. In this case those elements are shots, and each <shot> element in turn has its own children. The five lines after each <shot> tag describe elements of a typical compass and tape survey shot (Exhibit 4).
Exhibit 4. Data common to a compass and tape survey captured in XML
In each of the previous exhibits the end of each element is indicated by a closing tag with the same name as the opening tag, but prefixed with a slash inside the brackets. For instance the <shot> element is ended by the closing tag </shot>. A second <shot> element immediately after this marks the beginning of a new series of data describing the next shot in the survey. In a real survey file this would continue until all shots associated with this particular "<surveyName>" had been recorded.
Finally, the last line of exhibit 2 defines the end of the root element, </caveSurvey>, and the end of the XML data file. Even with an application no more sophisticated than a simple text editor it is a pretty simple thing to interpret the contents of an XML data file.
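As a rough illustration of the structure just described, the sketch below embeds a small cave survey document in a Python string and reads its shots. The element names follow the article's description; every value (cave name, date, stations, measurements) is a made-up placeholder, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Element names follow the article; all values are made-up placeholders.
SURVEY_XML = """<?xml version="1.0"?>
<caveSurvey>
  <caveName>Example Cave</caveName>
  <surveyName>A</surveyName>
  <surveyDate>2000-01-01</surveyDate>
  <surveyTeam>Surveyor One, Surveyor Two</surveyTeam>
  <surveyComment>Placeholder comment</surveyComment>
  <surveyData>
    <shot>
      <fromStation>A1</fromStation>
      <toStation>A2</toStation>
      <distance>12.5</distance>
      <foreAzimuth>45.0</foreAzimuth>
      <foreInclination>-3.0</foreInclination>
    </shot>
    <shot>
      <fromStation>A2</fromStation>
      <toStation>A3</toStation>
      <distance>8.0</distance>
      <foreAzimuth>90.0</foreAzimuth>
      <foreInclination>5.5</foreInclination>
    </shot>
  </surveyData>
</caveSurvey>"""


def read_shots(xml_text):
    """Return one dictionary per <shot>, with numeric fields converted."""
    root = ET.fromstring(xml_text)
    shots = []
    # Each <shot> is a child of <surveyData>, which is a child of the root.
    for shot in root.findall("surveyData/shot"):
        shots.append({
            "from": shot.findtext("fromStation"),
            "to": shot.findtext("toStation"),
            "distance": float(shot.findtext("distance")),
            "azimuth": float(shot.findtext("foreAzimuth")),
            "inclination": float(shot.findtext("foreInclination")),
        })
    return shots


for shot in read_shots(SURVEY_XML):
    print(shot["from"], "->", shot["to"], shot["distance"])
```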
XML elements are extensible.
One of the strongest benefits of XML is the ability to extend documents to carry more information. Look at the following expanded cave survey example:
Exhibit 5. XML documents can be extended to carry additional data
Imagine that an application has been designed to work with the earlier simple XML example and it extracts the <fromStation>, <toStation>, <distance>, <foreAzimuth>, and <foreInclination>. With that data the application performs the typical functions of a cave data reduction program. Now imagine new standards of cave survey have created the need to include additional data in the XML file. This results in the second, expanded cave survey data file. The author of the new XML file rearranges the schema somewhat and includes <azimuth>, <inclination>, <backAzimuth>, <backInclination>, and <shotComment> tags and data.
Should the changes in the new data format cause the original application to crash? No, provided the application locates elements by name rather than by their position in the file, it can still find the elements it needs: <fromStation>, <toStation>, <distance>, <foreAzimuth>, and <foreInclination>. It cannot, however, take advantage of the new data added to the XML file. That is the job of a new software program written to read the expanded file.
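The point can be sketched in a few lines of Python. The exact layout of the expanded file is not reproduced in this text, so the nesting shown here (fore and back readings grouped under <azimuth> and <inclination>) is an assumption, as are all values and the function name. The reader searches for elements by name anywhere beneath <shot>, so it returns the same result for both layouts.

```python
import xml.etree.ElementTree as ET

# A shot in the original, simple layout (values are placeholders).
SIMPLE_SHOT = """<shot>
  <fromStation>A1</fromStation>
  <toStation>A2</toStation>
  <distance>12.5</distance>
  <foreAzimuth>45.0</foreAzimuth>
  <foreInclination>-3.0</foreInclination>
</shot>"""

# The same shot in an assumed expanded layout with regrouped elements.
EXPANDED_SHOT = """<shot>
  <fromStation>A1</fromStation>
  <toStation>A2</toStation>
  <distance>12.5</distance>
  <azimuth>
    <foreAzimuth>45.0</foreAzimuth>
    <backAzimuth>225.0</backAzimuth>
  </azimuth>
  <inclination>
    <foreInclination>-3.0</foreInclination>
    <backInclination>3.0</backInclination>
  </inclination>
  <shotComment>Placeholder comment</shotComment>
</shot>"""

# The five elements the original application was written to use.
NEEDED = ("fromStation", "toStation", "distance",
          "foreAzimuth", "foreInclination")


def extract_original_fields(shot_xml):
    """Find each needed element by name at any depth below <shot>."""
    shot = ET.fromstring(shot_xml)
    # ".//name" matches descendants at every level, so extra elements and
    # extra nesting are simply ignored.
    return {name: shot.findtext(".//" + name) for name in NEEDED}


print(extract_original_fields(SIMPLE_SHOT) ==
      extract_original_fields(EXPANDED_SHOT))
```

A reader that instead relied on "the fourth child of <shot>" would break on the expanded file, which is why name-based lookup is the safer habit.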
DTDs and their role.
Because XML is free and extensible, XML tags are not predefined. Instead, you must define your own tags. This is in contrast to HTML, where the tags are defined by standards (e.g. HTML 4.0), and the author of an HTML document can only use tags defined in the standard. The tags in the cave survey example above (e.g. <caveSurvey> and <foreInclination>) are not defined in any XML standard. Instead these were created for the purpose of this article. In order for these tags to become widely useful however the tags and their relationships need to be recorded.
When an author (or community of authors) defines the elements and structure they intend to use within a document, those decisions must be preserved for later reference. This record is kept in a DTD (Document Type Definition). XML uses DTDs to describe elements of data and their relationships. An XML document that complies with a particular DTD is self-describing and "valid".
Exhibit 6. A DTD can be used to validate the form of XML data.
The first line in exhibit 6 uses the statement - "<!DOCTYPE caveSurvey [", to declare the document to be a type of caveSurvey. The line after this defines the element caveSurvey and enumerates its children:
Exhibit 7. The valid elements of caveSurvey are enumerated in the DTD.
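Note that the last element in this group, surveyData, is post-fixed with an asterisk. In a DTD the asterisk is an occurrence indicator: it means the element may appear zero or more times at that point. The children of surveyData are listed in its own ELEMENT statement. In this case the child is the element "shot", also post-fixed with an asterisk because a survey may contain any number of shots, and shot in turn has children of its own.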
Exhibit 8. The DTD lists children of shot, and illustrates shot's relationship to surveyData.
These first three statements (caveSurvey, surveyData and shot) define the hierarchical structure that an XML document must possess before it can be validated by this DTD. After these structural statements are made, the DTD goes on to define the specific elements and the type of information they will carry.
Exhibit 9. The DTD also identifies the type of data carried by each element.
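By now ELEMENT and caveName should be self-explanatory. PCDATA means "Parsed Character Data" and tells the XML reader that the content of the caveName element is text, which the reader parses for markup and entity references. A related term is CDATA, or Character Data. In a DTD it is the type given to attribute values, and within a document a CDATA section marks text that the XML reader passes through without interpreting it as markup.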
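The hierarchy those structural statements describe can also be checked by hand. The parsers in Python's standard library do not validate against a DTD, so the sketch below is only a simplified stand-in for real DTD validation: it checks element names and order, using the element lists given in this article. The function name and error messages are hypothetical.

```python
import xml.etree.ElementTree as ET

# Child names, in order, as listed in the article.
SURVEY_CHILDREN = ["caveName", "surveyName", "surveyDate",
                   "surveyTeam", "surveyComment"]
SHOT_CHILDREN = ["fromStation", "toStation", "distance",
                 "foreAzimuth", "foreInclination"]


def structure_errors(xml_text):
    """Return a list of problems; an empty list means the hierarchy matches."""
    root = ET.fromstring(xml_text)
    if root.tag != "caveSurvey":
        return ["root element is <%s>, expected <caveSurvey>" % root.tag]

    errors = []
    tags = [child.tag for child in root]
    header = tags[:len(SURVEY_CHILDREN)]
    rest = tags[len(SURVEY_CHILDREN):]
    if header != SURVEY_CHILDREN:
        errors.append("caveSurvey children are missing or out of order")
    # After the descriptive elements, zero or more surveyData may follow.
    if any(tag != "surveyData" for tag in rest):
        errors.append("only surveyData may follow surveyComment")

    for data in root.findall("surveyData"):
        for child in data:
            if child.tag != "shot":
                errors.append("unexpected <%s> in surveyData" % child.tag)
            elif [c.tag for c in child] != SHOT_CHILDREN:
                errors.append("shot children are missing or out of order")
    return errors


print(structure_errors("<caveSurvey/>"))
```

A validating XML parser performs this kind of check automatically from the DTD, which is the reason to record the structure there rather than in each application's code.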
DTD statements such as those in exhibit 6 can be associated with an XML file via one of two methods. For small, relatively simple XML documents it's easiest to include the DTD within the XML document itself as header information, similar to the following example.
Exhibit 10. DTD statements can be included within the XML data file.
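The original exhibit is not reproduced in this text. The sketch below reconstructs a plausible internal DTD from the description in this article (the element lists and the asterisks on surveyData and shot) and places it in the header of a small document; the exact declarations and all values are assumptions. Python's expat parser does not validate, but it does report each ELEMENT declaration it reads, which is enough to inspect the DTD. The helper names are hypothetical.

```python
import xml.parsers.expat
from xml.parsers.expat import model

# Reconstructed DTD embedded in the document header (an assumption based on
# the article's description, not the author's original exhibit).
DOC = """<?xml version="1.0"?>
<!DOCTYPE caveSurvey [
  <!ELEMENT caveSurvey (caveName, surveyName, surveyDate, surveyTeam,
                        surveyComment, surveyData*)>
  <!ELEMENT surveyData (shot*)>
  <!ELEMENT shot (fromStation, toStation, distance, foreAzimuth,
                  foreInclination)>
  <!ELEMENT caveName (#PCDATA)>
  <!ELEMENT surveyName (#PCDATA)>
  <!ELEMENT surveyDate (#PCDATA)>
  <!ELEMENT surveyTeam (#PCDATA)>
  <!ELEMENT surveyComment (#PCDATA)>
  <!ELEMENT fromStation (#PCDATA)>
  <!ELEMENT toStation (#PCDATA)>
  <!ELEMENT distance (#PCDATA)>
  <!ELEMENT foreAzimuth (#PCDATA)>
  <!ELEMENT foreInclination (#PCDATA)>
]>
<caveSurvey>
  <caveName>Example Cave</caveName>
  <surveyName>A</surveyName>
  <surveyDate>2000-01-01</surveyDate>
  <surveyTeam>Surveyor One</surveyTeam>
  <surveyComment>Placeholder comment</surveyComment>
  <surveyData></surveyData>
</caveSurvey>"""


def declared_elements(xml_text):
    """Map each declared element name to expat's content-model tuple."""
    declarations = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_element_decl(name, content_model):
        declarations[name] = content_model

    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)
    return declarations


def repeatable_children(content_model):
    """Names of direct children marked with '*' (zero or more)."""
    _type, _quant, _name, children = content_model
    return [c[2] for c in children if c[1] == model.XML_CQUANT_REP]


decls = declared_elements(DOC)
print(sorted(decls))
print(repeatable_children(decls["caveSurvey"]))
```

To actually validate a document against its DTD, a validating parser is needed; that is outside what the standard library parsers used here provide.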
For more extensive DTDs it may be desirable to place a reference inside the XML file to a DTD located in another file, as in the following example.
Exhibit 11. XML files can reference DTD statements in external files.
The system identifier in the DOCTYPE declaration, SYSTEM "caveSurvey.dtd", tells the XML reader where to find the DTD needed to validate this XML file; here it names a file stored alongside the XML document. This avoids the need to repeat the same DTD information in every XML file, thereby reducing file size. The system identifier can also refer to a DTD file located on another computer entirely, as in the DOCTYPE declaration in exhibit 12.
Exhibit 12. DTD statements may be retrieved from remotely located machines.
This statement would allow the XML reader to retrieve the appropriate DTD for application to this XML file from a web server on the network.
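The two forms of external reference can be sketched as follows. The local file name caveSurvey.dtd comes from the text above; the remote address is a placeholder on example.org, since the original address is not reproduced here. Python's expat parser is used only to read the DOCTYPE declaration: it reports the system identifier but does not fetch or apply the DTD. The function name is hypothetical.

```python
import xml.parsers.expat

# DTD referenced by a file name relative to the XML document.
LOCAL_DOC = """<?xml version="1.0"?>
<!DOCTYPE caveSurvey SYSTEM "caveSurvey.dtd">
<caveSurvey/>"""

# DTD referenced by a URL; the address is a placeholder, not a real location.
REMOTE_DOC = """<?xml version="1.0"?>
<!DOCTYPE caveSurvey SYSTEM "http://example.org/dtd/caveSurvey.dtd">
<caveSurvey/>"""


def doctype_info(xml_text):
    """Return (document type name, system identifier) from the DOCTYPE."""
    found = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["name"] = name
        found["system_id"] = system_id

    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found.get("name"), found.get("system_id")


print(doctype_info(LOCAL_DOC))
print(doctype_info(REMOTE_DOC))
```

Whether a remote DTD is actually retrieved depends on the XML reader and how it is configured, so a shared standard would do well to publish its DTD at a stable address.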
DTDs are much more flexible than this short description can convey, and in the process of developing a cave survey DTD the caving community will likely need to use more advanced capabilities of the DTD in order to accomplish its objectives.
What remains to be done?
So what next? How do we get there from here?
First, popular support is critical to the success of any form of community endeavor. Significant input from the community of persons interested in cave survey data is sought and will be shared on a range of topics as they arise. Further details to support these endeavors are available on the web site www.psc-cavers.org/xml, including examples, links to tutorials, tools and discussions.
Second, suggestions are needed from the community for the "things" they think a cave survey data file should include. This is a simple list that is not immediately concerned with how those "things" are ultimately arranged in XML format. The goal is to brainstorm those elements that are most critical to cave surveys and will result in a flexible standard to meet the community's needs.
Third, submissions for candidate arrangements of the data elements are needed. Those submissions could take the form of raw XML files, as used above, or DTD statements, which drive the design of the XML file. Within this phase it would be appropriate to discuss issues of style, formatting, abbreviations, use of attributes, etc.
XML is going to be everywhere (and indeed its use is already widespread). The XML standard has developed swiftly, and a large number of software vendors have been quick to adopt it. XML holds the promise of becoming the most common technique for all data storage and data transmission needs.
Storing cave survey data in XML makes it more valuable to cave science because it becomes future-proof. Surveyors can position themselves to develop increased capabilities in the future by contributing to a joint effort to develop the XML representation of cave survey data today. For further discussion of development of cave survey data standards in XML please visit the web site www.psc-cavers.org/xml and review the information it contains.
-Devin Kouts, November 26, 2000