ThML: Theological Markup Language

For the Christian Classics Ethereal Library

Harry Plantinga

Version 0.93, August 19, 1998

Preliminary version: still under development


This document describes the Theological Markup Language (ThML), a markup language for theological texts. ThML was developed for use in the Christian Classics Ethereal Library (CCEL), but it is hoped that the language will prove useful for other applications as well. Key design goals are that the language should (1) represent information needed for digital libraries and for theological study involving multiple, related texts, including cross-reference, synchronization, indexing, and scripture; (2) be based on XML and usable with World Wide Web tools; and (3) be easy to learn and use. ThML is defined as an extension of the HTML 4 Strict DTD.


The study of theology involves uses of texts that are infrequent in other areas of study. Theological books usually make many references to the Bible: quotations, commentary, explanations, citations, and the like. Special processing for scripture references can greatly aid study. Theological study also often involves ancient texts available in multiple variations or translations, which may have to be synchronized and displayed in parallel columns. It involves the use of cross-reference systems such as Strongs numbers, various sorts of indexes, and the synchronization of multiple texts in various ways, as for example layers of commentary on a text. Theological study often makes use of several texts related by subject or scripture reference; tools that support library-wide searching by subject or scripture reference are also useful.

Existing markup languages are not well suited for non-commercial theological texts. Word processor formats don't represent semantic information about a text--an area in which HTML is also weak. This has a number of drawbacks--for example, searching, indexing, and converting to other formats are more difficult. The Text Encoding Initiative (TEI) language is semantically rich for literary analysis but is neither easy to learn nor tuned for theological study. It doesn't offer special handling of scripture references or Strongs-like reference systems, for example. Also, the language is very large and the overhead required to learn and handle it is high. Commercial formats, including STEP and the Logos Library System (LLS), are not designed for integration with the World Wide Web, and preparing texts for these systems requires expensive software, beyond the means of most individuals. Publication in one of these formats may also be controlled by the company or consortium in question. As a result, few public-domain or on-line texts are available in these formats.

This paper describes the Theological Markup Language, or ThML, which is a markup language for theological texts designed for use in the Christian Classics Ethereal Library (CCEL), an experimental theological library on the Internet. ThML borrows some elements from TEI, and it is also designed to handle all of the semantic information in STEP-format documents (version 0.9). Electronic texts for the CCEL will be prepared in Microsoft Word, using custom software to convert to XML. These XML files may be used directly by XML-aware applications or converted to formats such as HTML webs, plain text files, PDF, and others.

Designing a Theological Markup Language

A language for theological study must handle the markup of text into headings, paragraphs, block quotes, emphasized text, and other basic structural elements that are common to all books and can be represented in a markup language such as HTML. Markup needs for theological study that go beyond these basics include the special handling of scripture references, numbering and synchronization schemes such as Strongs numbers, handling multiple versions or translations of the same text, handling footnotes, index entries, lexicons, and representing page breaks in the original text.

For the use of digital libraries, bibliographic data about the text should be represented. In fact, bibliographic data about two texts may have to be represented: the electronic text and the book from which it originated. Such a language should also represent subject classifications, edit history, and other relevant meta-data. If the language also represents scripture references, subject index entries, and names, it is possible to build indexes for individual books as well as library-wide indexes of these references.

The language should be rich enough to support conversion to other electronic formats that may be needed. And, especially now that it is possible to print books with very short run and one-off printing presses, the language should be rich enough to support high-quality typesetting of the book. Finally, the language should be easy to learn and use and able to be processed with inexpensive, widely available tools. To top off the list of desirable features, making the language extensible and programmable would enable users to address additional needs.

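Since the primary means of delivering digital libraries such as the CCEL are the World Wide Web and CD-ROM, the language should use web-based technology, namely XML and Unicode, and be usable with web browsers. Books may also be typeset. Therefore, the key design goals for ThML are those listed at the outset: representing the information needed for digital libraries and theological study, building on XML and World Wide Web tools, and being easy to learn and use.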

It is often considered good language design to use purely semantic markup and to restrict description of appearance to stylesheets as much as possible. This course gives several benefits: it makes documents more flexible, allowing different stylesheets for the web, typesetting, large-type editions, etc. It makes it possible to customize the presentation of the document for a particular nationality or group such as the vision-impaired. It forces the mark-up editor to think more about the semantics of the markup, rather than just trying to "make it look right" for a particular use.

This principle has been followed as much as possible. Still, it is important to note that these considerations all have to do with typesetting or presentation, a secondary concern for ThML. The raison d'être of ThML is to assist in the construction of a rich digital library and to assist in searching and synchronizing documents. This involves representing links, relationships between texts, names and dates, etc. For the purposes of a digital library, knowing that a word is a name, for example, is valuable for searching, even though it may not be typeset specially. Knowing that a phrase is italicized because it is emphasized or ironic is of little value for the purposes of the digital library. Thus, it is not always necessary to know the semantic reason for a change in appearance for the purposes of ThML. The choice was made in some cases to limit semantic markup to what was needed for the digital library and for theological study and to use appearance-oriented HTML markup elsewhere.

Theological Markup Language

Based on XML and HTML

Since the Theological Markup Language will be based on HTML and XML, it will support all of the markup of HTML, a rich linking language in XLink, and stylesheet support in XSL. HTML may be used for markup of emphasis, paragraphs, headings, lists, tables, block quotes, images and multimedia, scripts, etc. Links may make use of the extended pointer and link types associated with XML, and formatting will be specified in XSL. These facilities will be used wherever possible, to make the language easier to learn for those who already use HTML and easier to use with the World Wide Web.

The Theological Markup Language is defined as a set of extensions to the HTML 4 Strict DTD. The DTD is given in Appendix A. Since the ThML DTD includes HTML 4, it is technically an SGML language and not an XML language. However, it is intended that the extensions to HTML be compatible with XML. It may be possible to validate a ThML document with an SGML parser and then use it as well-formed XML.

Document Structure

ThML documents are contained in one global ThML element, and like HTML documents, they contain a head and a body section.



<ThML>

<ThML.head> </ThML.head>

<ThML.body> </ThML.body>

</ThML>


The ThML.body element contains all of the contents of the print edition of the book on which the text is based. The ThML.head element contains bibliographic and meta information about the text, both the electronic publication and the print edition upon which it is based, if any. The information in this section is not generally a part of the original book but taken from MARC records or added to document the publication of the electronic version. It may also include keywords, information about the program used to generate the text, etc.

Divisions of the Text

Structural divisions in the body of the text are marked with <divn> tags as in this example:

<div1 title="The Imitation of Christ">

<div2 type="Book" title="Admonitions on Things Internal" n="2">

<div3 type="Chapter" n="1" title="Of the Inward Life">

<div4 type="Section" n="I">

</div4> </div3> </div2> </div1>

<div1> is used for top-level parts of a text, including, for example, a title page, preface, table of contents, chapters, index, etc. Additional levels are used for lesser structural divisions of the document. These structural divisions show the structure of the original text, and they are also used to prepare a table of contents, to allow a text to be split into files, and to give access to a text by section. The optional title attribute is used in constructing a table of contents and may be used for running heads or other identification purposes. The optional type and n attributes may be used to specify the type and number of the division. If they are present, they will also be used to identify the section in the table of contents.

The <insertContents level="3" /> tag may be used to specify that a table of contents be inserted at that location. In the example above, the title attribute of the <div1> tag would be used as the title of the Table of Contents, and all of the <div1> and <div2> entries would be gathered and listed hierarchically. Each entry would be linked to the appropriate section.
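As a rough illustration of how a processor might gather entries for a table of contents, the following Python sketch collects the divisions down to a given level from the division example above, assuming the text is well-formed XML. The function name table_of_contents and the "type n: title" label format are assumptions made for this sketch; ThML does not prescribe them.

```python
import xml.etree.ElementTree as ET

# Fragment taken from the division example above.
SAMPLE = """
<div1 title="The Imitation of Christ">
<div2 type="Book" title="Admonitions on Things Internal" n="2">
<div3 type="Chapter" n="1" title="Of the Inward Life">
<div4 type="Section" n="I">
</div4> </div3> </div2> </div1>
"""

def table_of_contents(root, level):
    """Return (depth, label) pairs for div1..div<level>, in document order."""
    entries = []
    for elem in root.iter():
        # Division elements are named div1, div2, ...; the digits give the depth.
        if elem.tag.startswith("div") and elem.tag[3:].isdigit():
            depth = int(elem.tag[3:])
            if depth <= level:
                # Hypothetical label format: "type n: title", omitting missing parts.
                prefix = " ".join(p for p in (elem.get("type"), elem.get("n")) if p)
                title = elem.get("title", "")
                label = f"{prefix}: {title}" if prefix and title else prefix or title
                entries.append((depth, label))
    return entries

# Print the entries indented by depth.
for depth, label in table_of_contents(ET.fromstring(SAMPLE), 2):
    print("  " * (depth - 1) + label)
```

A real processor would also attach a link target to each entry; that step is left out here because it depends on how the text is split into files.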

Page Breaks

It is often useful to know the page breaks from the print edition of a book. They may be used as targets for subject index entries identified by page number or to display a text with the pagination of the print edition. Page breaks are marked by the insertion of <pb /> tags, with the n attribute giving the page number of the upcoming page (<pb n="37" /> or <pb n="xii"/>). These elements should appear at the start of the identified page. Many electronic texts will also have images of pages available on line. The pb element will also take an href attribute specifying a URI for an image of the page (<pb n="37" href="gif/0021a.gif" />).

Scripture References

In theological texts, scripture passages may be cited, quoted, or explained. Citations refer to a passage, whereas quotations include the text of the passage in the document. For citations, presentation software should provide an easy way to see the cited passage in a specified translation or a translation of choice -- perhaps with a hypertext link or a separate, synchronized window of notes and references. It is also helpful to provide an index of scripture quotations and citations for a text, and perhaps for a whole library.

Scripture references may occur in footnotes, in parentheses (Phil. 2:1-11), or in the text itself -- see Rom. 8:28. Context may be needed in order to interpret a reference -- see verse 29 and 10:8-13. Several passages may be stacked together in one citation (Matt. 5:44, 46; Luke 7:42; John 5:42, 13:35, 14:15, 23; 15:12-13; 21:15-16; 3 John 13). For marking scripture citations, ThML will use the <scripRef> element, as in this example:

<scripRef passage="Rom. 8:27,28; 10:8-13" version="NIV">Romans 8:27,28; 10:8-13</scripRef>

The version attribute specifies the translation or version, and the passage attribute is a list of scripture references separated by commas or semicolons. Each reference may consist of a book name (or abbreviation), a chapter, and a verse. The chapter and verse are separated by a colon or period. If the book name or chapter is missing, it is assumed to be the same as in the previous reference. If two references are separated by a dash, all of the intermediary verses are included as well. In the case of books with only one chapter, a reference consists of a book name or abbreviation and a verse. Book names should be as they appear in the version in use or a unique prefix of at least two letters of the name. Abbreviations that are not prefixes may also be accepted by programs that process ThML documents.
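To make the inheritance of book and chapter concrete, here is a minimal Python sketch that expands a passage attribute into individual references. It is only an illustration under simplifying assumptions: the name parse_passage and the tuple layout are invented here, chapter and verse are split on a colon or period as in the examples in this document, book names are a single word (optionally preceded by a digit), a bare number after a book name is treated as a verse of a single-chapter book, and ranges that cross a chapter boundary are not handled.

```python
import re

# One reference: optional book, optional "chapter:" or "chapter.", a verse,
# and an optional "-lastverse".
REF = re.compile(
    r"^\s*(?:(?P<book>(?:[1-3]\s+)?[A-Za-z]+)\.?\s+)?"
    r"(?:(?P<chapter>\d+)[:.])?"
    r"(?P<verse>\d+)(?:-(?P<end>\d+))?\s*$"
)

def parse_passage(passage):
    """Expand a passage attribute into (book, chapter, first, last) tuples."""
    refs = []
    book = chapter = None
    for part in re.split(r"[;,]", passage):
        m = REF.match(part)
        if m is None:
            raise ValueError(f"cannot parse reference: {part!r}")
        if m.group("book"):
            # A new book name starts a fresh context; chapter stays None
            # for single-chapter books.
            book, chapter = m.group("book"), None
        if m.group("chapter"):
            chapter = int(m.group("chapter"))
        first = int(m.group("verse"))
        last = int(m.group("end")) if m.group("end") else first
        # A missing book or chapter is inherited from the previous reference.
        refs.append((book, chapter, first, last))
    return refs

for ref in parse_passage("Rom. 8:27,28; 10:8-13"):
    print(ref)
```

Mapping abbreviations such as "Rom" to the book names of a particular version, and checking verse numbers against its versification, would be separate steps.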

Software for processing ThML texts will likely have a scripture parser incorporated that finds scripture citations and marks them appropriately, so that it will not be necessary to mark them all by hand. However, parsing text to find and identify scripture references involves several difficulties. One problem is that different translations of the Bible use different versification schemes. For example, Psalms 9 and 10 in the King James Version are combined into a single Psalm 9 in the Septuagint. In order to interpret a reference, the versification scheme used must be known. Scripture references will be assumed to be compatible with the versification scheme used by the KJV, ASV, NASB, NIV, and TLB unless otherwise specified.

Also, context is sometimes necessary in interpreting a reference. A passage may refer to Romans 8:28 at one point and later to verses 29 and 30 and chapter 10:8-13. A parser should be able to identify the context in most cases, but in some cases it may be necessary to set the context or turn the parser off. The <scripContext version="NIV" passage="Romans 8" /> element is used to set the default context for the parser, and the <scripParseOff /> and <scripParseOn /> elements may be used to turn the parser off or on, to prevent linking of a passage such as "Bob had 2 apples and John 3." The version attribute may be set in a scripContext element but it is never set by the parser.

In theological texts, scripture is also sometimes quoted. In this case, it is not desirable to link the reference to the scripture passage, but it may be desirable to incorporate the passage into a table of scripture references. Quotations of scripture may be marked with the <scripture> element. A passage may be represented as in this example:

<scripture passage="Mark 7:16" version="NKJV">If anyone has ears to hear, let them hear!</scripture>

This markup may also be used for a translation or version of a book or a whole Bible, perhaps as in the example below. Scripture marked in this way could be automatically retrieved by book, chapter, and verse with an appropriate program.

<scripContext version="Calvin's Translation, in English" />

<scripContext passage="Romans 8" />

<scripture passage="28">We further know, that to those who love God all things co-operate for good, even to those who are called according to <I>his</I> purpose:</scripture>

<scripture passage="29">for those whom he has foreknown, he has also predetermined to be conformed to the image of his Son, that he might be the first born among many brethren;</scripture>

Explanation or commentary on a passage involves a semantic relationship between the explanation and the passage explained. This relationship should be represented in the text in order to be able to build an index of scripture commentary. For example, it would be useful to be able to see everything the early church fathers said or preached on a passage. Commentary or explanation of a passage will be marked with a <scripCom> element, as in this example:

<scripCom passage="Mark 7:16">Mark 7:16. This admonition seems to apply to most everyone . . .</scripCom>

Cross Referencing Schemes and Synchronization

Cross referencing is the ability to find related passages in separate texts. Cross referencing in theological texts takes many forms. They include simple links such as those that can be handled by an HTML anchor; numeric or symbolic indexing schemes such as dates, Strongs numbers, scripture references, or subject index entries; annotation such as footnotes and commentary; different translations of the same text, etc. In this section we will define markup for handling symbolic cross-referencing schemes other than those that can be handled as ordinary links, scripture references, or annotation.

Standardized symbolic cross reference schemes such as dates, keywords, or Strongs numbers aren't really links to other documents, because any two documents with compatible cross reference schemes can be linked together and no particular documents are intended. Therefore XLL links, element IDs, etc. don't capture the semantics of such information. We will define a new sync element to represent this information. For example, the element

<sync type="Strongs" value="G42" />

might be used to represent a Strongs number at a location in a text.

Software tools may be provided to use this information in a variety of ways. For example, a program would be able to find other passages on related topics or create an index using the Strongs numbering. Multiple different manuscripts of the same original text could be aligned this way, and displayed in parallel columns, with appropriate software.

The scheme names given in the type attribute are not predefined; applications may invent new synchronization types for specific purposes. For example, the Rule of Benedict is available in several different manuscripts in two different traditions. If a common synchronization scheme were defined and manuscripts marked up, any two or more could be selected and aligned as parallel columns on the screen, or alternate forms of a passage could be located.
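The alignment idea can be sketched in a few lines of Python. This is only an illustration: the scheme name "RB", the sync values, the placeholder passage text, and the function names are all invented for the example, and it assumes that each text is well-formed XML and that the text belonging to a sync point follows it directly.

```python
import xml.etree.ElementTree as ET

# Placeholder texts; "RB" is a hypothetical synchronization scheme name.
TEXT_A = ('<p><sync type="RB" value="1.1" />first passage, version A '
          '<sync type="RB" value="1.2" />second passage, version A</p>')
TEXT_B = ('<p><sync type="RB" value="1.1" />first passage, version B '
          '<sync type="RB" value="1.2" />second passage, version B</p>')

def segments_by_sync(xml_text, sync_type):
    """Map each sync value of the given type to the text that follows it."""
    segments = {}
    for elem in ET.fromstring(xml_text).iter("sync"):
        if elem.get("type") == sync_type:
            # elem.tail is the text between this sync point and the next element.
            segments[elem.get("value")] = (elem.tail or "").strip()
    return segments

def align(left_xml, right_xml, sync_type):
    """Pair segments that share a sync value, in the order of the left text."""
    left = segments_by_sync(left_xml, sync_type)
    right = segments_by_sync(right_xml, sync_type)
    return [(value, text, right.get(value, "")) for value, text in left.items()]

# Each row could be displayed as one line of a parallel-column view.
for value, a, b in align(TEXT_A, TEXT_B, "RB"):
    print(f"{value:5} | {a:28} | {b}")
```

Because the rows are keyed only by scheme and value, neither text needs to know about the other, which is the property that distinguishes sync points from ordinary links.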


Footnotes or endnotes occur frequently in books and are not well supported in HTML. A common strategy for handling notes is to store them in a separate file, with links back and forth between the text and the notes. A drawback of this approach is that to see a footnote it is necessary to unload the current page and load a page of footnotes -- and reverse the process to get back to the main document. This process is slow, and little semantic information about the relationship between the text and the notes is stored.

In a ThML document, footnotes, endnotes, etc. are all marked with the <note> tag, following the syntax used by TEI Lite for the most part. The note element may take the following attributes: place, resp, target, targetEnd, and anchored. The place attribute specifies how the note appears in the text (e.g. end, foot, inline, or margin). The target (and targetEnd) attributes refer to the start (and end) of the text being annotated, if the note does not occur in the text at its reference point. These attributes allow the notes to be gathered at the end of a chapter or file if desired. The resp attribute identifies the person responsible for the note -- for example, the author, editor, or a person's initials. The anchored attribute may take the value yes (default) or no, specifying whether the note is anchored at an exact location; margin notes typically are not anchored.

The <note> element can also be used to store commentary, margin scrawls, and the like for a text in a separate file. In that case, the target and targetEnd attributes would be references to a point in another document.

Foreign Languages

The primary language for a document is specified in the header. Passages in other languages may be marked with the <foreign> element and the lang attribute. For example, a Greek passage may be marked as <foreign lang="el">logos</foreign>. Values of the lang attribute are the language codes specified in ISO 639. Some examples are Dutch: nl, English: en, French: fr, German: de, Greek: el, Hebrew: he, Latin: la, Spanish: es, Portuguese: pt, Russian: ru. Note that the lang attribute may be used with most elements.

If the language uses characters not available in the ISO-8859-1 (Latin-1) character set, they may be represented in Unicode (preferred), as in this Greek example (λογος) and this Hebrew example (הלהי), or using the Latin-1 character set and a suitable font, for example, <foreign lang="el" style="Font-family: SIL Galatia">logov</foreign>. The Greek and Hebrew fonts recommended for use with the CCEL are the freeware SIL Galatia and SIL Ezra fonts and related software from the Summer Institute of Linguistics, used here in a Greek example (logov) and a Hebrew example (hwhy).


Theological books often contain verse -- poetry, hymns, or versified presentation of material such as the Psalms. A stanza, verse, or other unit of verse is encoded in a <verse> element. Verse is often written with varying levels of line indentation. Lines are marked with <l>, <l2>, and <l3> elements, identifying relative levels of indentation. In the example below, the indentation is of course ignored by the XML parser, but it should be reproduced by the presentation software based on the <l>, <l2>, and <l3> tags.


<verse>
<l>O God, a world of empty show,</l>
  <l2>Dark wilds of restless, fruitless quest</l2>
<l>Lie round me wheresoe'er I go:</l>
    <l3>Within, with Thee, is rest.</l3>
</verse>

<verse>
<l>And sated with the weary sum</l>
  <l2>Of all men think, and hear, and see,</l2>
<l>O more than mother's heart, I come,</l>
    <l3>A tired child to Thee.</l3>
</verse>

<verse>
<l>Sweet childhood of eternal life!</l>
  <l2>Whilst troubled days and years go by,</l2>
<l>In stillness hushed from stir and strife,</l>
    <l3>Within Thine Arms I lie.</l3>
</verse>

<verse>
<l>Thine Arms, to whom I turn and cling</l>
  <l2>With thirsting soul that longs for Thee;</l2>
<l>As rain that makes the pastures sing,</l>
    <l3>Art Thou, my God, to me.</l3>
</verse>


<attr><name>G. Ter Steegen</name></attr>

Attributions, Citations, Dates, and Names

Attributions of authors of poetry, letters, etc. may be marked with the <attr> element. This might be rendered right-justified and italicized. Also, names may be marked with the <name> element. When they are thus marked, an index of names can be automatically constructed. A different representation of the name for the index may be specified with the title attribute: <name title="Ter Steegen, Gerhard">G. Ter Steegen</name>. (The title attribute may be used with any element; it is used for index entries and tables. In web-based presentation of a book, it may also be used as a "tool-tip".)

Citations of other works such as books or treatises may be marked with the <citation> element. That element may also take an href attribute to specify a URI for the cited work, if available. The <date> element may be used to mark dates that occur in the text. A value attribute may be used to specify the date in ISO format, as in this example: <date value="1997.12.25">last Christmas</date>. The insertIndex element (described in the next section) may be used to insert an index of works cited, dates, or personal names. In each of these cases, the intent is to assist searching and indexing. Searches for names can be restricted to names only, and an index of dates mentioned in a book can be constructed.

Index Entries and Indexes

Passages in the text may be marked for insertion into an index using the <index> element. For example, one might mark a passage for inclusion in a subject index this way:

<index type="subject" subject1="Christian Life" subject2="Sanctification" title="Apotheosis">Apotheosis (or Deification) is an ancient theological word commonly used in Eastern theology to describe the process by which a Christian becomes more like God . . . </index>.

The title attribute is used as the identifier of the entry in the index. If it is not present, the text inside the <index> element is used as a title. The type attribute is used to identify the index that this reference is to be added to; legal values include subject (the subject index for the book) and globalSubject (the library-wide subject index). Other values may be used for specialized indexes.

A document may use several user-selected types of index entries. An XML element (<insertIndex type="subject" />) is also provided to specify that a sorted, hierarchical index of all the "subject" (e.g.) index entries should be inserted at that point, with links to the appropriate locations in the text. Certain additional index types are also understood: <insertIndex type="name" /> inserts an index of all names marked with the <name> element; if the title attribute is present, it is used as the index entry. Similarly, indexes may be inserted for citations, dates, foreign words and phrases, images (<img>), names, scripture references (<scripRef>), and scripture commentary (<scripCom>).
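As an illustration of what <insertIndex type="name" /> asks of a processor, the following Python sketch collects the entries for a name index, preferring the title attribute when it is present. The function name name_index and the sample fragment are made up for the example, and the sketch assumes well-formed XML; linking each entry back to its locations in the text is omitted.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; the same name is marked twice to show de-duplication.
SAMPLE = """
<div1>
<p>Prepared by <name>Harry Plantinga</name>.</p>
<attr><name title="Ter Steegen, Gerhard">G. Ter Steegen</name></attr>
<attr><name title="Ter Steegen, Gerhard">G. Ter Steegen</name></attr>
</div1>
"""

def name_index(root):
    """Return the sorted, de-duplicated index entries for all <name> elements."""
    entries = set()
    for elem in root.iter("name"):
        text = "".join(elem.itertext()).strip()
        # The title attribute, if present, replaces the marked text as the entry.
        entries.add(elem.get("title") or text)
    return sorted(entries)

for entry in name_index(ET.fromstring(SAMPLE)):
    print(entry)
```

The same pattern would apply to the other automatically built indexes by changing the element name passed to iter.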

Terms, Definitions, and Glossaries

Some documents contain a glossary. Glossaries may be marked up with the <glossary> element and HTML <dl> (definition list), <dt> (term) and <dd> (definition) elements. The glossary may take an optional type attribute, to specify how the glossary may be used.




<glossary>

<dl>

<dt>Apotheosis</dt>

<dd>An ancient theological word used to describe the process by which a Christian becomes more like God</dd>

</dl>

</glossary>



Software tools will hopefully be provided for composing two documents (which may be the same), using the glossary in one and the text in another. Words of the text defined in the glossary could be footnoted, underlined and linked, or defined in a separate window.
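A tool of this kind might begin along the following lines. This Python sketch reads the terms from a glossary and reports which of them occur in a passage of text. It is an assumption-laden illustration: the function names are invented, the term "Apotheosis" in the sample is assumed from the index example earlier in this document, the glossary is assumed to be well-formed XML, and matching is by whole word without regard to case.

```python
import xml.etree.ElementTree as ET

# Sample glossary; the term name is assumed for illustration.
GLOSSARY = """
<glossary>
<dl>
<dt>Apotheosis</dt>
<dd>An ancient theological word used to describe the process by which a Christian becomes more like God</dd>
</dl>
</glossary>
"""

def glossary_terms(root):
    """Return a dict mapping each <dt> term to the <dd> text that follows it."""
    terms = {}
    for glossary in root.iter("glossary"):
        for dl in glossary.iter("dl"):
            term = None
            for child in dl:
                text = "".join(child.itertext()).strip()
                if child.tag == "dt":
                    term = text
                elif child.tag == "dd" and term is not None:
                    terms[term] = text
    return terms

def defined_words(text, terms):
    """List the glossary terms that occur as whole words in a passage."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    return sorted(t for t in terms if t.lower() in words)

terms = glossary_terms(ET.fromstring(GLOSSARY))
print(defined_words("The word apotheosis is used in Eastern theology.", terms))
```

Presentation software could then footnote, link, or display the definition for each word found.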

Additions and Deletions

The ThML.body section of the electronic text should have all of the contents of the print edition. However, for display purposes, it may be desirable to add material to the print edition or delete material from it. For example, it may be desirable to delete the original table of contents and replace it with one that is automatically generated. The <added> element is used to mark sections that have been added and do not appear in the print edition, and the <deleted> element is used to mark the sections that have been deleted and should not appear when the book is presented, even though they remain in the XML for those who want to see exactly how the print edition looked. For example, a table of contents might be marked this way:


<added>

<H1>Table of Contents</H1>

<insertContents level="2" />

</added>

<deleted>

[original table of contents here]

</deleted>


Header Information

The head section of a ThML etext has the most detailed (and least frequently used) markup. In a practical ThML software system, much of this information will be filled in by making entries in a form or template, including pasting the MARC record for the print source into the form. The head section may start with some HTML elements, such as <title> and <meta>. In addition, it has three optional sections that are unique to ThML: <generalInfo>, <electronicEdInfo> and <printSourceInfo>.

The <generalInfo> section contains information about the text that is not specific to the electronic edition or the print edition on which it is based. Whenever possible, its components are filled in from the information in the MARC record. It may contain these elements:


<author></author> These first fields are taken from the MARC record

<title> [Closing tags (omitted here for readability) are required]









<firstPublished> Date first published in any edition

<primaryLanguage> Primary language of the text

<otherLanguage> All other languages used in the text

<originalLanguage> Original language in which book was written

<copyrightComments> Added by producer of electronic edition -- for example, that a copyright renewal search was performed, with negative result

<description> Textual description of book, "blurb"

<pubHistory> Any available information on the publication history

<comments> Any other comments


The printSourceInfo section contains information specific to the print source from which the electronic text was derived, if there is one. The elements it may contain are these:



<pubLocation> Publisher location

<pubDate> Publication date





<frontImageURL> Electronic photograph of front of book, ~ 200 pixels wide

<spineImageURL> Electronic photograph of spine of book, ~ 200 pixels high

<copyLocation> E.g. Buswell library, 231.4 b29h c.2.

<sourceURLbase> Base URL of print source, e.g. page scans of book

<MARCformatted> Formatted text version of MARC record, as returned by Library of Congress' Z39.50 gateway

<MARCtagged> Machine readable version of MARC record

<MARCsource> Source from which MARC record was acquired



The electronicEdInfo section contains information specific to the electronic edition, such as publication information, editorial practices and status, etc. It may contain the following elements:


<URL> On-line location where text is published

<scanner> Person who scanned and OCRed the book

<typist> Person who typed the book, e.g. Kathy Sewell

<source> Other source for electronic text, e.g. Wiretap

<sourceURL> URL for source

<proofreader> People who proofread the text

<markup> Person who applied ThML markup

<editorialComments> Comments about editorial practices: whether spelling was normalized, what was done with end-of-line hyphens, corrections that were made, tagging practices, etc.

<revisionHistory> A list of published editions and changes between them

<status> Current status of text -- e.g. This text still needs proofreading

<publisherID> Publisher code of electronic edition, as assigned by the CCEL, e.g. ccel

<pubDate> Date of publication, YYYY-MM-DD format

<authorID> Author ID as assigned by publisher

<bookID> Book ID as assigned by publisher

<copyright> Copyright statement for electronic edition

<version> Edition or version of electronic edition, e.g. 1.1

<ISBN> ISBN of electronic edition, if available

<MARCtagged> MARC record for electronic edition

<comments> Other comments


Alphabetical List of ThML Elements in Body

The following list contains the special XML elements that may be used for ThML markup in the body section of a text. Elements that occur in HTML may also be used, but they are not listed here.





<added> Text added to print edition

<added reason="Automatic TOC" resp="whp"><insertContents level="2" /></added>


<attr> Attribution, e.g. poem author

<attr>G. Ter Steegen</attr>


<citation> Reference to another work

<citation title="Imitatio Christi. English." href="/k/kempis/imitation/">Imitation of Christ</citation>


<date> Any date or time

<date value="1997.12.25">last Christmas</date>


<deleted> Text from print edition that should not be displayed

<deleted reason="Replaced with automatic TOC"><H1>Table of Contents</H1></deleted>


<divn> Major divisions in text

<div2 type="Chapter" n="I" title="Of the Inward Life"></div2>


<index> Index entry

<index type="subject" subject1="Christian Life" subject2="Sanctification" title="Apotheosis">The word apotheosis . . . </index>


<insertContents> Insert table of contents here

<insertContents level="2" />


<insertIndex> Insert index here

<insertIndex type="foreign" />


<foreign> Foreign language passage

<foreign lang="he" dir="rtl">yhwh</foreign>


<glossary> Mark a glossary

<glossary type="lexicon"><dl><dt></dt><dd></dd></dl></glossary>


<l> Line of verse

<l>O God, a world of empty show,</l>


<l2> Line of verse (indented)

<l2>Dark wilds of restless, fruitless quest</l2>


<l3> Line of verse (more indented)

<l3>Within, with Thee, is rest.</l3>


<name> A person's name

<name title="Ter Steegen, Gerhard">G. Ter Steegen</name>


<note> Footnotes, endnotes, etc.

<note place="foot" resp="editor" target="#p1" targetEnd="#p2"></note>


<pb> Page break in print edition

<pb n="37" href="page37.gif" />


<scripCom> Commentary on scripture

<scripCom passage="Rom. 8:28" version="LXX"></scripCom>


<scripContext> Set scripture context for parser

<scripContext passage="Romans 8" version="NRSV" />


<scripParseOff> Turn scripture ref parser off

<scripParseOff />


<scripParseOn> Turn scripture ref parser on

<scripParseOn />


<scripRef> Scripture reference

<scripRef passage="Rom. 8.28" version="NRSV"></scripRef>


<scripture> Scripture passage

<scripture passage="Rom. 8:28" version="NIV"></scripture>


Chapter or section info



<sync> Synchronization point

<sync type="Strongs" value="G42" />


<verse> Poetry, verse


Table 1: ThML Body Elements

Using ThML in the CCEL

Software for processing and displaying ThML documents for the CCEL is being designed. The plan is that documents will be prepared in Microsoft Word, using paragraph styles and embedded XML codes. These will be entered with the assistance of macros and toolbar buttons. The MARC record for the print source will also be pasted into the document. Another paper (Theological Markup Language in Microsoft Word) describes the use of Microsoft Word for entering ThML.

The Word document will then be converted to XML format, which will be the "base format" for the text. A tool may be provided for converting the XML form back into a Microsoft Word document for further editing. (Microsoft has also said that they want to make the Office Suite the best environment for working with XML documents, so these conversion programs may not be necessary in the future.) This XML document will be converted by program to other desired formats, such as a collection of linked HTML files, plain text, PDF, and others.
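
The conversion step described above can be pictured with a small Python sketch that turns a ThML fragment into plain text. It is only a sketch under stated assumptions: the sample fragment, the function name to_plain_text, and the indent widths for l, l2, and l3 are hypothetical, and each run of text is simply emitted on its own line. The only behavior taken from this document is that deleted marks text that should not be displayed and that l2 and l3 are progressively indented lines of verse.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the examples in Table 1.
SAMPLE = """
<div1 title="Hymn">
  <deleted reason="Replaced with automatic TOC"><h1>Table of Contents</h1></deleted>
  <p><attr>G. Ter Steegen</attr></p>
  <verse>
    <l>O God, a world of empty show,</l>
    <l2>Dark wilds of restless, fruitless quest</l2>
    <l3>Within, with Thee, is rest.</l3>
  </verse>
</div1>
"""

# Assumed indent widths (in spaces) for the three verse-line elements.
INDENT = {"l": 0, "l2": 4, "l3": 8}


def to_plain_text(elem, out=None):
    """Append one output line per run of text; return the list of lines."""
    if out is None:
        out = []
    if elem.tag == "deleted":
        return out  # print-edition text that should not be displayed
    if elem.tag in INDENT:
        # A line of verse: collapse whitespace and apply the indent.
        text = " ".join("".join(elem.itertext()).split())
        out.append(" " * INDENT[elem.tag] + text)
        return out
    if elem.text and elem.text.strip():
        out.append(" ".join(elem.text.split()))
    for child in elem:
        to_plain_text(child, out)
        # Text following a child element belongs to the parent.
        if child.tail and child.tail.strip():
            out.append(" ".join(child.tail.split()))
    return out


lines = to_plain_text(ET.fromstring(SAMPLE))
print("\n".join(lines))
```

A converter for linked HTML files or PDF would follow the same shape, walking the XML tree once and choosing an output form per element.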


Theological study requires texts with relatively rich markup, and its needs differ from those of other applications. The Theological Markup Language has been designed to address these needs powerfully and without too much complexity. ThML was designed to be a rich enough representation to support powerful indexing and user interface features and to allow conversion into other popular formats without loss. It is hoped that the language can also be learned without extraordinary effort. ThML is a fundamental element of the Christian Classics Ethereal Library system, and it will make the library far more functional and useful.
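
The indexing support mentioned above can be illustrated with a short Python sketch. According to the comments in the DTD in Appendix A, an index is built from the elements matching its type, using each element's title attribute if present and the element's text otherwise. The sample fragment, the function names entry_for and build_index, and the choice to sort and de-duplicate entries are assumptions made for illustration, not a description of the CCEL software.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the index and first name example come from Table 1.
SAMPLE = """
<div1 title="Sample">
  <p><index type="subject" subject1="Christian Life" subject2="Sanctification"
        title="Apotheosis">The word apotheosis . . .</index></p>
  <p>A hymn by <name title="Ter Steegen, Gerhard">G. Ter Steegen</name>,
     quoted by <name>Thomas a Kempis</name>.</p>
</div1>
"""

# Element names that the DTD comment says insertIndex also understands.
ELEMENT_TYPES = {"citation", "date", "foreign", "img", "name",
                 "scripture", "scripCom", "scripRef"}


def entry_for(elem):
    """Use the title attribute if present, else the element's text."""
    title = elem.get("title")
    if title:
        return title
    return " ".join("".join(elem.itertext()).split())


def build_index(root, index_type):
    """Collect entries for a hypothetical <insertIndex type=index_type />.

    Sorting and de-duplication are assumptions of this sketch.
    """
    entries = set()
    for elem in root.iter():
        if elem.tag == "index" and elem.get("type") == index_type:
            entries.add(entry_for(elem))
        elif index_type in ELEMENT_TYPES and elem.tag == index_type:
            entries.add(entry_for(elem))
    return sorted(entries, key=str.lower)


root = ET.fromstring(SAMPLE)
print(build_index(root, "name"))
print(build_index(root, "subject"))
```

Here the first name is indexed under its title attribute, while the second, which has no title, is indexed under its own text.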


Appendix A: ThML DTD v. 0.93


ThML DTD, v. 0.93, Harry Plantinga, 1998.08.14. This DTD for ThML is

essentially a "request for comments": your comments would be

appreciated; changes are likely.

ThML is defined as an add-on to HTML 4.0 Strict. The intent is

that the added elements will be XML compatible. It may be

possible to validate the document with an SGML parser, then use

it as well-formed XML added to HTML.


<!ENTITY % html4 system

"" >%html4;

<!--====================== ThML Content Models ========================-->

<!-- These elements will be added to HTML inline elements -->

<!ENTITY % scrip "scripture | scripCom | scripRef | scripContext |

scripParseOn | scripParseOff">

<!ENTITY % info "attr | citation | date | index | insertContents |

insertIndex | foreign | name | note | pb | sync">

<!-- Revised HTML content models with some additions for inline and block

These add on to the html4 inline and block entities:

inline: b, i, a, img, br, q, sub, sup, etc.

block: h1, ul, ol, dl, dir, blockquote, form, hr, table, object, etc.

The standard html4 list of attributes is given by the attrs entity:

attrs: id, class, style, title, lang, dir, [events] -->

<!ENTITY % ThML.inline "%inline; | %scrip; | %info;" >

<!ENTITY % ThML.block "%block; | added | deleted | glossary |

sectionInfo | verse" >

<!--====================== ThML Inline Elements =======================-->

<!ELEMENT citation - - (%ThML.inline;) -- a reference to another work-->

<!ATTLIST citation


href %URI; #IMPLIED -- reference to cited work -->

<!ELEMENT date - - (%ThML.inline;) -- a date -->

<!ATTLIST date


value %Datetime; #IMPLIED -- ISO format, e.g. 1998.08.14 -->

<!ELEMENT index - - (%ThML.inline;)>

<!ATTLIST index --specify that an index entry be made for this text--


type CDATA #REQUIRED -- subject, globalSubject, or your own--




subject4 CDATA #IMPLIED >

<!-- The insertIndex element is used to insert an index at a point in the

text. The index is built from the corresponding elements in the text,

using the title attribute (if present) or the text of the element as

the index entry. It understands types matching the type of the index

elements. It also understands types citation, date, foreign, img, name, scripture, scripCom, scripRef. -->

<!ELEMENT insertIndex - O EMPTY>

<!ATTLIST insertIndex


<!-- The insertContents element inserts a table of contents built from <divn>

elements, using their titles as entries in the TOC. -->

<!ELEMENT insertContents - O EMPTY>

<!ATTLIST insertContents


<!ELEMENT foreign - - (%ThML.inline;) -- foreign-language passages -->

<!ATTLIST foreign

lang %LanguageCode; #REQUIRED --like attrs but lang required.--

dir (ltr|rtl) #IMPLIED


%events; >

<!ELEMENT (attr | name) - - (%ThML.inline;) --attributions, personal names-->

<!ATTLIST (attr | name) --if name has title attribute,--

%attrs; --it will be used in index -->

<!--The note element is used for footnotes, endnotes, marginal notes, etc.-->

<!ELEMENT note - - (%ThML.block;) --may want paragraphs, etc. in notes-->

<!ATTLIST note


place (foot | end | inline | margin | interlinear) foot

resp CDATA #IMPLIED -- responsible person, e.g. editor --

target IDREF #IMPLIED -- start of annotated text, if elsewhere --

targetEnd IDREF #IMPLIED -- end of annotated section, if a span --

anchored (yes|no) yes -- yes (default) if reference is to a specific

location; no for marginal notes, etc. -->


<!ATTLIST pb -- pb marks page break in print edition --


n CDATA #IMPLIED -- number of next page --

href %URI; #IMPLIED -- URI of image of page -->


<!ATTLIST sync


type CDATA #REQUIRED -- e.g. strongs, date --

value CDATA #REQUIRED -- key to sync to -->

<!--=================== Scripture-related Elements ====================-->

<!ENTITY % scripturePassage "CDATA" -- e.g. Rom. 8:28-30, 9.10-11, Rev 2-->

<!ENTITY % scriptureVersion "CDATA" -- e.g. KJV, NIV, Vulgate -->

<!ELEMENT scripture - - (%ThML.inline;) -- passage of scripture -->

<!ELEMENT scripCom - - (%ThML.inline;) -- commentary on scripture -->

<!ELEMENT scripRef - - (%ThML.inline;) -- reference to scripture -->

<!ELEMENT scripContext - O EMPTY -- set scripture context -->

<!ATTLIST (scripture | scripCom | scripRef | scripContext)


passage %scripturePassage; #IMPLIED

version %scriptureVersion; #IMPLIED >

<!ELEMENT scripParseOn - O EMPTY -- turn parser on -->

<!ELEMENT scripParseOff - O EMPTY -- turn parser off -->

<!--====================== ThML Block Elements =======================-->

<!ELEMENT added - - (%ThML.block;) -- text not in print edition -->

<!ELEMENT deleted - - (%ThML.block;) -- text in print v. not to be shown-->

<!ATTLIST (added | deleted)


reason CDATA #IMPLIED -- reason for addition or deletion --

date %Datetime; #IMPLIED -- date/time of change, 1998.08.14 -->

<!ELEMENT glossary - - (DT,DD+)+ -- definitions that may be linked-->

<!ATTLIST glossary


type CDATA #IMPLIED -- type of glossary -->

<!ELEMENT verse - - (l | l2 | l3)+ -- a verse of poetry -->

<!ATTLIST verse

%attrs; >

<!ELEMENT (l | l2 | l3) - - (%ThML.inline;) -- lines of poetry -->

<!ATTLIST (l | l2 | l3)

%attrs; >

<!ELEMENT sectionInfo - - (%ThML.inline;) -- Intro info on a section -->

<!ATTLIST sectionInfo -- such as a chapter precis --

%attrs; >

<!--========================= ThML Divisions ==========================-->

<!-- The outer structure of the ThML.body is a set of nested divisions.-->

<!ELEMENT div1 - - ( div2+ | %ThML.block; )>

<!ELEMENT div2 - - ( div3+ | %ThML.block; )>

<!ELEMENT div3 - - ( div4+ | %ThML.block; )>

<!ELEMENT div4 - - ( div5+ | %ThML.block; )>

<!ELEMENT div5 - - ( div6+ | %ThML.block; )>

<!ELEMENT div6 - - ( div7+ | %ThML.block; )>

<!ELEMENT div7 - - ( %ThML.block; )>

<!ATTLIST ( div1|div2|div3|div4|div5|div6|div7 )


type CDATA #IMPLIED -- e.g. Chapter, Section, Book --

n CDATA #IMPLIED -- number of Chapter, etc., used in --

-- table of contents, etc. -->

<!--====================== ThML Header Elements =======================-->

<!-- For a description of these elements please see the ThML definition

and document template. They must be in proper order, as they are

in the template. -->

<!ELEMENT generalInfo - - (author*, title, uniformTitle?, editor*, translator*, edition?, notes?, LCNumber?, DeweyNumber?, subjects?, firstPublished?, primaryLanguage?, otherLanguage*, originalLanguage, copyrightComments?, description?, pubHistory?, comments?)>

<!ELEMENT printSourceInfo - - (publisher?, pubLocation?, pubDate?, copyright?, seriesName?, volume?, ISBN?, frontImageURL*, spineImageURL*, copyLocation*, sourceURLbase?, MARCformatted, MARCtagged?, MARCsource, comments?)>

<!ELEMENT electronicEdInfo - - (URL*, scanner*, typist*, source*, sourceURL*, proofreader*, markup*, editorialComments?, revisionHistory?, status?, publisherID?, pubDate?, authorID+, bookID, copyright?, version, ISBN?, MARCtagged?, comments?)>

<!ELEMENT (author | editor | translator | edition | notes | LCNumber | DeweyNumber | subjects | firstPublished | primaryLanguage | otherLanguage | originalLanguage | copyrightComments | description | pubHistory | comments | publisher | pubLocation | pubDate | copyright | seriesName | volume | ISBN | frontImageURL | spineImageURL | copyLocation | sourceURLbase | MARCtagged | MARCformatted | MARCsource | URL | scanner | typist | source | sourceURL | proofreader | markup | editorialComments | revisionHistory | status | publisherID | authorID | bookID | version) - - (#PCDATA)>

<!--=================== Structure of ThML Documents ====================-->


A ThML document might look like this:















<!ELEMENT ThML.head - - (TITLE & BASE? & generalInfo & printSourceInfo?

& electronicEdInfo) -- TITLE, BASE from html4 -->

<!ELEMENT ThML.body - - (div1+) >

<!ELEMENT ThML - - (ThML.head, ThML.body) -- document root element -->


%i18n; -- lang, dir -- >

This document (last modified August 19, 1998) from