ThML Markup in Microsoft Word

For the Christian Classics Ethereal Library

Version 0.96, September 16, 1998

Harry Plantinga

Abstract

This paper describes the markup that is used to prepare texts for the Christian Classics Ethereal Library (CCEL) in Microsoft Word. The markup consists of certain paragraph and character styles and XML tags used for specified purposes. A template contains the styles that are used and a header that may be filled out for the bibliographic information of the head section. A program will then convert from ThML-formatted Word documents to XML.

Introduction

The Theological Markup Language (ThML) is an XML-based markup language with support for information often used in theological study, such as scripture references and commentary, synchronization of multiple related texts, and indexing systems such as Strong's numbers. Another design goal was that the language represent all of the information about a text needed for use in a rich digital library, as well as the information represented in other common formats used for theological etexts. Documents marked up with ThML contain enough information that subjects, citations, dates, names, scripture references, and other information can be searched or tabulated by computer. It will be possible, for example, to build global subject and scripture reference indexes for a whole library.

More information on the design of the language, as well as a full definition, is available in the paper ThML: Theological Markup Language for the Christian Classics Ethereal Library. This document describes the guidelines for formatting ThML documents in Microsoft Word, using the ThML template, which contains styles, macros, a toolbar, a menu, and a header template.

Preparing Etexts for the CCEL in Microsoft Word

To prepare a text for the CCEL with Microsoft Word, the first step is to get the ThML Template, ThML096.doc, and put it in the Templates folder, inside the Microsoft Office folder. The template can be downloaded from the ThML web page, http://ccel.wheaton.edu/ThML. Once it is installed, you can create a new ThML document by choosing New from the File menu and selecting the ThML template. You can also attach the template to an existing document with the Templates and Add-Ins item on the Tools menu. In either case, the resulting document will have ThML styles and other resources available. The document is then typed or scanned, if necessary, and formatted with appropriate styles and markup codes as described below. Footnotes may be entered as ordinary footnotes in Word, using the Insert | Footnote… menu item or the Insert-Footnote-Now shortcut, Alt+Ctrl+F. Tables and images may also be inserted normally.

If Microsoft Word is not available, the document may be entered in any word processor, and the special formatting in Word can be left for someone else. It is still helpful to format for the CCEL as much as possible, though. This would include using appropriate font sizes and styles, paragraph indentation, and Greek/Hebrew fonts if needed, and inserting footnotes by the method the word processor supports. There should be carriage returns only at the end of paragraphs, not lines, and blank lines should not be added between paragraphs except where there is extra space in the text. Some of the XML codes described below could also be entered during data entry, if desired: most importantly, the codes for page breaks, divisions, and notes.

Paragraph and Character Styles

Much of the formatting in Word is done by applying character and paragraph styles to the document. Paragraph styles are named groupings of formatting settings for paragraphs, such as single spacing, an indented first line, or 11-point Times New Roman. A paragraph style can be applied to a paragraph by selecting it from the left-most drop-down box on the formatting toolbar. The ThML template provides several paragraph styles that should be used for formatting documents, such as P_First, Heading 1, Verse, BlockQuote, and others.

Character styles are similar to paragraph styles, except that they only contain character formatting and they may be used within a paragraph style. The character styles used for ThML are "XML", "Comment", "Citation", "Name", "Unclear", and "Default". Keyboard shortcuts are provided for common paragraph and character styles. A complete list of styles and uses is given in Table 1:

Style Name	Shortcut Keys	Description
Attribution	ctrl-alt-a	Author of preface, letter, etc.
BlockQuote	ctrl-alt-b	Extended quotation
Citation (character)	ctrl-alt-t	References to other works
Comment (character)	ctrl-alt-c	Comment -- ignored
Default (character)	ctrl-alt-d	Default paragraph font
Definition	ctrl-alt-j	Definition of term (see Term)
HeaderInfo		ThML.head section of document
Heading 1	ctrl-alt-1	Level-1 heading
Heading 2	ctrl-alt-2	Level-2 heading
Heading 3	ctrl-alt-3	Level-3 heading
Heading 4	ctrl-alt-4	Level-4 heading
HR	ctrl-alt-h	Horizontal rule
HR30		Horizontal rule, 30% of page
List, List 2, etc.	ctrl-alt-l	List -- no bullet or number
List Bullet, 2, etc.		Bullet list
List Number, 2, etc.		Numbered list
List Continue		New paragraph of list element
Name (character)	ctrl-alt-n	A person's name
P_Continue	ctrl-alt-p	Paragraph
P_First	ctrl-alt-r	First paragraph of a section
P_Resume		Continuation of a paragraph after table, etc.
Preformatted	ctrl-alt-w	Preformatted, monospace font
SectionInfo	ctrl-alt-i	Info after section title
Term	ctrl-alt-t	Term to be defined (see Definition)
Unclear	ctrl-alt-u	Text that appears to have an error
Verse, Verse 2, Verse 3	ctrl-alt-v	Lines of poetry, verse, etc.
XML (character)	ctrl-alt-x	XML (or HTML) markup

Table 1: Paragraph Styles and Uses

XML and HTML Markup

When markup requires attributes (e.g. lang="el"), paragraph styles are not sufficient, and XML tags are used. The markup may consist of opening and closing tags with attributes, surrounding some text, as for example <foreign lang="el">logos</foreign>. The opening and closing tags and the contained text are called an "element." The markup may also consist of an empty element, identified by a trailing /, such as <pb n="37" />. These tags are represented in a Word document with the XML character style, appearing as red Courier New text.

Document Structure

ThML documents have a header section, with information about the document, and a body section, containing the document itself. When a new ThML document is created, a template for the head section appears. As much of the template as possible can be filled in. If possible, the MARC record should be retrieved from a source such as the Library of Congress gateway (http://lcweb.loc.gov/z3950/gateway.html), in tagged and formatted form, and inserted into the header at the appropriate spot. The information in the MARC record can then be pasted into other sections of the header. Any parts of the header that can't be filled out may be left for later editors.

Body

The body of the document, placed between the <ThML.body> and </ThML.body> tags of the template, should contain all of the text in the print edition of the book. It should be made to look as similar to the book as possible using the ThML template. In fact, if desired, the ThML styles may be modified to make the document look more like the book, though style names shouldn't be changed and styles other than those in the template should not be used. However, it is not necessary to retain the line breaks or unambiguous end-of-line hyphens of the print edition. Page breaks should be noted only with the <pb> element, described below.

Headings and Divisions

Headings for the preface, table of contents, and index, chapter titles, section heads, and the like should all be formatted using the styles Heading 1, Heading 2, Heading 3, or Heading 4. These styles can also be applied with ctrl-alt-1, etc. and viewed or modified in the outline view of a document.

XML divn tags are used to mark the structural sections of a document. These may often match the Heading paragraphs, but they offer more control. They are used for building a table of contents and also to specify points at which an electronic text may be split into pages or chunks for easy viewing. For example, the header for this section might be marked up in this way:

<div4>

Headings and Divisions

…</div4>

<div1> is used for top level divisions, such as the title page, preface, table of contents, and chapter titles. <div2> is used for lower-level divisions, and so on. The optional title attribute is used in constructing a table of contents and may be used for running heads or other identification purposes. The optional type and n attributes may be used to specify the type and number of the division. If they are present, they will also be used to identify the section in the table of contents. If the n attribute is present for one divn element but absent for the next divn element at the same level, it will be added in the XML conversion process by incrementing the previous divn element's n attribute.

The <insertContents level="2" /> tag may be used to insert a table of contents at a particular location in the text. The level attribute specifies the depth of the table of contents. In the example above, all of the <div1> and <div2> entries would be gathered and listed in a hierarchical list. Each entry would be linked to the appropriate section.
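
The gathering step can be sketched in Python. This is only an illustration under stated assumptions, not the CCEL conversion software: the function name build_contents, the sample body, and the "type n. title" label format are hypothetical, and the links to the sections are left out.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def build_contents(body: ET.Element, level: int) -> List[Tuple[int, str]]:
    """Return (depth, label) pairs for div1..div<level>, in document order."""
    entries = []
    for elem in body.iter():
        tag = elem.tag
        if tag.startswith("div") and tag[3:].isdigit():
            depth = int(tag[3:])
            if depth <= level:
                # Assumed label format: "type n. title", using whatever is present.
                prefix = " ".join(p for p in (elem.get("type"), elem.get("n")) if p)
                title = elem.get("title", "")
                label = f"{prefix}. {title}" if prefix and title else (prefix or title)
                entries.append((depth, label))
    return entries


# Hypothetical sample body, used only to exercise the sketch.
SAMPLE = """<ThML.body>
<div1 title="Preface"></div1>
<div1 type="Book" n="1" title="Admonitions">
  <div2 type="Chapter" n="I" title="Of the Inward Life">
    <div3 title="A subsection"></div3>
  </div2>
</div1>
</ThML.body>"""

toc = build_contents(ET.fromstring(SAMPLE), level=2)
for depth, label in toc:
    print("  " * (depth - 1) + label)
```

With level=2 the sketch lists the <div1> and <div2> entries as an indented outline and skips the <div3> entry.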

To replace the existing table of contents with the new one, the <added> and <deleted> elements may be used. <added> elements represent text that is not present in the print edition, and <deleted> elements represent sections that are in the print edition but should not be displayed in the electronic edition. They may be used as in this example:

<added>

<insertContents level="2" />

</added>

<deleted>

[original table of contents here]

</deleted>

Page Breaks

It is often useful to know the page breaks from the print edition of a book. They may be used as targets for subject index entries that identify the page of the entry or to display a text with the pagination of the print edition. Page breaks are marked by the insertion of <pb /> tags, with the n attribute giving the page number of the upcoming page (<pb n="37" /> or <pb n="xii" />). These elements should appear at the start of the identified page. So that it is not necessary to add the n attribute for every page, it will be added by the ThML conversion software by auto-incrementing the previous pb element with an n attribute, if available.
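
The auto-increment rule might look like the following Python sketch. The function names are hypothetical, and the treatment of Roman numerals is an assumption; the text above says only that a missing n is derived from the previous pb element that has one.

```python
from typing import List, Optional

_ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
          (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
          (5, "v"), (4, "iv"), (1, "i")]


def to_roman(value: int) -> str:
    """Convert a positive integer to a lowercase Roman numeral."""
    out = []
    for amount, symbol in _ROMAN:
        while value >= amount:
            out.append(symbol)
            value -= amount
    return "".join(out)


def from_roman(text: str) -> int:
    """Convert a Roman numeral (either case) to an integer."""
    total, rest = 0, text.lower()
    for amount, symbol in _ROMAN:
        while rest.startswith(symbol):
            total += amount
            rest = rest[len(symbol):]
    if rest or not text:
        raise ValueError(f"not a Roman numeral: {text!r}")
    return total


def next_page(label: str) -> str:
    """Return the page label that follows the given one."""
    if label.isdigit():
        return str(int(label) + 1)
    roman = to_roman(from_roman(label) + 1)
    return roman.upper() if label.isupper() else roman


def fill_page_numbers(labels: List[Optional[str]]) -> List[Optional[str]]:
    """Fill each missing n (None) by incrementing the previous known value."""
    filled: List[Optional[str]] = []
    previous: Optional[str] = None
    for label in labels:
        if label is None and previous is not None:
            label = next_page(previous)
        filled.append(label)
        if label is not None:
            previous = label
    return filled


pages = fill_page_numbers(["xii", None, None, "1", None, "37", None])
print(pages)
```

A pb element that comes before any numbered one is left without a value, since there is nothing to increment.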

In many cases images of pages will also be available on line. The pb element also takes an href attribute specifying a URI for an image of the page (e.g. <pb n="37" href="gif/0021a.gif" />). These will also be added in the conversion process by auto-incrementing previous values if necessary. The increment algorithm will be derived heuristically, looking at the first three pb elements. It should be able at least to handle filenames ending in sequences of numbers such as these: (001.gif, 002.gif, 003.gif, …), (001a.gif, 001b.gif, 002a.gif, …), or (001a.gif, 002a.gif, 003a.gif, …) changing at every <pb/> element, or at every other (in case two pages are on each scanned image).
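
The increment algorithm itself is not specified here, so the following Python sketch shows just one possible heuristic for the filename patterns listed above. All names are hypothetical. It assumes that the first three pb elements carry an href, that a repeated href among them means two pages share an image, and that a letter suffix that varies among them cycles from its smallest to its largest observed value.

```python
import re
from typing import Callable, List, Optional

_NAME = re.compile(r"^(?P<prefix>.*?)(?P<num>\d+)(?P<letter>[a-z]?)(?P<ext>\.\w+)$")


def split_name(href: str):
    """Split 'gif/0021a.gif' into ('gif/', '0021', 'a', '.gif')."""
    m = _NAME.match(href)
    if m is None:
        raise ValueError(f"unrecognized image name: {href!r}")
    return m["prefix"], m["num"], m["letter"], m["ext"]


def infer_successor(samples: List[str]) -> Callable[[str], str]:
    """Guess how image names advance from a few sample names."""
    letters = [split_name(s)[2] for s in samples]
    letters_cycle = len(set(letters)) > 1      # e.g. a, b, a
    first_letter, last_letter = min(letters), max(letters)

    def successor(href: str) -> str:
        prefix, num, letter, ext = split_name(href)
        if letters_cycle and letter < last_letter:
            return f"{prefix}{num}{chr(ord(letter) + 1)}{ext}"
        restart = first_letter if letters_cycle else letter
        return f"{prefix}{int(num) + 1:0{len(num)}d}{restart}{ext}"

    return successor


def fill_hrefs(hrefs: List[Optional[str]]) -> List[Optional[str]]:
    """Fill missing hrefs (None) using a pattern guessed from the first three."""
    seed = hrefs[:3]
    if len(seed) < 3 or None in seed:
        return list(hrefs)                     # not enough information to guess
    per_image = 2 if len(set(seed)) < 3 else 1
    successor = infer_successor(seed)
    filled: List[str] = []
    for href in hrefs:
        if href is None:
            previous = filled[-1]
            run = 1                            # pages in a row using this image
            while run < len(filled) and filled[-1 - run] == previous:
                run += 1
            href = previous if run < per_image else successor(previous)
        filled.append(href)
    return filled


print(fill_hrefs(["001a.gif", "001b.gif", "002a.gif", None, None]))
```

A real converter would need to cope with more naming schemes than this; the sketch covers only the cases named in the paragraph above.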

Paragraphs

The first paragraph of a section, by default rendered without an indented first line, may be formatted with the P_First paragraph style. Additional paragraphs in the section may be formatted with P_Continue. Paragraphs that resume after a figure or table should be formatted with P_Resume. These paragraph styles may be modified to make them similar to the styles used in the print edition of the book.

In some books a chapter or section title is followed by a short quotation, summary, or scripture reference. These can be formatted with the SectionInfo paragraph style.

Block Quotes

The BlockQuote paragraph style should be used for extended quotations. A BlockQuote paragraph is by default indented on both sides, with some extra space before and after.

Notes

Footnotes may be entered as normal footnotes in Word, and they will be converted to XML notation in the Word to XML conversion process. However, it may at times be preferable to enter notes using the XML notation directly, in order to take advantage of the greater flexibility offered, or because the word processor in use doesn't support footnotes.

The XML notation for notes uses the <note> element, following the syntax used by TEI Lite [e.g. <note place="foot" resp="whp">See http://www-tei.uic.edu/orgs/tei/lite </note>]. The place attribute specifies where the note appears in the text (e.g. end, foot, inline, interlinear, or margin). The resp attribute identifies the person responsible for the note -- for example, the author, editor, or a person's initials. The attribute anchored="no" may be added to specify that the note is not anchored at an exact location; marginal notes typically are not anchored.

Lists

Plain, numbered, or bulleted lists with several levels of indent may be represented with Word styles List, List 2, List 3, List 4 for the plain version; List Bullet, List Bullet 2, …; List Number, List Number 2, etc. There are also styles called List Continue, List Continue 2, …, used for additional paragraphs of a list entry.

This is a plain list with

    one
        one-a
        one-b
    two
    three entries.

This is a bullet list with

    • one
        • one-a
        • one-b
    • two
    • three entries and a
      continued entry.

This is a numbered list with

    1. one
        1. one-a
        2. one-b
    2. two
    3. three entries and a
       continued entry.

Terms, Definitions, and Glossaries

Some documents contain a glossary. It should be surrounded by <glossary> tags, and individual terms and definitions should use paragraph styles called Term and Definition.

<glossary>

Agape

Greek for the unconditional love which God extends to his people.

Apotheosis

An ancient theological word used to describe the process by which a Christian becomes more like God.

</glossary>

Verse

Theological books often contain verse -- poetry, hymns, or versified presentation of material such as the Psalms. Verse is often typeset with varying levels of indentation. These are represented with Verse 1, Verse 2, and Verse 3 paragraph styles. In the example below, the first and third lines of each stanza are of style Verse 1, the second of Verse 2, and the fourth of Verse 3.

O God, a world of empty show,
    Dark wilds of restless, fruitless quest
Lie round me wheresoe'er I go:
        Within, with Thee, is rest.

And sated with the weary sum
    Of all men think, and hear, and see,
O more than mother's heart, I come,
        A tired child to Thee.

Sweet childhood of eternal life!
    Whilst troubled days and years go by,
In stillness hushed from stir and strife,
        Within Thine Arms I lie.

Thine Arms, to whom I turn and cling
    With thirsting soul that longs for Thee;
As rain that makes the pastures sing,
        Art Thou, my God, to me.

G. Ter Steegen

Attributions, Citations, Dates, Names, and Unclear

Attributions to authors, of poetry or letters for example, may be given the Attribution paragraph style, as in the "G. Ter Steegen" attribution in the poem above. These are by default rendered as right-justified, italic text. Names that occur in text may be given the Name character style. Then they can be found for inclusion in an index of names referred to. It is also possible to make the name appear in the index in another form by using the XML <name> element instead of the character style, as in this example: <name title="Ter Steegen, Gerhard">G. Ter Steegen</name>. Here the title attribute is used as the index entry.

Citations of other works such as books or treatises may be marked with the Citation character style or the <citation> element. That element may also take an href attribute to specify a URI for the cited work, if available. The <date> element may be used to mark dates that occur in the text. A value attribute may be used to specify the date in ISO format, as in this example: <date value="1997.12.25">last Christmas</date>.

During proofreading, some words or phrases may appear to contain an error of some sort, but correcting the text may have to be postponed. In this case, the phrase may be marked with the Unclear character style, as in this example: this phrase doesn't sense. These should be corrected later by comparison with the print edition.

Scripture

In theological texts, scripture passages may be cited, quoted, or explained. Citations refer to a passage, but quotes include the text of a passage in the document. References may occur in footnotes, in parentheses (Phil. 2:1-11), or in the text itself -- see Rom. 8:28. Context may be needed in order to interpret a reference -- see verse 29 and 10:8-13. Several passages may be stacked together in one citation (Matt. 5:44, 46; Luke 7:42; John 5:42, 13:35, 14:15, 23; 15:12-13; 21:15-16). For marking scripture citations, ThML will use the <scripRef> element, as in this example:

<scripRef passage="Rom. 8:27,28; 10:8-13" version="NIV">Romans
8:27,28; 10:8-13</scripRef>

The scripRef element may take a title and a type attribute, used to specify the name and type of reference. If type is unspecified, the reference is assumed to be an unspecified mention of a scripture passage. If title has a value and type has one of the CCEL scripture reference values, including Article, Hymn, Sermon, Treatise, Commentary, and others, the reference will be included in the library-wide scripture reference index under the specified category.

The version attribute specifies the translation or version, and the passage attribute is a list of scripture references separated by commas (used to separate verses) or semicolons (to separate chapters). Each reference may consist of a book name (or abbreviation), a chapter, and a verse. The chapter and verse are separated by a colon or period. If the book name or chapter is missing, it is assumed to be the same as in the previous reference. If two references are separated by a dash, all of the intermediary verses are included as well. In the case of books with only one chapter, a reference consists of a book name or abbreviation and a verse. Book names should be as they appear in the version cited or a unique prefix of at least two letters of the name. Abbreviations that are not prefixes may also be accepted by programs that process ThML documents.
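
As a rough illustration of these rules, a passage value could be expanded as in the Python sketch below. It is not the CCEL parser; the names Ref and parse_passage are hypothetical. The sketch assumes verse ranges stay within one chapter, and it does not handle one-chapter books, whole-chapter references, or the lookup of book abbreviations.

```python
import re
from typing import List, NamedTuple, Optional


class Ref(NamedTuple):
    book: str
    chapter: int
    first_verse: int
    last_verse: int


# Optional leading digit ("1 Cor"), one or more words, optional trailing period.
_BOOK = re.compile(r"^((?:[1-3]\s*)?[A-Za-z]+(?:\s+[A-Za-z]+)*)\.?\s*(.*)$")
# Optional "chapter:" or "chapter." prefix, a verse, and an optional "-verse".
_NUMBERS = re.compile(r"(?:(\d+)\s*[:.]\s*)?(\d+)(?:\s*-\s*(\d+))?")


def parse_passage(passage: str) -> List[Ref]:
    """Expand a passage attribute value into explicit references."""
    refs: List[Ref] = []
    book: Optional[str] = None
    chapter: Optional[int] = None
    for item in re.split(r"[,;]", passage):
        item = item.strip()
        if not item:
            continue
        named = _BOOK.match(item)
        if named:
            # A new book name resets the chapter context.
            book, item = named.group(1), named.group(2)
            chapter = None
        numbers = _NUMBERS.fullmatch(item)
        if book is None or numbers is None:
            raise ValueError(f"cannot interpret reference: {item!r}")
        chap, first, last = numbers.groups()
        if chap is not None:
            chapter = int(chap)
        if chapter is None:
            raise ValueError(f"no chapter known for: {item!r}")
        refs.append(Ref(book, chapter, int(first), int(last or first)))
    return refs


for ref in parse_passage("Rom. 8:27,28; 10:8-13"):
    print(ref)
```

A missing book or chapter is inherited from the previous reference, so "28" is read as Rom. 8:28 and "10:8-13" as Rom. 10:8-13.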

Software for processing ThML texts will likely have a scripture parser incorporated that finds scripture citations and marks them appropriately, so that it will not be necessary to mark citations by hand. However, parsing text to find and identify scripture references involves several difficulties. One problem is that different translations of the Bible use different versification schemes. For example, Psalm 9 in the Vulgate is separated into two -- Psalms 9 and 10 -- in the King James Version, so that most Psalm numbers differ by one. Psalm 118:132 in the Vulgate corresponds to Psalm 119:132 in the KJV. In order to interpret a reference, then, the version must be known or assumed. Scripture references will be assumed to be compatible with the KJV, ASV, NASB, NIV, and TLB unless otherwise specified.
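
The Psalm numbering offset can be sketched as a small Python function. The function name is hypothetical, and the sketch maps psalm numbers only, not verses. The handful of psalms that are split or merged between the two numberings (Vulgate 9, 113 to 115, 146, and 147) are rejected rather than guessed at.

```python
def vulgate_to_kjv_psalm(number: int) -> int:
    """Map a Vulgate psalm number to its KJV number where the mapping is one-to-one."""
    if not 1 <= number <= 150:
        raise ValueError(f"no such psalm: {number}")
    if number <= 8 or number >= 148:
        return number            # the numbering agrees at both ends of the Psalter
    if 10 <= number <= 112 or 116 <= number <= 145:
        return number + 1        # the common case: the KJV number is one higher
    raise ValueError(f"Vulgate Psalm {number} is split or merged in the KJV")


print(vulgate_to_kjv_psalm(118))
```

For Vulgate Psalm 118 the function returns 119, matching the Psalm 118:132 example above.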

Context is sometimes necessary in interpreting a reference. A passage may refer to Romans 8:28 at one point and later to verses 29 and 30 and chapter 10:8-13. A parser should be able to identify the context in most cases, but in some cases it may be necessary to set the context or turn the parser off. The <scripContext version="NIV" passage="Romans 8" /> element is used to set the default context for the parser, and the <scripParseOff /> and <scripParseOn /> elements may be used to turn the parser off or on, to prevent linking of a passage such as "Bob had 2 apples and John 3." The version attribute may be set in a scripContext element but it is never set by the parser.
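
The need for the parser switches can be shown with a deliberately naive Python sketch. The book list, the pattern, and the function names are all hypothetical; a real scripture parser would be considerably more careful.

```python
import re
from typing import List

# A tiny, assumed list of book names and abbreviations.
BOOKS = ("Romans", "Rom", "John", "Luke", "Matt")
_REF = re.compile(r"\b(" + "|".join(BOOKS) + r")\.?\s+(\d+)(?:[:.](\d+))?")
_SWITCH = re.compile(r"<scripParse(Off|On)\s*/>")


def find_candidates(text: str) -> List[str]:
    """Return every substring that looks like 'Book chapter[:verse]'."""
    return [m.group(0) for m in _REF.finditer(text)]


def find_with_switches(text: str) -> List[str]:
    """Like find_candidates, but skip text between scripParseOff and scripParseOn."""
    found: List[str] = []
    enabled, pos = True, 0
    for switch in _SWITCH.finditer(text):
        if enabled:
            found += find_candidates(text[pos:switch.start()])
        enabled = switch.group(1) == "On"
        pos = switch.end()
    if enabled:
        found += find_candidates(text[pos:])
    return found


print(find_candidates("Bob had 2 apples and John 3."))
TEXT = ("See Rom. 8:28. <scripParseOff />Bob had 2 apples and John 3."
        "<scripParseOn /> See also Luke 7:42.")
print(find_with_switches(TEXT))
```

Without the switches, the naive pattern reports "John 3" as a reference; with them, only the two real references are found.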

In theological texts, scripture is also sometimes quoted. In this case, it is not desirable to link the reference to the scripture passage, but it may be desirable to incorporate the passage into a table of scripture references. Quotations of scripture may be marked with the <scripture> element. A passage may be represented as in this example:

<scripture passage="Mark 7:16" version="NKJV">If anyone has ears to hear, let them hear!</scripture>

Index Entries

Passages in the text may be marked for insertion into an index using the <index> element. For example, one might mark a passage for inclusion in a subject index this way:

<index type="subject" subject1="Christian Life" subject2="Sanctification" title="Apotheosis">Apotheosis (or Deification) is an ancient theological word commonly used in Eastern theology to describe the process by which a Christian becomes more like God . . . </index>.

The title attribute is used in the subject index. If it is not present, the text inside the <index> element is used as a title.

A document may use several user-selected types of index entries. An XML element (<insertIndex type="subject" />) is also provided to specify that a sorted, hierarchical index of all the "subject" (e.g.) index entries should be inserted at that point, with links to the appropriate locations in the text. Certain additional index types are also understood: <insertIndex type="name" /> inserts an index of all names marked with the <name> element or style; if the title attribute is present, it is used as the index entry. Similarly, indexes may be inserted for citations, dates, foreign words and phrases, images (<img>), and scripture references (<scripRef>).
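
Collecting entries for such an index might look like the following Python sketch. It is an illustration only: the function name, the sample entries, and the case-insensitive sort are assumptions, and the links back to the text are left out.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def build_index(body: ET.Element, index_type: str) -> List[Tuple[str, ...]]:
    """Return sorted (subject1, subject2, ..., title) tuples for one index type."""
    entries = []
    for elem in body.iter("index"):
        if elem.get("type") != index_type:
            continue
        # Fall back to the element's text when no title attribute is present.
        title = elem.get("title") or "".join(elem.itertext()).strip()
        path = []
        level = 1
        while elem.get(f"subject{level}"):
            path.append(elem.get(f"subject{level}"))
            level += 1
        entries.append((*path, title))
    return sorted(entries, key=lambda entry: [part.lower() for part in entry])


# Hypothetical sample body; "topic" stands in for any other user-selected type.
SAMPLE_INDEX = """<ThML.body>
<index type="subject" subject1="Christian Life" subject2="Sanctification"
       title="Apotheosis">Apotheosis (or Deification) is an ancient word</index>
<index type="subject" subject1="Christian Life" subject2="Prayer">Unceasing prayer</index>
<index type="topic" title="Rest">Within, with Thee, is rest.</index>
</ThML.body>"""

subject_index = build_index(ET.fromstring(SAMPLE_INDEX), "subject")
for entry in subject_index:
    print(" / ".join(entry))
```

Entries of other types are ignored, and an entry without a title attribute is listed under its own text.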

Foreign Languages

The primary language for a document is specified in the header. Passages in other languages may be marked with the foreign tag and the lang attribute. For example, the Hebrew passage <foreign lang="he" dir="rtl">yhwh</foreign> may be marked as shown. The optional dir attribute specifies the direction of the text, rtl or ltr, and the lang attribute values are as specified in ISO 639. Some examples are Dutch: nl, English: en, French: fr, German: de, Greek: el, Hebrew: he, Latin: la, Spanish: es, Portuguese: pt, Russian: ru.

If the language uses characters not available in the ISO-8859-1 (Latin-1) character set, they may be represented in Unicode (preferred), as in this Greek example (λογος) and this Hebrew example (הלהי) using the Unicode font Lucida Sans Unicode. Alternatively, they may be represented with the Latin-1 character set using an appropriate font, for example <foreign lang="el" style="font-family: SIL Galatia">…</foreign>. The Greek and Hebrew fonts used for the CCEL are the excellent freeware SIL Galatia and SIL Ezra fonts and related software from the Summer Institute of Linguistics, used here in a Greek example (…) and a Hebrew example (hwhy).

Hypertext Links

Hypertext links can be inserted using the Microsoft Word link facility, for instance with the ctrl-k shortcut. Links can be in either HTML or XML format. This is an example of a link to the CCEL.

Horizontal Rules

Horizontal rules that span 30% of the page can be inserted with a paragraph using the HR30 style. The above paragraph is an example. The paragraph below, of style HR, represents a horizontal rule that spans the entire page.

Conclusion

Electronic texts formatted according to these guidelines will be converted to XML format using custom software. Browsers that support XML will be able to use the resulting texts directly, and the format is semantically rich enough that it will be possible to convert texts to a variety of other formats without loss. Those formats may include multi-file HTML webs, plain text, PDF, OnLine Bible, Windows Help, and others.

Libraries making use of all of the semantic information in books using this markup will be able to provide a variety of capabilities not often found in digital libraries. These capabilities include global scripture and subject indexes, indexes of foreign words and names mentioned, flexible alignment of texts, linking of arbitrary texts and dictionaries or lexicons, display of the pages of a book as text or image, cross reference systems, and the ability to convert automatically to new formats that may be needed.

Appendix: XML Body Elements

Name	Use	Example
added	Text added to print edition	<added reason="Automatic TOC" resp="whp"><insertContents level="2" /></added>
attr	Attribution, e.g. poem author	<attr>G. Ter Steegen</attr>
citation	Reference to another work	<citation title="Imitatio Christi. English." href="/k/kempis/imitation/">Imitation of Christ</citation>
date	Any date or time	<date value="1997.12.25">last Christmas</date>
deleted	Text from print edition that should not be displayed	<deleted reason="Replaced with automatic TOC"><H1>Table of Contents</H1>…</deleted>
divn	Major divisions in text	<div2 type="Chapter" n="I" title="Of the Inward Life">
index	Index entry	<index type="subject" subject1="Christian Life" subject2="Sanctification" title="Apotheosis">The word apotheosis . . . </index>
insertContents	Insert table of contents here	<insertContents level="2" />
insertIndex	Insert index here	<insertIndex type="foreign" />
foreign	Foreign language passage	<foreign lang="he" dir="rtl">hwhy</foreign>
glossary	Mark a glossary	<glossary type="lexicon"><dl><dt>…<dd>…</glossary>
l	Line of verse	<l>O God, a world of empty show,</l>
l2	Line of verse (indented)	<l2>Dark wilds of restless, fruitless quest</l2>
l3	Line of verse (more indented)	<l3>Within, with Thee, is rest.</l3>
name	A person's name	<name title="Ter Steegen, Gerhard">G. Ter Steegen</name>
note	Footnotes, endnotes, etc.	<note place="foot" resp="editor" target="#p1" targetEnd="#p2">…</note>
pb	Page break in print edition	<pb n="37" href="page37.gif" />
scripContext	Set scripture context for parser	<scripContext passage="Romans 8" version="NRSV" />
scripParseOff	Turn scripture ref parser off	<scripParseOff />
scripParseOn	Turn scripture ref parser on	<scripParseOn />
scripRef	Scripture reference	<scripRef passage="Rom. 8.28" version="NRSV" type="Commentary">…</scripRef>
scripture	Scripture passage	<scripture passage="Rom. 8:28" version="NIV">…</scripture>
sectionInfo	Chapter or section info	<sectionInfo>…</sectionInfo>
sync	Synchronization point	<sync type="Strongs" value="G42" />
verse	Poetry, verse	<verse><l>O God, a world of empty show</l></verse>

This document (last modified September 16, 1998) from Believerscafe.com