---
title: XML
author: massifrg@gmail.com
---

# Pandoc XML format

This document describes Pandoc's `xml` format, a 1:1 equivalent
of the `native` and `json` formats.

Here's the XML version of the beginning of this document,
to give you a glimpse of the format:

```xml
massifrg@gmail.com XML
Pandoc XML format
This document describes Pandoc’s xml format, a 1:1 equivalent of the native and json formats. ...
```

## The tags

If you know [Pandoc types](https://hackage.haskell.org/package/pandoc-types-1.23.1/docs/Text-Pandoc-Definition.html),
the XML conversion is fairly straightforward.

These are the main rules:

- `Str` inlines are usually converted to plain UTF-8 text (see below for exceptions);

- `Space` inlines are usually converted to " " chars (see below for exceptions);

- every `Block` and `Inline` becomes an element with the same name and the same capitalization:
  a `Para` Block becomes a `<Para>` element, an `Emph` Inline becomes an `<Emph>` element, and so on;

- the root element is `<Pandoc>` and it has an `api-version` attribute, whose value
  is a string of comma-separated integers; it matches the `pandoc-api-version`
  field of the `json` format;

- the root `<Pandoc>` element has only two children: `<meta>` and `<blocks>`
  (lowercase, as in the `json` format);

- blocks and inlines with an `Attr` are HTML-like, and they have:

  - the `id` attribute for the identifier
  - the `class` attribute, a string of space-separated classes
  - the other attributes of `Attr`, without any prefix (so no `data-` prefix, unlike HTML)

- attributes are in lower (kebab) case:

  - `level` in Header
  - `start`, `number-style`, `number-delim` in OrderedList;
    style and delimiter values are capitalized exactly as in `Text.Pandoc.Definition`
  - `format` in RawBlock and RawInline
  - `quote-type` in Quoted (values are `SingleQuote` and `DoubleQuote`)
  - `math-type` in Math (values are `InlineMath` and `DisplayMath`)
  - `title` and `src` in Image target
  - `title` and `href` in Link target
  - `alignment` and `col-width` in ColSpec (about `col-width` values, see below);
    alignment values are capitalized as in `Text.Pandoc.Definition`
  - `alignment`, `row-span` and `col-span` in Cell
  - `row-head-columns` in TableBody
  - `id`, `mode`, `note-num` and `hash` for Citation (about Cite elements, see below);
    `mode` values are capitalized as in `Text.Pandoc.Definition`

The classes of items with an `Attr` are put in a `class` attribute, so that you can
style the XML with CSS.

## Str and Space elements

`Str` and `Space` usually result in text and normal " " spaces, but there are exceptions:

- `Str ""`, an empty string, is not suppressed; instead it is converted into a `<Str />` element;

- `Str "foo bar"`, a string containing a space, is converted into a `<Str>` element instead of plain text;

- consecutive `Str` inlines, as in `[ ..., Str "foo", Str "bar", ... ]`, are encoded as `foo`
  followed by a `<Str>` element, to keep their individuality;

- consecutive `Space` inlines, as in `[ ..., Space, Space, ... ]`, are encoded as a `<Space>` element;

- `Space` inlines at the start or at the end of their container element are always encoded
  with a `<Space />` element, instead of just a " ".

These encodings are necessary to ensure 1:1 equivalence of the `xml` format with the AST,
and therefore with the `native` and `json` formats.

Since the ones above are corner cases, you should not usually see those `<Str />` and `<Space />`
elements in your documents.

## Added tags

Some other elements have been introduced to better structure the resulting XML.

Since they are not Pandoc Blocks or Inlines, or they have no constructor or type in Pandoc's
Haskell code, they are kept lowercase.

### BulletList and OrderedList items

Items of those lists are embedded in `<item>` elements.

These snippets are from the `xml` version of `test/testsuite.native`:

```xml
asterisk 1 asterisk 2 asterisk 3 ... First Second Third
```

### DefinitionList items

Definition lists have `<item>` elements.

Each `<item>` has exactly one `<term>` child element and one or more `<def>` child elements.

This snippet is from the `xml` version of `test/testsuite.native`:

```xml
apple red fruit orange orange fruit banana yellow fruit
```

### Figure and Table captions

Figures and tables have a `<Caption>` child element, which in turn may optionally have
a `<ShortCaption>` child element.

This snippet is from the `xml` version of `test/testsuite.native`:

```xml
lalune lalune
```

### Tables

A `<Table>` element has:

- a `<Caption>` child element;

- a `<colspecs>` child element, whose children are empty `<ColSpec>` elements;

- a `<TableHead>` child element;

- one or more `<TableBody>` child elements, which in turn have two children:
  `<header>` and `<body>`, whose children are `<Row>` elements;

- a `<TableFoot>` child element.

This specification is debatable; I have these doubts:

- is it necessary to enclose the `<ColSpec>` elements in a `<colspecs>` element?

- to discriminate between header and data cells in table bodies, there are
  the `row-head-columns` attribute, and the `<header>` and `<body>` children
  of the `<TableBody>` element, but there's only one type of cell:
  every cell is a `<Cell>` element;

- the specs are a tradeoff between consistency with Pandoc types and CSS compatibility;
  this way the header rows of table bodies are easy to style with CSS, while header columns are not.

The `ColWidthDefault` value becomes a "0" value for the `col-width` attribute;
this way it's type-consistent with non-zero values, but I'm still doubtful
whether to leave its value as a "ColWidthDefault" string.

Here's an example from the `xml` version of `test/tables/planets.native`:

```xml
Name Mass (10^24kg) ...
Terrestrial planets Mercury 0.330 4,879 5427 3.7 4222.6 57.9 167 0 Closest to the Sun ...
Data about the planets of our solar system.
```

### Metadata and MetaMap entries

Metadata entries are meta values (`MetaBool`, `MetaString`, `MetaInlines`, `MetaBlocks`,
`MetaList` and `MetaMap` elements) inside `<entry>` elements.

The `<meta>` and the `<MetaMap>` elements have the same child elements (`<entry>`),
which have a `key` attribute.

`<MetaInlines>`, `<MetaBlocks>`, `<MetaList>` and `<MetaMap>` elements all have child elements.

`<MetaString>` elements have only text.

`<MetaBool>` elements are empty; they come in two forms, one for true and one for false.

This snippet is from the `xml` version of `test/testsuite.native`:

```xml
John MacFarlane Anonymous July 17, 2006 Pandoc Test Suite
```

### Cite elements

`Cite` inlines are modeled with `<Cite>` elements, whose first child is a `<citations>` element,
which has only `<Citation>` child elements.

`<Citation>` elements are empty, unless they have a prefix and/or a suffix.

Here's an example from the `xml` version of `test/markdown-citations.native`:

```xml
@item1 says blah. p. 30 @item1 [p. 30] says blah. A citation group see chap. 3 also p. 34-35 [see @item1 chap. 3; also @пункт3 p. 34-35].
```
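
To make the Cite structure described above more concrete, here is a minimal Python sketch that lists the citations found inside each Cite element, using only the standard library. The sample XML is hand-written for illustration and is not copied from pandoc output: the element names (`Cite`, `citations`, `Citation`) and the `id`, `mode`, `note-num` and `hash` attribute names follow this document, while the attribute values and the helper name `citation_keys` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT actual pandoc output. Element and attribute names
# follow the description in this document; all values are made up.
SAMPLE = """
<Para><Cite><citations>
<Citation id="item1" mode="AuthorInText" note-num="1" hash="0" />
<Citation id="item2" mode="NormalCitation" note-num="1" hash="0" />
</citations>@item1 says blah.</Cite></Para>
"""


def citation_keys(xml_text):
    """Return (id, mode, note-num) for every Citation inside a Cite element.

    Hypothetical helper: it relies on the rule that the first child of a
    Cite element is a citations element holding only Citation children.
    """
    root = ET.fromstring(xml_text.strip())
    found = []
    for cite in root.iter("Cite"):
        # Skip Cite elements that do not start with a citations child.
        if len(cite) == 0 or cite[0].tag != "citations":
            continue
        for citation in cite[0].findall("Citation"):
            found.append(
                (
                    citation.get("id"),
                    citation.get("mode"),
                    int(citation.get("note-num", "0")),
                )
            )
    return found


if __name__ == "__main__":
    for key, mode, note_num in citation_keys(SAMPLE):
        print(key, mode, note_num)
```

The sketch deliberately ignores prefixes and suffixes, so it only reads attributes and never looks inside a Citation element.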