---
title: XML
author: [email protected]
---
# Pandoc XML format
This document describes Pandoc's `xml` format, a 1:1 equivalent
of the `native` and `json` formats.
Here's the xml version of the beginning of this document,
to give you a glimpse of the format:
```xml
<?xml version='1.0' ?>
<Pandoc api-version="1,23,1">
<meta>
<entry key="author">
<MetaInlines>[email protected]</MetaInlines>
</entry>
<entry key="title">
<MetaInlines>XML</MetaInlines>
</entry>
</meta>
<blocks>
<Header id="pandoc-xml-format" level="1">Pandoc XML format</Header>
<Para>This document describes Pandoc’s <Code>xml</Code> format, a 1:1 equivalent<SoftBreak />of the <Code>native</Code> and <Code>json</Code> formats.</Para>
...
</blocks>
</Pandoc>
```
## The tags
If you know [Pandoc types](https://hackage.haskell.org/package/pandoc-types-1.23.1/docs/Text-Pandoc-Definition.html), the XML conversion is fairly straightforward.
These are the main rules:
- `Str` inlines are usually converted to plain UTF-8 text (see below for exceptions);
- `Space` inlines are usually converted to " " characters (see below for exceptions);
- every `Block` and `Inline` becomes an element with the same name and the same capitalization:
  a `Para` Block becomes a `<Para>` element, an `Emph` Inline becomes an `<Emph>` element,
  and so on;
- the root element is `<Pandoc>` and it has an `api-version` attribute, whose value
  is a string of comma-separated integers; it matches the `pandoc-api-version`
  field of the `json` format;
- the root `<Pandoc>` element has only two children: `<meta>` and `<blocks>`
  (lowercase, as in the `json` format);
- blocks and inlines with an `Attr` are HTML-like, and they have:
  - the `id` attribute for the identifier
  - the `class` attribute, a string of space-separated classes
  - the other attributes of `Attr`, without any prefix (so no `data-` prefix, unlike HTML)
- attributes are in lower (kebab) case:
  - `level` in Header
  - `start`, `number-style`, `number-delim` in OrderedList
    (style and delimiter values are capitalized exactly as in `Text.Pandoc.Definition`)
  - `format` in RawBlock and RawInline
  - `quote-type` in Quoted (values are `SingleQuote` and `DoubleQuote`)
  - `math-type` in Math (values are `InlineMath` and `DisplayMath`)
  - `title` and `src` in Image target
  - `title` and `href` in Link target
  - `alignment` and `col-width` in ColSpec
    (alignment values are capitalized as in `Text.Pandoc.Definition`; for `col-width` values, see below)
  - `alignment`, `row-span` and `col-span` in Cell
  - `row-head-columns` in TableBody
  - `id`, `mode`, `note-num` and `hash` in Citation
    (`mode` values are capitalized as in `Text.Pandoc.Definition`; for Cite elements, see below)
The classes of items with an `Attr` are put in a `class` attribute,
so that you can style the XML with CSS.
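
To make the rules above concrete, here is a minimal Python sketch that reads the `api-version` attribute and rebuilds an `Attr`-like triple from a `<Header>` element. The sample document is hand-written for illustration (it is not pandoc output), and the variable names are only assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape described above (not actual pandoc output).
DOC = """<Pandoc api-version="1,23,1">
<meta />
<blocks>
<Header id="intro" class="unnumbered big" level="2" lang="en">Intro</Header>
</blocks>
</Pandoc>"""

root = ET.fromstring(DOC)

# api-version is a string of comma-separated integers.
api_version = tuple(int(n) for n in root.get("api-version").split(","))

header = root.find("blocks/Header")
identifier = header.get("id", "")
# The class attribute is a string of space-separated classes.
classes = header.get("class", "").split()
# Everything else, except Header's own "level", is treated here as a key-value
# pair of the Attr (a simplification made for this sketch).
others = {k: v for k, v in header.attrib.items()
          if k not in ("id", "class", "level")}

print(api_version, identifier, classes, others)
```

Because the element-specific attributes (`level` here) share the namespace of the unprefixed `Attr` key-value pairs, a reader has to know which names belong to the element itself.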
## Str and Space elements
`Str` and `Space` usually result in text and normal " " spaces, but there are exceptions:
- `Str ""`, an empty string, is not suppressed; instead it is converted into a `<Str />` element;
- `Str "foo bar"`, a string containing a space, is converted as `<Str content="foo bar" />`;
- consecutive `Str` inlines, as in `[ ..., Str "foo", Str "bar", ... ]`,
are encoded as `foo<Str content="bar" />` to keep their individuality;
- consecutive `Space` inlines, as in `[ ..., Space, Space, ... ]`,
are encoded as `<Space count="2" />`
- `Space` inlines at the start or at the end of their container element
are always encoded with a `<Space />` element, instead of just a " "
These encodings are necessary to ensure 1:1 equivalence of the `xml` format with the AST,
and therefore with the `native` and `json` formats.
Since these are corner cases, you should rarely see `<Str />` and `<Space />`
elements in your documents.
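
As a rough illustration of how a reader could undo these encodings, the following Python sketch turns the content of an inline container back into a list of `Str` and `Space` tokens. The helper names `text_to_inlines` and `decode_inlines` are hypothetical, the sample `<Para>` is hand-written, and the sketch assumes that a missing `count` means 1 and a missing `content` means an empty string; any other child element is reduced to its tag name.

```python
import xml.etree.ElementTree as ET

def text_to_inlines(text):
    """Split plain text into Str tokens, with one Space token per ' '."""
    inlines, word = [], ""
    for ch in text or "":
        if ch == " ":
            if word:
                inlines.append(("Str", word))
                word = ""
            inlines.append(("Space",))
        else:
            word += ch
    if word:
        inlines.append(("Str", word))
    return inlines

def decode_inlines(container):
    """Rebuild the Str/Space sequence of an inline container element."""
    inlines = text_to_inlines(container.text)
    for child in container:
        if child.tag == "Str":
            # <Str /> is the empty string, <Str content="..."/> an explicit one.
            inlines.append(("Str", child.get("content", "")))
        elif child.tag == "Space":
            # <Space count="N" /> stands for N consecutive Space inlines.
            inlines.extend([("Space",)] * int(child.get("count", "1")))
        else:
            inlines.append((child.tag,))  # other inlines: tag name only
        inlines.extend(text_to_inlines(child.tail))
    return inlines

para = ET.fromstring(
    '<Para>foo<Str content="bar" /> baz<Space count="2" />qux<Space /></Para>'
)
tokens = decode_inlines(para)
print(tokens)
```

In this sample, `foo` and `bar` come back as two separate `Str` tokens, and the trailing `<Space />` is preserved as a final `Space`.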
## Added tags
Some other elements have been introduced to better structure the resulting XML.
Since they are not Pandoc Blocks or Inlines, or they have no constructor or type
in Pandoc's Haskell code, they are kept lowercase.
### BulletList and OrderedList items
Items of those lists are embedded in `<item>` elements.
These snippets are from the `xml` version of `test/testsuite.native`:
```xml
<BulletList>
<item>
<Plain>asterisk 1</Plain>
</item>
<item>
<Plain>asterisk 2</Plain>
</item>
<item>
<Plain>asterisk 3</Plain>
</item>
</BulletList>
...
<OrderedList start="1" number-style="Decimal" number-delim="Period">
<item>
<Plain>First</Plain>
</item>
<item>
<Plain>Second</Plain>
</item>
<item>
<Plain>Third</Plain>
</item>
</OrderedList>
```
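
A short Python sketch of how the `<item>` wrappers could be consumed, using a shortened copy of the ordered list above. It is only an illustration: each item is flattened to `(tag, text)` pairs, which discards any nested inline markup.

```python
import xml.etree.ElementTree as ET

XML = """<OrderedList start="1" number-style="Decimal" number-delim="Period">
<item>
<Plain>First</Plain>
</item>
<item>
<Plain>Second</Plain>
</item>
</OrderedList>"""

olist = ET.fromstring(XML)
start = int(olist.get("start"))

# Each <item> holds the blocks of one list item; here a single <Plain> each.
items = [[(block.tag, "".join(block.itertext())) for block in item]
         for item in olist.findall("item")]

for number, blocks in enumerate(items, start=start):
    print(number, blocks)
```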
### DefinitionList items
Definition lists have `<item>` elements.
Each `<item>` element has exactly one `<term>` child element,
and one or more `<def>` child elements.
This snippet is from the `xml` version of `test/testsuite.native`:
```xml
<DefinitionList>
<item>
<term>apple</term>
<def>
<Plain>red fruit</Plain>
</def>
</item>
<item>
<term>orange</term>
<def>
<Plain>orange fruit</Plain>
</def>
</item>
<item>
<term>banana</term>
<def>
<Plain>yellow fruit</Plain>
</def>
</item>
</DefinitionList>
```
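
The term/definition structure can be read back into pairs with a few lines of Python. This is a sketch over a shortened copy of the snippet above; the `entries` name is an assumption, and definitions are flattened to plain text for brevity.

```python
import xml.etree.ElementTree as ET

XML = """<DefinitionList>
<item>
<term>apple</term>
<def>
<Plain>red fruit</Plain>
</def>
</item>
<item>
<term>orange</term>
<def>
<Plain>orange fruit</Plain>
</def>
</item>
</DefinitionList>"""

dlist = ET.fromstring(XML)
entries = []
for item in dlist.findall("item"):
    # One <term> per item, followed by one or more <def> elements.
    term = "".join(item.find("term").itertext())
    # strip() drops the line breaks that surround the blocks inside <def>.
    defs = ["".join(d.itertext()).strip() for d in item.findall("def")]
    entries.append((term, defs))

print(entries)
```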
### Figure and Table captions
Figures and tables have a `<Caption>` child element,
which in turn may optionally have a `<ShortCaption>` child element.
This snippet is from the `xml` version of `test/testsuite.native`:
```xml
<Figure>
<Caption>
<Plain>lalune</Plain>
</Caption>
<Plain><Image src="lalune.jpg" title="Voyage dans la Lune">lalune</Image></Plain>
</Figure>
```
### Tables
A `<Table>` element has:
- a `<Caption>` child element;
- a `<colspecs>` child element, whose children are empty
`<ColSpec alignment="..." col-width="..." />` elements;
- a `<TableHead>` child element;
- one or more `<TableBody>` child elements, each of which
has two children: `<header>` and `<body>`, whose children
are `<Row>` elements;
- a `<TableFoot>` child element.
This specification is debatable; I have these doubts:
- is it necessary to enclose the `<ColSpec>` elements in a `<colspecs>` element?
- to distinguish header cells from data cells in table bodies,
there are the `row-head-columns` attribute and the `<header>` and `<body>` children
of the `<TableBody>` element, but there is only one type of cell:
every cell is a `<Cell>` element
- the specs are a tradeoff between consistency with pandoc types and CSS compatibility;
this way bodies' header rows are easily stylable with CSS, while header columns are not
The `ColWidthDefault` value becomes a "0" value for the attribute `col-width`;
this way it's type-consistent with non-zero values, but I'm still unsure whether it
should instead be written as a "ColWidthDefault" string.
Here's an example from the `xml` version of `test/tables/planets.native`:
```xml
<Table>
<Caption>
<Para>Data about the planets of our solar system.</Para>
</Caption>
<colspecs>
<ColSpec col-width="0" alignment="AlignCenter" />
<ColSpec col-width="0" alignment="AlignCenter" />
<ColSpec col-width="0" alignment="AlignDefault" />
<ColSpec col-width="0" alignment="AlignRight" />
<ColSpec col-width="0" alignment="AlignRight" />
<ColSpec col-width="0" alignment="AlignRight" />
<ColSpec col-width="0" alignment="AlignRight" />
<ColSpec col-width="0" alignment="AlignRight" />
<ColSpec col-width="0" alignment="AlignRight" />
<ColSpec col-width="0" alignment="AlignRight" />
<ColSpec col-width="0" alignment="AlignRight" />
<ColSpec col-width="0" alignment="AlignDefault" />
</colspecs>
<TableHead>
<Row>
<Cell col-span="2" row-span="1" alignment="AlignDefault" />
<Cell col-span="1" row-span="1" alignment="AlignDefault">
<Plain>Name</Plain>
</Cell>
<Cell col-span="1" row-span="1" alignment="AlignDefault">
<Plain>Mass (10^24kg)</Plain>
</Cell>
...
</Row>
</TableHead>
<TableBody row-head-columns="3">
<header />
<body>
<Row>
<Cell col-span="2" row-span="4" alignment="AlignDefault">
<Plain>Terrestrial planets</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>Mercury</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>0.330</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>4,879</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>5427</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>3.7</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>4222.6</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>57.9</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>167</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>0</Plain>
</Cell>
<Cell alignment="AlignDefault">
<Plain>Closest to the Sun</Plain>
</Cell>
</Row>
...
</body>
</TableBody>
<TableFoot />
</Table>
```
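
The following Python sketch shows one way to read the column specifications back, mapping a `col-width` of "0" to `ColWidthDefault`. The table is a hand-written skeleton, not pandoc output; the `read_col_width` helper is hypothetical, and writing a non-zero width as a decimal number such as "0.25" is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hand-written skeleton; "0.25" as a non-zero width is an assumption.
XML = """<Table>
<Caption />
<colspecs>
<ColSpec col-width="0" alignment="AlignCenter" />
<ColSpec col-width="0.25" alignment="AlignRight" />
</colspecs>
<TableHead />
<TableBody row-head-columns="1">
<header />
<body />
</TableBody>
<TableFoot />
</Table>"""

def read_col_width(value):
    """Map the col-width attribute back to a ColWidth-like value."""
    width = float(value)
    return "ColWidthDefault" if width == 0 else ("ColWidth", width)

table = ET.fromstring(XML)
colspecs = [(c.get("alignment"), read_col_width(c.get("col-width")))
            for c in table.findall("colspecs/ColSpec")]
# One entry per <TableBody>; a table may have several bodies.
row_head_columns = [int(b.get("row-head-columns"))
                    for b in table.findall("TableBody")]

print(colspecs, row_head_columns)
```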
### Metadata and MetaMap entries
Metadata entries are meta values (`MetaBool`, `MetaString`, `MetaInlines`, `MetaBlocks`,
`MetaList` and `MetaMap` elements) inside `<entry>` elements.
The `<meta>` and the `<MetaMap>` elements have the same children elements (`<entry>`),
which have a `key` attribute.
`<MetaInlines>`, `<MetaBlocks>`, `<MetaList>` and `<MetaMap>` elements
all have children elements.
`<MetaString>` elements have only text.
`<MetaBool>` elements are empty; they can be either `<MetaBool value="true" />`
or `<MetaBool value="false" />`.
This snippet is from the `xml` version of `test/testsuite.native`:
```xml
<meta>
<entry key="author">
<MetaList>
<MetaInlines>John MacFarlane</MetaInlines>
<MetaInlines>Anonymous</MetaInlines>
</MetaList>
</entry>
<entry key="date">
<MetaInlines>July 17, 2006</MetaInlines>
</entry>
<entry key="title">
<MetaInlines>Pandoc Test Suite</MetaInlines>
</entry>
</meta>
```
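
Since `<meta>` and `<MetaMap>` share the same `<entry>` children, one recursive function is enough to turn metadata into plain Python values. The sketch below is illustrative only: the `meta_value` and `entries` helpers are hypothetical, `MetaInlines` and `MetaBlocks` are flattened to text, and the `draft` entry was added to the sample to exercise `MetaBool`.

```python
import xml.etree.ElementTree as ET

def meta_value(el):
    """Convert one meta value element to a plain Python value."""
    if el.tag == "MetaBool":
        return el.get("value") == "true"
    if el.tag == "MetaString":
        return el.text or ""
    if el.tag == "MetaList":
        return [meta_value(child) for child in el]
    if el.tag == "MetaMap":
        return entries(el)
    # MetaInlines and MetaBlocks: flattened to plain text for brevity.
    return "".join(el.itertext())

def entries(parent):
    """Read the <entry key="..."> children of <meta> or <MetaMap>."""
    return {e.get("key"): meta_value(e[0]) for e in parent.findall("entry")}

# The "draft" entry is hypothetical, added to show a MetaBool.
META = """<meta>
<entry key="author">
<MetaList>
<MetaInlines>John MacFarlane</MetaInlines>
<MetaInlines>Anonymous</MetaInlines>
</MetaList>
</entry>
<entry key="draft">
<MetaBool value="true" />
</entry>
<entry key="title">
<MetaInlines>Pandoc Test Suite</MetaInlines>
</entry>
</meta>"""

meta = entries(ET.fromstring(META))
print(meta)
```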
### Cite elements
`Cite` inlines are modeled with `<Cite>` elements, whose first child
is a `<citations>` element, which has only `<Citation>` child elements.
`<Citation>` elements are empty, unless they have a prefix and/or a suffix.
Here's an example from the `xml` version of `test/markdown-citations.native`:
```xml
<Para><Cite><citations>
<Citation note-num="3" mode="AuthorInText" id="item1" hash="0" />
</citations>@item1</Cite> says blah.</Para>
<Para><Cite><citations>
<Citation note-num="4" mode="AuthorInText" id="item1" hash="0">
<suffix>p. 30</suffix>
</Citation>
</citations>@item1 [p. 30]</Cite> says blah.</Para>
<Para>A citation group <Cite><citations>
<Citation note-num="8" mode="NormalCitation" id="item1" hash="0">
<prefix>see</prefix>
<suffix> chap. 3</suffix>
</Citation>
<Citation note-num="8" mode="NormalCitation" id="пункт3" hash="0">
<prefix>also</prefix>
<suffix> p. 34-35</suffix>
</Citation>
</citations>[see @item1 chap. 3; also @пункт3 p. 34-35]</Cite>.</Para>
```
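
A minimal Python sketch of reading a `<Cite>` element: the citations come from the `<citations>` child, and the visible citation text is whatever follows it inside `<Cite>`. The sample is adapted from the last paragraph above, with a hypothetical second citation `item2` that has no prefix or suffix; the dictionary keys are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Adapted from the example above; "item2" is a hypothetical empty Citation.
XML = """<Para>A citation group <Cite><citations>
<Citation note-num="8" mode="NormalCitation" id="item1" hash="0">
<prefix>see</prefix>
<suffix> chap. 3</suffix>
</Citation>
<Citation note-num="8" mode="NormalCitation" id="item2" hash="0" />
</citations>[see @item1 chap. 3; @item2]</Cite>.</Para>"""

para = ET.fromstring(XML)
cite = para.find("Cite")

citations = []
for c in cite.findall("citations/Citation"):
    citations.append({
        "id": c.get("id"),
        "mode": c.get("mode"),
        "note_num": int(c.get("note-num")),
        "hash": int(c.get("hash")),
        # Empty <Citation /> elements have neither child; default to "".
        "prefix": c.findtext("prefix", ""),
        "suffix": c.findtext("suffix", ""),
    })

# The inline content of the Cite follows <citations>, as its tail text.
visible = cite.find("citations").tail

print(citations)
print(visible)
```

Note that `findtext` returns only the direct text of `<prefix>` and `<suffix>`, so a prefix containing nested inline elements would need a fuller traversal.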