|
This change introduces a reader for mdoc, a roff-derived semantic markup
language for manual pages. The two relevant contemporary implementations
of mdoc for manual pages are mandoc (https://mandoc.bsd.lv/), which
implements the language from scratch in C, and groff
(https://www.gnu.org/software/groff/), which implements it as roff macros.
mdoc has a lot of semantics specific to technical manuals that aren't
representable in Pandoc's AST. I've taken a cue from the mandoc HTML
output and many mdoc elements are encoded as Codes or Spans with classes
named for the mdoc macro that produced them.
Much like web browsers with HTML, mandoc attempts to produce best-effort
output given all kinds of weird and crappy mdoc input. Part of the
reason it's able to do this is it uses a very accommodating parse tree
and stateful output routines specialized to the output mode, and when it
encounters some macro it wasn't expecting, it can easily give up on
whatever it was outputting and output something else. I've encoded as
much flexibility as I reasonably could into the mdoc reader here, but I
don't know how to be as flexible as mandoc.
This branch has been developed almost exclusively against mandoc's
documentation and implementation of mdoc as a reference, and the
real-world manual pages tested against are those from the OpenBSD base
system. Of ~3500 manuals in mdoc format shipped with a fresh OpenBSD
install, 17 cause the mdoc reader to exit with a parse error. Any
further chasing of edge cases is deferred to future work.
Many of the tests in test/Tests/Readers/Mdoc.hs are derived directly
from mandoc's extensive regression tests.
[API change] Adds readMdoc to the public API
|