github.com/jgm/pandoc - Pandoc — The universal markup converter

Age	Commit message	Author
2024-12-23	MediaWiki reader: allow empty quoted attributes.	John MacFarlane
	Closes #10490.
2024-12-23	MediaWiki reader: allow cells starting with `+`.	John MacFarlane
	Closes #10491.
2024-12-23	Fix spacing error.	John MacFarlane

2024-12-22	Remove old commented-out line.	John MacFarlane

2024-12-22	Docx writer: better handling of chapters.	John MacFarlane
	When `--top-level-division=chapter` is used, a paragraph with section properties is inserted before each level-1 heading. By default, this causes the new heading to start on a new page (though this default can be adjusted in Word). This change should also make it possible to number footnotes by chapter (#2773), though that change isn't yet made.
2024-12-22	Markdown writer: avoid collapsing of initial/final newline in...	John MacFarlane
	...markdown raw blocks. For motivation see #10477.
2024-12-22	RST reader: handle explicit reference links (#10485)	silby
	This case was missed when changing the reference link strategy for RST to allow a single pass. Closes #10484.
2024-12-20	Correct example in charsInBalanced	Evan Silberman
	The given example wasn't actually functional because `anyChar` parses a `Char` and `charsInBalanced` wants a `Text` parser as its inner parser.
2024-12-20	Mediawiki writer: escape line-initial characters...	John MacFarlane
	...that would otherwise be interpreted as list starts. Closes #9700.
2024-12-20	LaTeX writer: properly handle boolean value for `csquotes` variable.	John MacFarlane
	Closes #10403.
2024-12-19	Mention typst in PandocUnknownWriterError for pdf	Evan Silberman

2024-12-19	Allow `--shift-heading-level-by=-1` to work in djot...	John MacFarlane
	...in the same way it works for other formats (with the top-level heading being promoted to metadata title). This needed special treatment because of the way djot surrounds sections with Divs. Closes #10459.
2024-12-19	T.P.mediaBag insertMedia: fast path for data URIs.	John MacFarlane
	Avoid the slow URI parser from network-uri on large data URIs. See #10075. In a benchmark with a large base64 image in HTML -> docx, this patch causes us to go from 7942 GCs to 3654, and from 3781M in use to 1396M in use. (Note that before the last few commits, this was running 9099 GCs and 4350M in use.)
2024-12-19	T.P.Class: shortcut for base64 data URIs in `downloadOrRead`.	John MacFarlane
	This avoids calling the slow URI parser from network-uri on data URIs, instead calling our own parser. Benchmarks on an html -> docx conversion with large base64 image: GCs from 7942 to 6695, memory in use from 3781M to 2351M, GC time from 7.5 to 5.6. See #10075.
2024-12-19	T.P.URI: pBase64DataURI now returns mime + bytes	John MacFarlane

2024-12-19	T.P.MIME: fix `extensionFromMimeType`.	John MacFarlane
	We had a few special cases encoded, but as previously written they wouldn't work properly with modifiers like `;charset=utf-8`.
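The failure mode described here can be sketched as follows. This is a minimal illustration and not pandoc's code: `baseMimeType`, `extensionFor`, and the two-entry `extensionTable` are hypothetical names, and `String` is used for brevity. The idea is to strip any `;parameter` suffix before consulting the special cases.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (dropWhileEnd)

-- | Drop parameters such as ";charset=utf-8" and normalise case,
-- leaving only the "type/subtype" part of the MIME type.
baseMimeType :: String -> String
baseMimeType = map toLower . dropWhileEnd isSpace . takeWhile (/= ';')

-- | Hypothetical special-case table, for illustration only.
extensionTable :: [(String, String)]
extensionTable =
  [ ("text/plain", "txt")
  , ("image/jpeg", "jpg")
  ]

-- | Look up an extension after stripping modifiers, so that
-- "text/plain" and "text/plain;charset=utf-8" resolve identically.
extensionFor :: String -> Maybe String
extensionFor mime = lookup (baseMimeType mime) extensionTable
```

With these definitions, `extensionFor "text/plain;charset=utf-8"` and `extensionFor "text/plain"` both evaluate to `Just "txt"`, whereas a lookup on the unstripped string would miss the table entry.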
2024-12-19	Change `--template` to allow use of extensionless templates.	John MacFarlane
	The intent is to allow bash process substitution: e.g., `--template <(echo "foo")`. Previously pandoc always added an extension based on the output format, which caused problems with the absolute filenames used by bash process substitution (e.g. `/dev/fd/11`). Now, if the template has no extension, pandoc will first try to find it without the extension, and then add the extension if it can't be found. So, in general, extensionless templates can now be used. But this has been implemented in a way that should not cause problems for existing uses, unless you are using a template `NAME.FORMAT` but happen to have an extensionless file `NAME` in the template search path. Closes #5270.
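The lookup order described above might be sketched like this. `templateCandidates` is a hypothetical helper, not a pandoc function, and the treatment of names that already carry an extension (used as given) is an assumption not stated in this commit message.

```haskell
import System.FilePath (hasExtension, (<.>))

-- | File names to try, in order, for a @--template@ argument.
-- First argument: output format (e.g. "html"); second: the name given.
templateCandidates :: String -> FilePath -> [FilePath]
templateCandidates format name
  | hasExtension name = [name]                   -- assumed: used as given
  | otherwise         = [name, name <.> format]  -- bare name first, then NAME.FORMAT
```

Under this sketch, `templateCandidates "html" "/dev/fd/11"` yields `["/dev/fd/11", "/dev/fd/11.html"]`, so a process-substitution path is found on the first attempt. It also shows the caveat in the commit message: an extensionless file `NAME` on the search path shadows `NAME.FORMAT`.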
2024-12-18	HTML writer: avoid calling parseURIString for data URIs.	John MacFarlane
	This was done to determine the "media category," but we can get that directly from the mime component of data: URIs. Profiling revealed that a significant amount of time was spent in this function when a file contained images with large data URIs. Contributes to addressing #10075.
2024-12-18	Further improvements to base64 data URI parsing.	John MacFarlane
	Text.Pandoc.URI: export `pBase64DataURI`. Modify `isURI` to use this and avoid calling network-uri's inefficient `parseURI` for data URIs. Markdown reader: use T.P.URI's `pBase64DataURI` in parsing data URIs. Partially addresses #10075. Obsoletes #10434 (borrowing most of its ideas). Co-authored-by: Evan Silberman
2024-12-18	Markdown reader: Adjust source position in data: URI parser.	John MacFarlane
	This fixes an omission in the last commit.
2024-12-18	Markdown reader: more efficient base64 data URI parsing.	John MacFarlane
	This patch borrows some code from @silby's PR #10434 and should be regarded as co-authored. This is a lighter-weight patch that only touches the Markdown reader. The basic idea is to speed up parsing of base64 URIs by parsing them with a special path. This should improve the problem noted at #10075. Benchmarks (optimized compilation), converting the large test.md from #10075 (7.6Mb embedded image) from markdown to json: before, 6182 GCs, 1578M in use, 5.471 MUT, 1.473 GC; after, 951 GCs, 80M in use, 0.247 MUT, 0.035 GC. For now we leave #10075 open to investigate improvements in HTML rendering with these large data URIs. Co-authored-by: Evan Silberman
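The "special path" idea can be illustrated with a small sketch. This is not pandoc's `pBase64DataURI`; `splitBase64DataURI` is a hypothetical function on `String`, and it assumes the `data:<mime>[;params];base64,<payload>` shape. It recognises the prefix and splits off the payload without running a general-purpose URI parser over what may be megabytes of base64 text.

```haskell
import Data.List (isSuffixOf, stripPrefix)

-- | Split a base64 data URI into (MIME type, base64 payload).
-- Returns Nothing for anything else, which a caller would then
-- hand to the ordinary (slower) URI parser.
splitBase64DataURI :: String -> Maybe (String, String)
splitBase64DataURI uri = do
  rest <- stripPrefix "data:" uri
  -- Everything before the first comma is the header.
  let (header, afterHeader) = break (== ',') rest
  payload <- stripPrefix "," afterHeader
  if ";base64" `isSuffixOf` header
    then Just (takeWhile (/= ';') header, payload)
    else Nothing
```

Only the short header is inspected character by character for structure; the payload is returned untouched. A real implementation would work on `Text` or `ByteString` rather than `String`.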
2024-12-18	HTML reader: don't canonicalize data: URIs.	John MacFarlane
	It can be very expensive to call network-uri's URI parser on these. See #10075.
2024-12-18	LaTeX reader: handle `figure*` environment as a figure.	John MacFarlane
	Closes #10472.
2024-12-17	Textile reader: improve parsing of spans.	John MacFarlane
	The span needs to be separated from its surroundings by spaces. Also, a span can have attributes, which we now attach. Closes #9878.
2024-12-17	Textile reader: inline constructors don't trigger if closer...	John MacFarlane
	...is preceded by whitespace. Closes #10414.
2024-12-17	LaTeX writer: use displayquote for block quotes with csquotes.	John MacFarlane
	Closes #10456.
2024-12-17	Typst writer: properly handle data: URIs in images.	John MacFarlane
	We need to produce an svg tag and parse it using `image.decode`. This is slightly roundabout but doesn't require any external libraries. Closes #10460.
2024-12-17	Docx writer: use styleIds not styleNames for Title, Subtitle, etc.	John MacFarlane
	This change affects the default openxml template as well as the OpenXML writer. Closes #10282 (regression introduced in pandoc 3.5).
2024-12-17	Text.Pandoc.PDF: fix temp file extension in `toPdfViaTempFile`.	John MacFarlane
	We used to set this to `.html`, but this seemed inappropriate once we started using this function for `--pdf-engine=typst`. So we changed it in pandoc 3.6 to `.source`. But apparently `wkhtmltopdf` needs it to be `.html`. So now we have added a parameter to `toPdfViaTempFile` that allows the extension to be specified. Closes #10468.
2024-12-14	Use lastMay instead of reverse	Joseph C. Sible

2024-12-14	Store a function instead of a Boolean	Joseph C. Sible
	Instead of storing isDisplay and then always choosing displayMath or math based on that, just store displayMath or math directly.
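A generic before/after sketch of this refactoring follows. The record types and the `String`-based `math` and `displayMath` stand-ins are hypothetical, chosen only to keep the example self-contained; the real constructors and the affected module are not shown in this log.

```haskell
-- Stand-ins for the two constructors mentioned in the commit message.
math, displayMath :: String -> String
math s        = "$" ++ s ++ "$"
displayMath s = "$$" ++ s ++ "$$"

-- Before: keep a flag, and branch on it at the point of use.
data Before = Before { isDisplay :: Bool, srcBefore :: String }

renderBefore :: Before -> String
renderBefore b = (if isDisplay b then displayMath else math) (srcBefore b)

-- After: keep the chosen function itself; nothing is left to branch on.
data After = After { mkMath :: String -> String, srcAfter :: String }

renderAfter :: After -> String
renderAfter a = mkMath a (srcAfter a)
```

Both versions render the same result, for instance `renderBefore (Before True "x")` and `renderAfter (After displayMath "x")` give `"$$x$$"`, but the second removes a conditional whose only job was to pick between two functions.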
2024-12-14	Use <$> instead of >>= and return	Joseph C. Sible

2024-12-14	Put the length in the range expression instead of calling take later	Joseph C. Sible

2024-12-14	Remove redundant null check	Joseph C. Sible
	"all f []" is always true, so "null xs || all f xs" can be simplified to just "all f xs".
2024-12-14	Use the definition of unsnoc from base	Joseph C. Sible
	This is more efficient than the existing one.
2024-12-14	Use catMaybes instead of building with maybe and (:) one element at a time	Joseph C. Sible

2024-12-14	Remove several unnecessary layers of indirection from refs	Joseph C. Sible

2024-12-11	Allow YAML bibliographies to be arrays of references.	John MacFarlane
	Previously, they had to be YAML objects with a `references` key. Closes #10452.
2024-12-11	Cosmetic code improvement.	John MacFarlane

2024-12-07	Add copyright info to two modules missing it.	John MacFarlane

2024-12-07	Stylistic tweak.	John MacFarlane

2024-12-07	Ensure that `--sandbox` affects `--embed-resources`.	John MacFarlane
	Previously it did not (contrary to what was implied by the manual), which means that an image with URL `/etc/passwd` would leak an encoded version of that file to HTML output with `--self-contained` or `--embed-resources`, even if `--sandbox` was used. Thanks to Samuel Mortenson for pointing out the issue.
2024-12-07	T.P.App.OutputSettings: add `sandbox'` function.	John MacFarlane
	This computes the sandboxed files from Opt and avoids some code repetition in T.P.App and T.P.App.OutputSettings.
2024-12-07	Docx reader: handle `\b`, `\i`, `\y` modifiers in `XE` index entries.	John MacFarlane
	See #10171.
2024-12-07	HTML reader: parse footnotes defined by dpub-aria roles.	John MacFarlane
	Closes #5294.
2024-12-05	Add mdoc reader	Evan Silberman
	This change introduces a reader for mdoc, a roff-derived semantic markup language for manual pages. The two relevant contemporary implementations of mdoc for manual pages are mandoc (https://mandoc.bsd.lv/), which implements the language from scratch in C, and groff (https://www.gnu.org/software/groff/), which implements it as roff macros. mdoc has a lot of semantics specific to technical manuals that aren't representable in Pandoc's AST. I've taken a cue from the mandoc HTML output and many mdoc elements are encoded as Codes or Spans with classes named for the mdoc macro that produced them. Much like web browsers with HTML, mandoc attempts to produce best-effort output given all kinds of weird and crappy mdoc input. Part of the reason it's able to do this is it uses a very accommodating parse tree and stateful output routines specialized to the output mode, and when it encounters some macro it wasn't expecting, it can easily give up on whatever it was outputting and output something else. I've encoded as much flexibility as I reasonably could into the mdoc reader here, but I don't know how to be as flexible as mandoc. This branch has been developed almost exclusively against mandoc's documentation and implementation of mdoc as a reference, and the real-world manual pages tested against are those from the OpenBSD base system. Of ~3500 manuals in mdoc format shipped with a fresh OpenBSD install, 17 cause the mdoc reader to exit with a parse error. Any further chasing of edge cases is deferred to future work. Many of the tests in test/Tests/Readers/Mdoc.hs are derived directly from mandoc's extensive regression tests. [API change] Adds readMdoc to the public API
2024-12-05	Parameterize Roff escaping	Evan Silberman
	The existing lexRoff does some stuff I don't want to deal with in mdoc just yet, like lexing tbl, and some stuff I won't do at all, like handling macro and text string definitions and switching between modes. Uses a typeclass with associated type families to reuse most of the escaping code between Roff (i.e. man) and Mdoc. Future work could improve on this so that more lexing code could be shared between Man and Mdoc. Mdoc inherits Roff's surface syntax so hypothetically it makes sense to lex it into tokens that make sense for roff. But it happens that the Mdoc parser is much easier to build with an Mdoc specific token stream. Some discussion in jgm/pandoc#10225 about the rationale. Adds a test for the roff \A escape, which I accidentally dropped support for in an earlier iteration without anything complaining.
2024-12-05	Docx reader: improve index reference support.	John MacFarlane
	Support crossrefs. Clean up and unify switch parsing for fields.
2024-12-05	Docx reader: parse index references as empty Spans.	John MacFarlane
	See #10171.
2024-12-01	Fix comments in TEI writer referring to DocBook (#10430)	silby