path: root/src/Text/Pandoc/Readers/Docx
Age / Commit message / Author
2026-01-07Docx reader: handle tables without tblGrid.John MacFarlane
Closes #11380.
2025-12-12Fix some more imports involving foldl'.John MacFarlane
2025-11-30Docx reader: Handle REF link instruction (#11296)Ezwal
This PR handles a common run field instruction (fieldInstr) in the docx format: REF, specifically REF fields with the "link" switch `\h`. In Word, you can create a REF field instruction with the Cross-reference button. Cross-references can point to many things, such as an Equation, Table, Title...
2025-11-24Support pptx (PowerPoint) as an input format.Anton Antich
New module `Text.Pandoc.Readers.Pptx`, exporting `readPptx`. [API change] Factored out some common OOXML functions from Text.Pandoc.Readers.Docx.Util into a non-exported module Text.Pandoc.Readers.OOXML.Shared.
2025-11-12Docx reader: check recursively for caption styles.Albert Krewinkel
The docx reader uses caption styles to identify figures and captioned tables. It now checks for known caption styles in the full style hierarchy of a paragraph instead of checking only the paragraph's own style. This makes it possible to recognize caption styles that are built on top of the basic *caption* style, as is sometimes the case in sophisticated style sets.
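The hierarchy check described in this commit can be sketched roughly as follows. This is a minimal illustration, not pandoc's code: the `Style` record, the `StyleMap` alias, the `captionNames` list, and `isCaptionStyle` are all hypothetical names, and the real reader uses its own style types.

```haskell
import Data.Char (toLower)

-- Hypothetical, simplified style record; pandoc's real types differ.
data Style = Style
  { styleName :: String        -- human-readable name, e.g. "Caption"
  , basedOn   :: Maybe String  -- styleId of the parent style, if any
  }

-- Styles keyed by styleId, kept as an association list to stay within base.
type StyleMap = [(String, Style)]

-- Names treated as caption styles (an assumption made for this sketch).
captionNames :: [String]
captionNames = ["caption", "table caption", "image caption"]

-- Walk up the basedOn chain and return True as soon as the style itself
-- or one of its ancestors has a known caption name. The fuel counter
-- stops the walk if a malformed file has cyclic basedOn references.
isCaptionStyle :: StyleMap -> String -> Bool
isCaptionStyle styles = go (length styles + 1)
  where
    go :: Int -> String -> Bool
    go 0 _ = False
    go fuel sid =
      case lookup sid styles of
        Nothing -> False
        Just st
          | map toLower (styleName st) `elem` captionNames -> True
          | otherwise -> maybe False (go (fuel - 1)) (basedOn st)
```

With a style "FancyCap" whose parent is the basic caption style, `isCaptionStyle` follows one `basedOn` link before it finds a match. A check of the direct style alone would miss it.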
2025-09-17Docx reader: change default for textwidth.John MacFarlane
This should only be used if sectPr is not found.
2025-09-17Docx reader: properly calculate table column widths.John MacFarlane
Previously we assumed that every table took up the full text width. Now we read the text width from the document's sectPr. Closes #9837. Closes #11147.
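As a rough sketch of the arithmetic involved, assuming all measurements are in twips (twentieths of a point) and that the text width is the page width minus the left and right margins taken from sectPr. The function names `pageTextWidth` and `relativeWidths` are hypothetical and do not come from pandoc.

```haskell
-- Text width available on the page, in twips. The three inputs are
-- assumed to come from the section properties: page width, left margin,
-- right margin.
pageTextWidth :: Integer -> Integer -> Integer -> Integer
pageTextWidth pageW leftMargin rightMargin = pageW - leftMargin - rightMargin

-- Express each column width as a fraction of the text width, rather than
-- assuming the table always fills the full width. A non-positive text
-- width yields zeros instead of dividing by zero.
relativeWidths :: Integer -> [Integer] -> [Double]
relativeWidths textWidth colWidths
  | textWidth <= 0 = map (const 0) colWidths
  | otherwise      = [ fromIntegral w / fromIntegral textWidth | w <- colWidths ]
```

For a US Letter page (12240 twips wide) with one-inch margins (1440 twips each), the text width is 9360 twips. A table with columns of 4680 and 2340 twips then occupies fractions 0.5 and 0.25 of the text width, not the full width.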
2025-09-06Docx reader: better handling of AlternateContent.John MacFarlane
This revises the solution to #9214 in commit 2e8ecb3 in order to handle a standard Word way of inserting emojis. Closes #11113.
2025-09-06Partially undo commit 2e8ecb3.John MacFarlane
This was too heavy-handed a fix, and it interferes with processing Word emojis (#11109).
2025-07-27Docx reader: fix `stringToInteger`.John MacFarlane
It previously converted things like `11ccc` to an integer; now it requires that the whole string be parsable as an integer. Closes #9184.
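The difference between the two behaviors can be illustrated with functions from `base`. This is a sketch under assumptions, not the actual pandoc definition: `lenientStringToInteger` is a hypothetical name standing in for the old prefix-accepting behavior, and `readMaybe` is one possible way to require a full parse (note that it also tolerates surrounding whitespace and forms such as hexadecimal literals).

```haskell
import Text.Read (readMaybe)

-- Strict version: 'readMaybe' succeeds only when the entire string
-- parses as an Integer, so trailing junk such as "ccc" gives Nothing.
stringToInteger :: String -> Maybe Integer
stringToInteger = readMaybe

-- Lenient version, resembling the old behavior: accept the first parse
-- of a leading prefix and ignore whatever follows it.
lenientStringToInteger :: String -> Maybe Integer
lenientStringToInteger s =
  case reads s of
    ((n, _) : _) -> Just n
    []           -> Nothing
```

Here `stringToInteger "11ccc"` is `Nothing`, while `lenientStringToInteger "11ccc"` is `Just 11`.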
2025-06-03Docx reader: handle strict OpenXML as well as transitional.John MacFarlane
Closes #7691.
2025-03-19T.P.Readers.Docx.Util: use xml-light's `onlyElems`...John MacFarlane
...instead of defining it again.
2025-02-19Revert "Docx reader and writer: support row heads."John MacFarlane
This reverts commit cbe67b9602a736976ef6921aefbbc60d51c6755a. Word sets `w:firstColumn="1"` by default for tables. You have to find the Table Design tab and explicitly uncheck "First Column" to make this go away. In most cases, I don't think writers intend to designate the first column as a row head, so this commit is going to produce unexpected results. In addition, because of the table normalization done by pandoc-type's `tableWith`, any table containing a colspanned cell in the left-hand column will get broken if the first column is designated a row head. For these reasons it seems best to revert this change, which was made in response to #9495. Closes #10627.
2025-01-10Docx reader and writer: support row heads.John MacFarlane
Reader: When `w:tblLook` has `w:firstColumn` set (or an equivalent bit mask), we set row heads = 1 in the AST. Writer: set `w:firstColumn` in `w:tblLook` when there are row heads. (Word only allows one, so this is triggered by any number of row heads > 0.) Closes #9495.
2025-01-10Docx reader: read table styles as custom styles...John MacFarlane
...when `styles` extension is enabled. Closes #9603. Also improve manual's coverage of custom styles.
2024-12-07Docx reader: handle `\b`, `\i`, `\y` modifiers in `XE` index entries.John MacFarlane
See #10171.
2024-12-05Docx reader: improve index reference support.John MacFarlane
Support crossrefs. Clean up and unify switch parsing for fields.
2024-12-05Docx reader: parse index references as empty Spans.John MacFarlane
See #10171.
2024-10-03Docx reader: reset lists after headers in same list numId.John MacFarlane
Headings in docx, even ones that do not have a visible number, can have a numId, and in odd cases can even share a numId with a list that continues after the header. In this case the list numbering should be reset by the header. To accomplish this, we add a Heading constructor to BodyPart and include on it all the information list items have. Closes #10258.
2024-09-08Remove most uses of partial function 'head'.John MacFarlane
2024-06-12Docx reader: improve handling of captions.John MacFarlane
- Turn captioned images into Figure elements. Closes #9391.
- Improve the logic for associating elements with captions. Closes #9358.
- Ensure that captions that can't be associated with an element aren't just silently dropped. Closes #9610.
2024-06-12Docx reader: rename TblCaption to Capt.John MacFarlane
We'll use this for image captions as well. Word does not really distinguish these.
2024-06-04Docx reader: support task lists.John MacFarlane
This also fixes a small bug in parsing delimiters in numbered lists, which led to the default delimiter being used wrongly in some cases. Closes #8211.
2024-06-04T.P.Readers.Docx.Lists: replace a generic traversal...John MacFarlane
using `bottomUp` with a faster one using `walk`.
2024-06-01Docx reader: react to "left" value on jc attribute.John MacFarlane
Also fix tests.
2024-06-01Docx reader: handle column and cell alignments.John MacFarlane
OpenXML doesn't have a way of indicating column alignments, but we guess them by looking at the justification property on the first paragraph of a cell, if there is one. We take the column alignments from the first body row. Closes #8551.
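The guessing heuristic can be sketched like this. The model is deliberately simplified and hypothetical: a cell is represented only by the `jc` values of its paragraphs, and the `Alignment` type here is a local stand-in for illustration, not an import from pandoc-types.

```haskell
import Data.Maybe (listToMaybe)

-- Local stand-in for an alignment type (defined here for the sketch).
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Eq, Show)

-- A cell is modeled as the jc value of each of its paragraphs
-- (Nothing when a paragraph has no justification property).
type Cell = [Maybe String]
type Row  = [Cell]

-- Map a jc attribute value to an alignment; unknown values fall back
-- to the default.
jcToAlignment :: String -> Alignment
jcToAlignment "left"   = AlignLeft
jcToAlignment "right"  = AlignRight
jcToAlignment "center" = AlignCenter
jcToAlignment _        = AlignDefault

-- Only the first paragraph of a cell is consulted.
cellAlignment :: Cell -> Alignment
cellAlignment cell =
  case listToMaybe cell of
    Just (Just jc) -> jcToAlignment jc
    _              -> AlignDefault

-- Column alignments are taken from the first body row, if there is one.
columnAlignments :: [Row] -> [Alignment]
columnAlignments bodyRows = maybe [] (map cellAlignment) (listToMaybe bodyRows)
```

A cell whose first paragraph has no `jc` value gets the default alignment even if a later paragraph is centered, which matches the "first paragraph of a cell" rule in the commit message.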
2024-06-01Docx reader: allow insertion/deletion to contain arbitrary ParParts...John MacFarlane
...and not just Runs. This fixes a problem wherein comments inside insertions or deletions would be ignored. Closes #9833.
2024-06-01Support HorizontalRule in docx reader.John MacFarlane
We support both the pandoc style and the style described at https://support.microsoft.com/en-us/office/insert-a-horizontal-line-9bf172f6-5908-4791-9bb9-2c952197b1a9. Closes #6285.
2024-06-01T.P.Readers.Docx.Parse: add HRule constructor to BodyPart.John MacFarlane
This paves the way to supporting horizontal rules in the reader. We still need to adjust the parser to create HRule appropriately; so far, this change has no effect, but it's a step on the way to #6285.
2024-04-25Update copyright dates to 2024.John MacFarlane
2024-02-28Docx reader: ensure that table captions are counted.John MacFarlane
Normally these occur outside the table element itself, but they should still be parsed as captions in this case. Closes #9518.
2024-02-28Docx reader: detect caption by style name not id.John MacFarlane
The styleId can change depending on the localization. Partially resolves #9518.
2023-12-26fix(docx): support absolute header/footer pathsEdwin Török
Header and footer references may be absolute in the reference.docx. E.g. editing it with dotnet's Open-XML-SDK causes this error:

```
+ pandoc test.md -t docx --reference-doc referenceh.docx -o test.docx
word//word/header1.xml missing in reference docx
```

There was already code in pandoc to handle relative vs absolute paths in references, so use it.

Signed-off-by: Edwin Török <[email protected]>
2023-12-18Docx reader: fix HYPERLINK with only switch and no argument.John MacFarlane
The argument can apparently be omitted, and then we just have a fragment URL. Closes #9246.
2023-12-11Whitespace fix.John MacFarlane
2023-11-29Docx reader: unwrap content of shaped textboxes...Stephan Meijer
* #9214 text in shape format test document
* #9214 support Text in Shape Format
* #9214 remove irrelevant code
2023-11-28Docx reader: Improve handling of w:sym.John MacFarlane
Add T.P.Readers.Docx.Symbols. This gives us a table to use to resolve characters included in docx via w:sym element. Use this table to resolve characters when symbol fonts are specified. Closes #9220.
2023-11-28Correct comment.John MacFarlane
2023-08-18Docx reader: omit "Table NN" from caption.John MacFarlane
Closes #9002.
2023-07-14Docx reader: use SVG version of image if present.John MacFarlane
Previously the backup PNG was exported even if an SVG was present, but the SVG should be preferred. Closes #7244.
2023-02-18Docx reader: parse image alt texts in LibreOffice generated filesAlbert Krewinkel
LibreOffice tags images slightly differently than Word; this change lets the parser take that difference into account when looking for an image description (alt text).
2023-01-10Update copyright years, it's 2023!Albert Krewinkel
2022-12-11Docx reader: fix handling of oMathPara in w:p with other content.John MacFarlane
Closes #8483. The problem is that oMathPara can either occur at the block-level (child of w:body) or at the inline level (child of w:p, potentially with other content). We need to handle both cases. Previously the code just assumed that if we had a w:p with an oMathPara, the math would be the sole content. This patch removes OMathPara as a constructor of BodyPart and adds it as a constructor of ParPart.
2022-11-19Docx reader: Support parsing of highlighted text.John MacFarlane
2022-10-31First stab at mtl 2.3 compliance.John MacFarlane
This will no doubt produce a bunch of warnings and hence CI failures, which we'll need to work around with explicit imports.
2022-10-16T.P.Parsing: Remove gratuitous renaming of Parsec types.John MacFarlane
We were exporting Parser, ParserT as synonyms of Parsec, ParsecT. There is no good reason for this and it can cause confusion. Also, when possible, we replace imports of Text.Parsec with T.P.Parsing. The idea is to make it easier, at some point, to switch to megaparsec or another parsing engine if we want to. T.P.Parsing new exports: Stream(..), updatePosString, SourceName, Parsec, ParsecT [API change]. Removed exports: Parser, ParserT [API change].
2022-10-15Minor code cleanups.John MacFarlane
2022-09-27Fix small whitespace things.John MacFarlane
2022-08-30Docx reader: mark unnumbered headings with class 'unnumbered'Albert Krewinkel
If a document uses numbered headings, then headings without numbers are marked with class `unnumbered`, the default class used by pandoc to convey this kind of information. The class is not added if none of the headings in a document are numbered. This change ensures good conversion results when converting with `--number-sections`. Closes: #8148
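The rule can be sketched as a small function over a simplified heading model. The `Heading` record and `headingClasses` are hypothetical names introduced for illustration; the actual reader works on its own document types.

```haskell
-- Hypothetical heading model: level, whether the heading carries
-- numbering in the source document, and its title text.
data Heading = Heading
  { hLevel    :: Int
  , hNumbered :: Bool
  , hTitle    :: String
  }

-- Compute the class list for each heading. The "unnumbered" class is
-- only used when at least one heading in the document is numbered;
-- otherwise every heading keeps an empty class list.
headingClasses :: [Heading] -> [[String]]
headingClasses hs
  | any hNumbered hs = [ if hNumbered h then [] else ["unnumbered"] | h <- hs ]
  | otherwise        = map (const []) hs
```

In a document that mixes numbered and plain headings, only the plain ones are tagged, so `--number-sections` leaves them without a number. In a document with no numbered headings at all, nothing is tagged and `--number-sections` numbers every heading as usual.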
2022-02-04Docx reader: parse EN.CITE and EN.REFLIST fields.John MacFarlane