docx-template-translator

by zouchenzhen

Install

Copy this and paste it into Claude Code, Cursor, or any AI assistant:

I want to install the "docx-template-translator" skill in my project.

Please run this command in my terminal:
# Install skill into your project
mkdir -p .claude/skills/docx-template-translator && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/docx-template-translator/SKILL.md "https://raw.githubusercontent.com/zouchenzhen/docx-template-translator-skill/main/skills/docx-template-translator/SKILL.md"

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.

Description

Adaptive conversion of LaTeX, PDF, or Markdown sources into a complete Word .docx that follows a user-supplied .docx template. Use when pandoc --reference-doc alone is not enough — for thesis, dissertation, report, or institutional Word formatting that needs cover pages, declarations, TOC, heading numbering, captions, three-line tables, equations, citations, and visual verification.

Core Idea

Treat the input file as the content source and the Word template as the formatting source. Do not expect pandoc or PDF import to infer template semantics. Build a project-specific Python postprocessor after inspecting the template and the converted body document. Do not treat the bundled starter pipeline or a preset JSON file as a finished converter for institutional templates. For thesis/dissertation templates, you must create or patch a project-specific pipeline for the concrete template and source project before claiming success.

Workflow

• Identify inputs:
  • Source: .tex project, .pdf, .md, or an existing rough .docx.
  • Template: required .docx.
  • Output location and document metadata.
• Inspect the template with the real CLI form:
  • python scripts/inspect_docx_template.py template.docx --out template_report.json
• Create a rough body .docx:
  • LaTeX/Markdown: use pandoc when available.
  • PDF: try Word COM import or pdf2docx; fall back to PDF only when the original source is unavailable.
  • Existing DOCX: use it as the rough body source.
• Write or patch a project-specific Python pipeline:
  • Start from scripts/adaptive_docx_pipeline.py.
  • Copy it into the run/output directory or project workspace before patching; do not edit the bundled script in place for a one-off conversion.
  • Decide, from the template inspection, which template paragraphs/tables/sections are reusable and which are sample placeholders to delete.
  • Mark protected native-template regions before coding. For thesis templates, cover pages, English cover pages, originality/declaration pages, authorization pages, signatures, and their section breaks are protected by default until the first generated abstract/body marker.
  • Replace or fill template front matter such as cover pages, declarations, abstracts, keywords, TOC placeholders, headers, footers, page numbering, and section breaks when the source provides those fields.
  • In protected regions, replace text inside existing paragraphs/runs/tables without deleting and rebuilding the paragraph. Preserve paragraph styles, run fonts/sizes/bold, alignment, spacing, and page breaks unless the user explicitly asks to alter the template.
  • Insert the rough body at the real body start or rebuild the document around the template parts. Do not blindly append the rough body to the end of the template.
  • Copy template front matter if needed.
  • Append rough body content while remapping DOCX relationships.
  • Remap copied style IDs by visible style name before applying formatting; otherwise Heading 1/2/3 can silently become an unrelated template style when source and template style IDs collide.
  • Remap styles to the template's real body, heading, caption, reference, and TOC styles.
  • Scope global formatting passes to generated content only, for example with formatting_start_marker. Never run body-style remapping across native cover/declaration pages.
  • Clean or rebuild section header/footer references when deleting sample template sections; stale back-matter headers such as 致谢 (Acknowledgements) must not appear on body pages.
  • Add or repair figure/table captions, table borders, hyperlinks, bookmarks, citations, and page breaks.
• Finalize with Microsoft Word when available:
  • Use scripts/finalize_word_docx.py to update fields/TOC and export a PDF preview.
• Automated and visual verification:
  • Use scripts/validate_docx_conversion.py final.docx --template template.docx --protected-until "中 文 摘 要" --pdf final.pdf --out validation.json for placeholder/order/header/image/table checks plus protected-front-matter format checks. The marker "中 文 摘 要" is the Chinese abstract heading; choose the real first generated marker for non-Zhengzhou templates.
  • Then run scripts/validate_docx_render.py final.docx --pdf final.pdf --out validation_render.json for render-level checks: TOC field presence, numId↔abstractNum consistency, multilevel heading format, reference-counter independence, body-header static-text leakage, and PDF field-error strings. The structural validator can return PASS while the document is visibly broken; the render validator is what catches "empty TOC", "chapters not auto-numbered", "references start at [47]", "body header still says 致谢", and "STYLEREF prints 错误!使用'开始'选项卡…" (Error! Use the Home tab…).
  • Use scripts/render_pdf_preview.py to inspect cover pages, abstracts, TOC, representative tables, figures, formulas, and references.
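The style-ID remapping step above can be sketched with the standard library alone. This is a minimal illustration, not the code in scripts/adaptive_docx_pipeline.py: the function names and the `styles_xml` demo helper are hypothetical, and it assumes both `word/styles.xml` parts have already been read into strings. It matches paragraph styles by their visible `w:name` rather than by `w:styleId`.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_style_names(styles_xml_text: str) -> dict[str, str]:
    """Map styleId -> lower-cased visible name for each paragraph style."""
    names = {}
    for style in ET.fromstring(styles_xml_text).findall("w:style", NS):
        name = style.find("w:name", NS)
        if style.get(f"{{{W}}}type") == "paragraph" and name is not None:
            style_id = style.get(f"{{{W}}}styleId")
            names[style_id] = name.get(f"{{{W}}}val").strip().lower()
    return names


def build_style_id_remap(source_styles: str, template_styles: str) -> dict[str, str]:
    """Map each source styleId to the template styleId with the same visible name."""
    template_ids = {
        name: style_id
        for style_id, name in paragraph_style_names(template_styles).items()
    }
    return {
        style_id: template_ids[name]
        for style_id, name in paragraph_style_names(source_styles).items()
        if name in template_ids
    }


def styles_xml(*pairs: tuple[str, str]) -> str:
    """Hypothetical demo helper: build a tiny styles.xml from (styleId, name) pairs."""
    body = "".join(
        f'<w:style w:type="paragraph" w:styleId="{sid}"><w:name w:val="{name}"/></w:style>'
        for sid, name in pairs
    )
    return f'<w:styles xmlns:w="{W}">{body}</w:styles>'


# Source styleId "2" is Body Text, but in the template "2" is heading 2:
# copying the ID unchanged would silently turn body paragraphs into headings.
source = styles_xml(("Heading1", "heading 1"), ("2", "Body Text"))
template = styles_xml(("1", "heading 1"), ("2", "heading 2"), ("a0", "Body Text"))
remap = build_style_id_remap(source, template)
print(remap)
```

Source styles with no same-named template style are left out of the map on purpose, so the pipeline can decide explicitly what to do with them instead of inheriting a colliding ID.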

Mandatory Quality Gate

Before reporting success, run an automated and visual QA pass. If any check fails, patch the project-specific pipeline and rerun; do not present the output as complete.
• Confirm the rough body is not appended after a back-matter placeholder such as 致谢, Acknowledgements, 参考文献 (References), or sample appendices.
• Confirm template placeholder text is gone or intentionally preserved. Common failures include placeholder names like 李四, 王五, 张三, red formatting instructions, lorem ipsum, sample chapter headings, and template-only reference lists.
• Confirm source metadata and source front matter replaced the template placeholders: title, author, advisor, major/department, date, Chinese abstract, English abstract, keywords, and declarations when applicable.
• Confirm protected front matter still matches the template's formatting. Content may change, but cover/declaration/signature pages must preserve paragraph styles, run-level fonts/sizes/bold, spacing, alignment, and page-break structure unless explicitly modified.
• Confirm TOC entries point to the generated source chapters, not only to the template's sample chapters.
• Confirm heading paragraphs are still heading styles after OOXML insertion; style ID collisions must not break TOC generation.
• Confirm body pages use the intended body style and do not inherit the last template section's header/footer.
• Confirm representative images, formulas, tables, captions, references, and citations survive the reconstruction.
• Record failures in the run report with PASS/FAIL/PARTIAL wording and concrete evidence.
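The placeholder check in this list can be approximated with a short script. The sketch below is an assumption-laden illustration rather than the logic of scripts/validate_docx_conversion.py: `find_placeholders`, `paragraph_texts`, and the `make_docx` demo helper are hypothetical names, and the needle list covers only the examples named above. It joins the runs of each paragraph before searching, because Word often splits a visible string across several `w:r` elements.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PLACEHOLDERS = ("李四", "王五", "张三", "lorem ipsum")


def paragraph_texts(document_xml: bytes) -> list[str]:
    """Return the concatenated run text of every w:p in document.xml."""
    root = ET.fromstring(document_xml)
    return [
        "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        for p in root.iter(f"{{{W}}}p")
    ]


def find_placeholders(docx_file, needles=PLACEHOLDERS) -> list[tuple[int, str, str]]:
    """Return (paragraph index, needle, paragraph text) for each leftover placeholder."""
    with zipfile.ZipFile(docx_file) as zf:
        texts = paragraph_texts(zf.read("word/document.xml"))
    hits = []
    for index, text in enumerate(texts):
        for needle in needles:
            if needle.lower() in text.lower():
                hits.append((index, needle, text))
    return hits


def make_docx(paragraphs: list[list[str]]) -> io.BytesIO:
    """Hypothetical demo helper: an in-memory zip holding only word/document.xml.

    This is not a complete .docx package; it is just enough for the scan above.
    """
    body = "".join(
        "<w:p>" + "".join(f"<w:r><w:t>{run}</w:t></w:r>" for run in runs) + "</w:p>"
        for runs in paragraphs
    )
    xml = f'<w:document xmlns:w="{W}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("word/document.xml", xml.encode("utf-8"))
    buffer.seek(0)
    return buffer


# The placeholder name is split across two runs, as often happens in real files.
hits = find_placeholders(make_docx([["Author: 李", "四"], ["Chapter 1 Introduction"]]))
print(hits)
```

A non-empty result is a FAIL unless the placeholder was intentionally preserved; the paragraph index and text give the concrete evidence the run report asks for.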

Render-level Quality Gate (`validate_docx_render.py`)

The structural quality gate above checks counts and presence. It can return PASS while the rendered Word/PDF is visibly broken because pandoc-derived DOCX bodies often ship with a TOC paragraph that has no field, a Heading 1 style with no <w:numPr>, a numId rebound to a single-level abstract during reference repair, or a body section header whose static text is "致谢" (Acknowledgements). Run validate_docx_render.py after validate_docx_conversion.py to catch those:
• TOC field presence: <w:fldChar w:fldCharType="begin"> plus <w:instrText> TOC . If absent, Word's "update fields" cannot populate a non-existent TOC. Use scripts/inject_toc_field.py to add one before finalization.
• numId ↔ abstractNum consistency: every (numId, ilvl) pair used by a paragraph or by a style's <w:numPr> must resolve to a defined <w:lvl ilvl=N> inside the bound abstract numbering. Missing levels silently fall back to level 0; that is how 1.1 / 1.1.1 headings collapse to [1] after a reference repair re-points numId=1 at a single-level abstract.
• Multilevel heading format: the abstract numbering bound to Heading 1 (whether at style level or via inline numPr on body H1 paragraphs) must have lvlText matching the user-supplied chapter prefix pattern (default 第%1章 or Chapter %1) at level 0 and a multilevel pattern (default contains both %1 and %2) at levels 1/2. Configure with --chapter-prefix-pattern and --multilevel-pattern for non-default templates.
• Reference counter independence: any non-heading paragraph appearing after the last 参考文献 / References Heading 1 must not reuse a numId already used by Heading 1/2/3. This is the bug where 33 references render as [47]–[79] because their counter was shared with H2/H3 paragraphs upstream.
• Body header is not a back-matter literal: for every body section that uses a <w:headerReference>, the referenced headerN.xml must either contain a Word field (<w:fldChar>) or have static text that does not equal 致谢 / Acknowledgements / 参考文献 (References) / 附录 (Appendix) / 攻读学位期间… (During the degree program…). The recommended fix is scripts/set_styleref_header.py --style-id 1 so the header dynamically shows the current chapter title.
• PDF field errors absent: scan the exported PDF for the localized field-error strings (错误!, Error!, !Reference source not found, !未找到引用源). These appear when STYLEREF/PAGEREF/REF can't resolve their target and are an immediate FAIL.
• Figure count vs source (--source-latex-dir <dir> or --min-figures <N>): rendered <w:drawing> count must be ≥ the LaTeX project's \includegraphics count. Pandoc silently drops figures whose path lacks an explicit extension (e.g. \includegraphics{thesis_structure}) when the basename also matches a vector file (.pdf / .vsdx). The structural validator counts what was embedded; this check tells you what should have been embedded.
• Table border style (--expected-table-style three-line for Chinese-thesis templates, default any): every data table must classify as a recognizable three-line layout (top heavy, header-row bottom thin, last-row bottom heavy, vertical edges nil) or a tblStyle that may carry borders. A docx that contains 20 borderless tables when the template requires three-line tables is the failure mode this catches. Layout/wrapper tables (≤ --table-min-data-rows rows) are skipped.
• Citation coverage (--source-latex-dir <dir> or `--min-citati
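The first two render-level checks (TOC field presence and numId ↔ abstractNum consistency) can be sketched as follows. This is a simplified illustration and not the implementation in validate_docx_render.py: the function names are hypothetical, `<w:lvlOverride>` elements are ignored, and the XML parts are assumed to be already extracted from the .docx as strings. Passing the text of styles.xml as the first argument of `unresolved_levels` covers the style-level <w:numPr> case in the same way.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def q(local: str) -> str:
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{local}"


def has_toc_field(document_xml: str) -> bool:
    """True if the part has a fldChar 'begin' and an instrText starting with TOC."""
    root = ET.fromstring(document_xml)
    has_begin = any(
        fc.get(q("fldCharType")) == "begin" for fc in root.iter(q("fldChar"))
    )
    has_toc_instr = any(
        (it.text or "").strip().upper().startswith("TOC")
        for it in root.iter(q("instrText"))
    )
    return has_begin and has_toc_instr


def defined_levels(numbering_xml: str) -> dict[str, set[str]]:
    """Map each numId to the set of ilvl values defined in its abstractNum."""
    root = ET.fromstring(numbering_xml)
    abstract_levels = {
        a.get(q("abstractNumId")): {lvl.get(q("ilvl")) for lvl in a.findall("w:lvl", NS)}
        for a in root.findall("w:abstractNum", NS)
    }
    levels = {}
    for num in root.findall("w:num", NS):
        ref = num.find("w:abstractNumId", NS)
        bound = ref.get(q("val")) if ref is not None else None
        levels[num.get(q("numId"))] = abstract_levels.get(bound, set())
    return levels


def unresolved_levels(part_xml: str, numbering_xml: str) -> list[tuple[str, str]]:
    """Return sorted (numId, ilvl) pairs used in part_xml but undefined in numbering."""
    levels = defined_levels(numbering_xml)
    missing = set()
    for num_pr in ET.fromstring(part_xml).iter(q("numPr")):
        num_id_el = num_pr.find("w:numId", NS)
        if num_id_el is None or num_id_el.get(q("val")) == "0":
            continue  # numId 0 switches numbering off
        ilvl_el = num_pr.find("w:ilvl", NS)
        ilvl = ilvl_el.get(q("val")) if ilvl_el is not None else "0"
        num_id = num_id_el.get(q("val"))
        if ilvl not in levels.get(num_id, set()):
            missing.add((num_id, ilvl))
    return sorted(missing)


# Demo data: numId 1 is bound to an abstract that only defines level 0,
# but a paragraph asks for level 1 (the "1.1 collapses to [1]" situation).
NUMBERING = (
    f'<w:numbering xmlns:w="{W}">'
    '<w:abstractNum w:abstractNumId="0"><w:lvl w:ilvl="0"/></w:abstractNum>'
    '<w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>'
    '</w:numbering>'
)
DOCUMENT = (
    f'<w:document xmlns:w="{W}"><w:body><w:p><w:pPr><w:numPr>'
    '<w:ilvl w:val="1"/><w:numId w:val="1"/>'
    '</w:numPr></w:pPr></w:p></w:body></w:document>'
)
TOC_DOCUMENT = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    '<w:r><w:instrText> TOC \\o "1-3" \\h </w:instrText></w:r>'
    '<w:r><w:fldChar w:fldCharType="end"/></w:r>'
    '</w:p></w:body></w:document>'
)

print(has_toc_field(DOCUMENT), has_toc_field(TOC_DOCUMENT))
print(unresolved_levels(DOCUMENT, NUMBERING))
```

Any pair returned by `unresolved_levels` is a candidate FAIL for the numbering check, and a `False` from `has_toc_field` means Word has no TOC field to update during finalization.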

Health Signals

Maintenance: Active (committed 15d ago)
Adoption: Niche (46 ★, under 100 stars)
Docs: Well-documented (README + description)

GitHub Signals

Stars: 46
Forks: 5
Issues: 1
Updated: 15d ago
License: Apache-2.0

Works With

Claude Code