performance: image processing optimizations #638

Merged
JackByrne merged 11 commits into elapouya:dev from start-software:develop
May 18, 2026

@JackByrne
Collaborator

Summary

This pull request introduces substantial performance optimizations for inline image handling within the docxtpl library.

The changes focus on reducing redundant XML generation, file I/O, hashing, and image processing during template rendering. Together, these improvements dramatically reduce rendering times for image-heavy documents.

In a real-world document containing approximately 850 images, rendering time fell from 45–50 seconds to approximately 2–3 seconds, roughly a 15–25× speedup.


Key Improvements

Inline Image XML Generation Optimizations

  • Added a pre-built inline image XML template (_INLINE_IMAGE_XML) generated once at module load time.
  • Image XML is now produced using lightweight str.format() operations instead of repeatedly invoking CT_Inline.new_pic_inline().
  • This avoids expensive XML parsing and object construction for every image insertion.
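The template approach in the bullets above can be illustrated with a minimal, standard-library-only sketch. The template here is a heavily abbreviated stand-in, not the actual `_INLINE_IMAGE_XML` (which the PR derives from `CT_Inline`); the names `INLINE_TEMPLATE` and `render_inline` and the element layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hypothetical, abbreviated stand-in for _INLINE_IMAGE_XML.
# Built once at import time; only the {placeholders} vary per image.
INLINE_TEMPLATE = (
    f'<wp:inline xmlns:wp="{WP_NS}" xmlns:a="{A_NS}" xmlns:r="{R_NS}">'
    '<wp:extent cx="{cx}" cy="{cy}"/>'
    '<wp:docPr id="{shape_id}" name="Picture {shape_id}" descr="{filename}"/>'
    '<a:blip r:embed="{rId}"/>'
    "</wp:inline>"
)


def render_inline(shape_id, rId, filename, cx, cy):
    """Fill the prebuilt template; no XML parsing or object construction."""
    return INLINE_TEMPLATE.format(
        shape_id=shape_id,
        rId=rId,
        # Treat a missing filename as "" and escape it for attribute use,
        # including double quotes.
        filename=escape(filename or "", {'"': "&quot;"}),
        cx=int(cx),
        cy=int(cy),
    )


if __name__ == "__main__":
    xml = render_inline(1001, "rId7", 'team "photo" & logo.png', 914400, 457200)
    ET.fromstring(xml)  # raises ParseError if the result were malformed
    print(xml)
```

Because user-controlled values are substituted into raw XML text, escaping the filename is what keeps the output well-formed; any literal braces in a real template would also need doubling for `str.format()`.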

Inline Image Caching

  • Updated InlineImage._insert_image() to cache the expensive per-image results (relationship ID, scaled dimensions, escaped filename) instead of recomputing them on every insertion.
  • Cache keys are based on:
    • document part
    • image descriptor
    • width
    • height

This prevents repeated:

  • file reads
  • image hashing
  • XML generation
  • relationship creation

for images reused throughout a document.
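A rough sketch of this kind of keyed cache follows. The helper names (`make_cache_key`, `cached_image_meta`) and the `load` callback are hypothetical, and the rule of skipping the cache for unhashable descriptors is taken from the commit descriptions rather than from the code itself.

```python
def make_cache_key(part, descriptor, width, height):
    """Return a dict key, or None when the descriptor cannot be hashed."""
    key = (part, descriptor, width, height)
    try:
        hash(key)
    except TypeError:
        return None  # unhashable descriptor: caller skips the cache
    return key


def cached_image_meta(cache, part, descriptor, width, height, load):
    """Look up per-image metadata, calling `load` only on a cache miss.

    `load` stands in for the expensive work (file read, header parsing,
    relationship creation) and is an assumption of this sketch.
    """
    key = make_cache_key(part, descriptor, width, height)
    if key is not None and key in cache:
        return cache[key]
    meta = load(descriptor, width, height)
    if key is not None:
        cache[key] = meta
    return meta


if __name__ == "__main__":
    calls = []

    def fake_load(descriptor, width, height):
        calls.append(descriptor)
        return {"rId": "rId%d" % len(calls), "cx": width, "cy": height}

    cache = {}
    for _ in range(3):
        cached_image_meta(cache, "body", "logo.png", 100, 50, fake_load)
    print(len(calls))  # the loader ran once for three insertions
```

Including width and height in the key means the same file inserted at two sizes is treated as two entries, which keeps cached dimensions from leaking between insertions.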


Internal Image Part Deduplication

Fast Image Lookup & Reuse

Added:

  • _image_cache
  • _init_image_parts_index()
  • _get_or_add_image_part()

to support fast, O(1) image deduplication and retrieval.

Improvements over Default python-docx Behaviour

The new implementation bypasses the default python-docx image deduplication mechanism, which recomputes a SHA1 hash of each image and linearly scans the package's existing image parts on every insertion.

Instead:

  • image parts are indexed by file path
  • previously inserted images are reused directly
  • duplicate image processing is avoided entirely

This significantly improves rendering performance for templates containing many images.
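The path-keyed lookup can be sketched as a toy model. `ImagePartIndex` is a hypothetical class that tracks only part names, not real python-docx `ImagePart` objects, and the `.bin` extension for non-path descriptors is an arbitrary choice for this sketch.

```python
import os


class ImagePartIndex:
    """Toy stand-in for a descriptor-keyed image part index."""

    def __init__(self, start_index=0):
        self._by_path = {}           # file path -> partname, O(1) lookup
        self._counter = start_index  # sequential partname counter
        self.parts = []              # every part added to the "package"

    def get_or_add(self, descriptor):
        is_path = isinstance(descriptor, str)
        if is_path and descriptor in self._by_path:
            return self._by_path[descriptor]  # reuse without hashing content
        self._counter += 1
        ext = os.path.splitext(descriptor)[1] if is_path else ".bin"
        partname = f"/word/media/image{self._counter}{ext}"
        self.parts.append(partname)
        if is_path:
            # Only string paths are cached; file-like objects always
            # produce a new part.
            self._by_path[descriptor] = partname
        return partname


if __name__ == "__main__":
    index = ImagePartIndex()
    first = index.get_or_add("photos/site.png")
    second = index.get_or_add("photos/site.png")
    print(first == second, len(index.parts))
```

One trade-off of keying by path instead of content hash is that two different paths holding identical bytes are stored as separate parts, and a file modified between insertions would still resolve to the first version.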


Reduced File I/O and Processing Overhead

The _get_or_add_image_part() implementation ensures:

  • each unique image file is only added to the document package once
  • duplicate image relationships are reused
  • unnecessary hashing and binary processing are avoided

This results in substantially lower CPU and I/O overhead during rendering.


Real-World Performance Impact

| Scenario | Before | After |
| --- | --- | --- |
| Document containing ~850 images | ~45–50 seconds | ~2–3 seconds |

These optimizations provide major performance improvements for image-heavy templates while preserving existing rendering behaviour and compatibility.

JackByrne added 11 commits May 18, 2026 15:58
Avoid calling python-docx per-image by generating a CT_Inline-based XML template once and using str.format() to fill sentinels (keeping compatibility with installed python-docx). Add caching of generated image XML per (part, descriptor, width, height) to skip repeated I/O, SHA1 work and header parsing. Use package.get_or_add_image_part and relate_to with RT.IMAGE, compute scaled_dimensions, assign shape_id from docx_ids_index, and xml-escape filenames. Also add a _image_cache dict on DocxTemplate and adjust hyperlink handling to use the local part variable.
Add an O(1) SHA1 index for image parts and a fast _get_or_add_image_part helper on DocxTemplate to avoid python-docx's O(n) linear scan and repeated SHA1 recomputation. Initialize the index in the constructor (_init_image_parts_index), seed it from existing image parts, and maintain a sequential partname counter to prevent partname collisions. Update InlineImage to call tpl._get_or_add_image_part (which returns (image_part, image)) instead of package.get_or_add_image_part, and use the returned Image object. This improves performance and reduces redundant SHA1 work when inserting/looking up images.
Replace the SHA1-based image-part index with a descriptor-keyed cache (_image_descriptor_index) to deduplicate images by file-path (O(1)) and avoid expensive SHA1 hashing. For string path descriptors the cache is used to return existing (image_part, image) tuples; non-string descriptors (e.g. file-like objects) fall back to always creating a new part. Keeps sequential partname assignment and appends new ImagePart to the package; caches the result for string descriptors. This improves performance when adding many images (e.g. large photos) by eliminating repeated SHA1 computation.
Cache only the expensive image metadata (rId, dimensions, filename) per (part, descriptor, width, height) instead of the full inline XML. A fresh shape_id is now assigned for every insertion so drawing IDs remain unique (important for headers/footers/footnotes which aren't renumbered by fix_docpr_ids()). This preserves performance benefits (avoids repeated image part lookup, hashing and header parsing) while preventing duplicate drawing IDs; cx/cy are stored as ints and filename is xml-escaped when cached.
Use id() for non-hashable image descriptors (e.g. file-like objects) when building the image cache key to avoid TypeError on dict lookup. Also escape double quotes in image filenames for XML attribute usage by passing a mapping to xml_escape so quotes become &quot;. Cache semantics and per-insertion shape_id assignment are otherwise unchanged.
Avoid using len() of image parts to pick the next image partname index, which could collide when numbering is non-contiguous. Instead scan existing image partnames (using partname.baseURI when available, otherwise str(partname)), extract numeric suffixes with a regex (/image(\d+)\.), track the maximum index, and set the image part counter to that max. This ensures new image partnames won't reuse an already-present index.
Replace conditional use of partname.baseURI with a direct str(partname) conversion when iterating image parts. This makes the code rely on a consistent string representation for part names (used by the /imageN.ext regex) and avoids depending on the presence of a baseURI attribute across different part implementations.
Replace the hardcoded docx_ids_index initialization with a routine that scans all package parts (body, headers, footers, footnotes) for wp:docPr elements and sets the counter above the maximum found id (minimum 1000). This prevents id collisions when inserting new drawings into parts that were not renumbered by fix_docpr_ids. The new method is called during initialization and safely skips non-XML or unreadable parts.
Treat image.filename == None (e.g., BytesIO/file-like descriptors) as an empty string before calling xml_escape so XML attribute generation matches python-docx behavior. Added a clarifying comment and ensure the escaped filename is stored in the cache to avoid None-related issues when rendering.
Only build and use a cache key when the image_descriptor is hashable. Previously id() was used for non-hashable descriptors (e.g. file-like objects), which could risk aliasing after GC and lead to incorrect deduplication. Now the code attempts to construct a cache key with the descriptor and falls back to skipping caching for unhashable descriptors; cache entries are only read/written when a valid cache_key exists. Filename normalization and per-insertion shape_id behavior are unchanged.
performance: image processing optimizations
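Two of the commits above seed a counter by scanning for the highest identifier already present: image partname indices and wp:docPr ids. A minimal sketch of both scans is shown here. The function names are hypothetical, plain strings and bytes stand in for package parts, and reading "above the maximum found id (minimum 1000)" as `max(highest + 1, 1000)` is this sketch's interpretation.

```python
import re
import xml.etree.ElementTree as ET

WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
_IMAGE_INDEX = re.compile(r"/image(\d+)\.")


def max_image_index(partnames):
    """Highest N among partnames shaped like .../imageN.ext (0 if none)."""
    highest = 0
    for name in partnames:
        match = _IMAGE_INDEX.search(str(name))
        if match:
            highest = max(highest, int(match.group(1)))
    return highest


def next_docpr_id(xml_blobs, minimum=1000):
    """First id above every wp:docPr id found, but never below `minimum`."""
    highest = 0
    for blob in xml_blobs:
        try:
            root = ET.fromstring(blob)
        except (ET.ParseError, TypeError, ValueError):
            continue  # skip non-XML or unreadable parts
        for element in root.iter("{%s}docPr" % WP_NS):
            raw = element.get("id", "")
            if raw.isdigit():
                highest = max(highest, int(raw))
    return max(highest + 1, minimum)


if __name__ == "__main__":
    # Non-contiguous numbering: len() + 1 would yield image3 and collide.
    names = ["/word/media/image2.png", "/word/media/image3.png"]
    print(max_image_index(names) + 1)
```

Taking the maximum instead of the count is what makes the allocation safe when earlier images were deleted or numbered with gaps.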
JackByrne self-assigned this May 18, 2026
JackByrne merged commit 177822b into elapouya:dev May 18, 2026
0 of 5 checks passed