Problem
Presentation() parses every part eagerly. For a 200-slide media-heavy deck, every image is fully decoded, every chart is fully parsed, and the entire OOXML tree sits in memory before the user calls a single method. Issues report "poor performance when creating a big presentation" (scanny/python-pptx#644, 3c), unclosed-file ResourceWarnings (#461), and Docker container errors (#796). For comparison: the recently-launched office-oxide Rust extractor is ~46× faster than python-pptx for read-only text extraction. We're not aiming to match that — we want to write — but lazy loading and image deduplication are real wins for the largest workloads.
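Sub-features
- Presentation() reads only the rels graph; individual slide parts are parsed on first access.
- Copied slides reuse the existing image1.png part instead of duplicating it.
- When the same image is passed to add_picture more than once, only one part is created (already partial; extend across cross-package merging).
- pres.save(stream) does not require the full document tree to be assembled in memory before writing; chunked zip-stream.
- All ZipFile objects are closed, eliminating ResourceWarning: unclosed file.
- Presentation(...) gains a profile=True mode that emits per-part parse/serialize timings to stderr.
- tests/bench/ with a reference 200-slide media-heavy fixture and threshold-asserting microbenchmarks.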
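A minimal sketch of the parse-on-first-access idea behind lazy slide loading. LazyPart and its parse_count counter are hypothetical names invented for illustration, not python-pptx API, and the standard-library ElementTree parser stands in for whatever parser the real parts use.

```python
import xml.etree.ElementTree as ET
from functools import cached_property


class LazyPart:
    """Hypothetical part wrapper: keeps raw bytes, parses XML on first access."""

    parse_count = 0  # class-wide counter, only here to make the laziness observable

    def __init__(self, partname: str, blob: bytes):
        self.partname = partname
        self._blob = blob

    @cached_property
    def element(self) -> ET.Element:
        # Runs once per instance; the result is cached on the instance afterwards.
        LazyPart.parse_count += 1
        return ET.fromstring(self._blob)


# Three stand-in slide parts; nothing is parsed at construction time.
parts = [
    LazyPart(f"/ppt/slides/slide{i}.xml", f"<sld id='{i}'/>".encode())
    for i in range(1, 4)
]
parsed_before = LazyPart.parse_count
second_slide_id = parts[1].element.get("id")  # first access triggers the parse
parts[1].element                              # cached, no second parse
parsed_after = LazyPart.parse_count
```

Only the slide that was touched gets parsed; the other two stay as raw bytes. A real implementation would also have to decide when a parsed tree may be dropped again, which this sketch leaves out.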
Prior art
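- yfedoseev/office_oxide: Rust read-only extractor (reference for what's possible).
- Open PRs: none directly addressing lazy load.
- Forks: Touzen and loadfix branches contain iter_leaf_shapes()-style ergonomics that overlap with lazy traversal.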
- User issues this would close: #327, #461, #478, #548, #644, #732, #796, #813.
- POI parity: XSLF lazy-loads slide parts via XSLFSlide.getXmlObject() (proof that it's possible inside an OOXML library).
- Code paths: src/pptx/opc/package.py, src/pptx/opc/serialized.py, src/pptx/parts/image.py.
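Image deduplication would presumably hook in around the image code path listed above. The sketch below shows content-hash deduplication in isolation. ImageStore and get_or_add are hypothetical names, the .png extension is hard-coded for brevity, and SHA-1 is used only as a content fingerprint.

```python
import hashlib


class ImageStore:
    """Hypothetical registry that keeps one image part per distinct byte content."""

    def __init__(self):
        self._by_digest = {}  # SHA-1 hex digest -> partname

    def get_or_add(self, blob: bytes) -> str:
        digest = hashlib.sha1(blob).hexdigest()
        partname = self._by_digest.get(digest)
        if partname is None:
            # New content: allocate the next partname (extension assumed for illustration).
            partname = f"/ppt/media/image{len(self._by_digest) + 1}.png"
            self._by_digest[digest] = partname
        return partname

    def __len__(self):
        return len(self._by_digest)


store = ImageStore()
logo = b"\x89PNG fake logo bytes"
names = {store.get_or_add(logo) for _ in range(50)}   # same bytes, 50 times
other = store.get_or_add(b"\x89PNG a different image")
```

Fifty additions of the same bytes resolve to a single partname, which is the behaviour the image dedup acceptance test asks for. Cross-package merging would need the same lookup keyed on the destination package.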
Acceptance criteria
- Opening a 200-slide media-heavy benchmark deck completes in ≤30% of current wall time and uses ≤50% of current peak RSS.
- Image dedup test: copying 50 slides each containing the same logo produces exactly 1 image part.
- No ResourceWarning raised in the test suite under python -W error::ResourceWarning.
- The existing 2986 pytest tests continue to pass.
- Behave scenarios cover the benchmark thresholds.
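The ResourceWarning criterion above could be checked along these lines. This is a sketch only: read_member is a hypothetical helper, and the archive is a tiny stand-in zip rather than a real .pptx. The point is that a with-block closes the archive handle deterministically, and that the check records warnings instead of relying on interpreter flags.

```python
import gc
import os
import tempfile
import warnings
import zipfile


def read_member(path: str, name: str) -> bytes:
    """Read one zip member; the with-block closes the archive handle on exit."""
    with zipfile.ZipFile(path) as zf:
        return zf.read(name)


# Build a tiny stand-in package on disk (not a real .pptx).
fd, pkg_path = tempfile.mkstemp(suffix=".zip")
os.close(fd)
with zipfile.ZipFile(pkg_path, "w") as zf:
    zf.writestr("ppt/slides/slide1.xml", "<sld/>")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record everything, including ResourceWarning
    blob = read_member(pkg_path, "ppt/slides/slide1.xml")
    gc.collect()                     # give leaked handles a chance to warn
resource_warnings = [w for w in caught if issubclass(w.category, ResourceWarning)]
os.remove(pkg_path)
```

Running the whole suite under python -W error::ResourceWarning remains the stricter gate; the in-test recording is useful for pinning the check to one code path.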
Effort: L
Cross-cutting, requires careful re-plumbing of OpcPackage parsing. Recommend benchmark-first delivery: ship the bench suite as Phase A so improvements are measurable.
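A threshold-asserting microbenchmark for Phase A might look roughly like this. measure and within_budget are hypothetical helpers, eager and lazy are stand-in workloads rather than python-pptx calls, and tracemalloc peak is used as a proxy: it tracks Python-level allocations, not process RSS, so the real suite would need a separate RSS measurement.

```python
import time
import tracemalloc


def measure(fn, *args):
    """Return (result, wall_seconds, peak_traced_bytes) for one call of fn."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def within_budget(baseline, candidate, ratio):
    """True when candidate is at most `ratio` times the baseline."""
    return candidate <= baseline * ratio


def eager(n):
    # Stand-in for an eager open: materialises every blob up front.
    return [bytes(1024) for _ in range(n)]


def lazy(n):
    # Stand-in for a lazy open: returns a generator and allocates nothing yet.
    return (bytes(1024) for _ in range(n))


_, eager_s, eager_peak = measure(eager, 2000)
_, lazy_s, lazy_peak = measure(lazy, 2000)
memory_ok = within_budget(eager_peak, lazy_peak, 0.50)
```

Wall-time ratios can be checked with the same within_budget helper, though timing assertions usually need several repetitions and a tolerance to stay stable on shared CI machines.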