[Epic] Performance & Scale — lazy load, image dedup, large-file handling

## Problem

`Presentation()` parses every part eagerly. For a 200-slide media-heavy deck, every image is fully decoded, every chart is fully parsed, and the entire OOXML tree sits in memory before the user calls a single method. Issues report "poor performance when creating a big presentation" ([scanny/python-pptx#644](https://github.com/scanny/python-pptx/issues/644), 3c), unclosed-file `ResourceWarning`s ([#461](https://github.com/scanny/python-pptx/issues/461)), and Docker container errors ([#796](https://github.com/scanny/python-pptx/issues/796)). For comparison, the recently launched [`office-oxide`](https://pypi.org/project/office-oxide/) Rust extractor is **~46× faster** than python-pptx for read-only text extraction. Matching that is not the goal, since python-pptx also has to write, but lazy loading and image deduplication are real wins for the largest workloads.

## Sub-features

- [ ] Lazy part loading: `Presentation()` reads only the rels graph; individual slide parts are parsed on first access
- [ ] Image-blob deduplication across packages: when copying slides between decks (overlaps with Slide CRUD epic), shared image blobs reuse the same `image1.png` part instead of duplicating
- [ ] Image-blob deduplication within a session: when the same image is added twice via `add_picture`, only one part is created (partially implemented already; extend it to cover cross-package merges)
- [ ] Streaming write: `pres.save(stream)` does not require the full document tree to be assembled in memory before writing; chunked zip-stream
- [ ] Resource cleanup: ensure `ZipFile` objects are closed, eliminate `ResourceWarning: unclosed file`
- [ ] Profiling instrumentation: optional `Presentation(..., profile=True)` mode that emits per-part parse/serialize timings to stderr
- [ ] Benchmark suite: `tests/bench/` with reference 200-slide media-heavy fixture and threshold-asserting microbenchmarks
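
To make the lazy-loading and resource-cleanup items concrete, here is a minimal sketch using only the standard-library `zipfile` module. `LazyPackage`, `part_blob`, and `build_demo_package` are hypothetical names for illustration, not existing python-pptx APIs; the real change would live in `OpcPackage` and would also need to walk the rels graph, which this sketch skips.

```python
import io
import zipfile


class LazyPackage:
    """Hypothetical sketch: index the zip on open, read member bytes on first access."""

    def __init__(self, file):
        self._zip = zipfile.ZipFile(file)
        # namelist() is built from the zip central directory; no member
        # payload is decompressed at this point.
        self._names = set(self._zip.namelist())
        self._cache = {}
        self.loads = 0  # number of members actually read, for demonstration

    def part_blob(self, partname):
        """Return the bytes of `partname`, reading them only once."""
        if partname not in self._names:
            raise KeyError(partname)
        if partname not in self._cache:
            self._cache[partname] = self._zip.read(partname)
            self.loads += 1
        return self._cache[partname]

    def close(self):
        # Closing explicitly is what avoids "ResourceWarning: unclosed file".
        self._zip.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


def build_demo_package(slide_count=3):
    """Build a tiny in-memory zip that stands in for a .pptx file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(1, slide_count + 1):
            zf.writestr(f"ppt/slides/slide{i}.xml", f"<sld n='{i}'/>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    with LazyPackage(build_demo_package()) as pkg:
        print(pkg.loads)  # nothing read yet
        pkg.part_blob("ppt/slides/slide2.xml")
        print(pkg.loads)  # only the requested slide was read
```

The context-manager form ties the lifetime of the `ZipFile` to the package object, so both goals (deferred reads and deterministic close) fall out of the same design.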

## Prior art

- **Open PRs:** none directly addressing lazy load.
- **Forks:**
  - [`yfedoseev/office_oxide`](https://github.com/yfedoseev/office_oxide) — Rust read-only extractor (reference for what's possible).
  - [`Touzen` and `loadfix`](https://github.com/loadfix/python-pptx) branches contain `iter_leaf_shapes()` style ergonomics that overlap with lazy traversal.
- **User issues this would close:** [#327](https://github.com/scanny/python-pptx/issues/327), [#461](https://github.com/scanny/python-pptx/issues/461), [#478](https://github.com/scanny/python-pptx/issues/478), [#548](https://github.com/scanny/python-pptx/issues/548), [#644](https://github.com/scanny/python-pptx/issues/644), [#732](https://github.com/scanny/python-pptx/issues/732), [#796](https://github.com/scanny/python-pptx/issues/796), [#813](https://github.com/scanny/python-pptx/issues/813).
- **POI parity:** XSLF lazy-loads slide parts via `XSLFSlide.getXmlObject()` (proof that it's possible inside an OOXML library).
- **Code paths:** `src/pptx/opc/package.py`, `src/pptx/opc/serialized.py`, `src/pptx/parts/image.py`.

## Acceptance criteria

- Opening a 200-slide media-heavy benchmark deck completes in ≤30% of current wall time and ≤50% of current peak RSS.
- Image dedup test: copying 50 slides each containing the same logo produces exactly 1 image part.
- No `ResourceWarning` raised in the test suite under `python -W error::ResourceWarning`.
- All 2986 existing pytest tests continue to pass.
- Behave scenarios cover the benchmark thresholds.
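
The image-dedup criterion can be illustrated with a content-addressed part store. This is a sketch under assumptions: `ImagePartStore` and `get_or_add` are hypothetical names, the partname scheme is simplified, and keying on a SHA-1 digest of the blob is one reasonable choice rather than a statement of how the final implementation must work.

```python
import hashlib


class ImagePartStore:
    """Hypothetical sketch: one image part per distinct blob, keyed by digest."""

    def __init__(self):
        self._by_digest = {}  # hex digest -> partname

    def get_or_add(self, blob, ext="png"):
        """Return the partname for `blob`, creating a part only if it is new."""
        digest = hashlib.sha1(blob).hexdigest()
        partname = self._by_digest.get(digest)
        if partname is None:
            partname = f"/ppt/media/image{len(self._by_digest) + 1}.{ext}"
            self._by_digest[digest] = partname
        return partname

    def __len__(self):
        return len(self._by_digest)


if __name__ == "__main__":
    store = ImagePartStore()
    logo = b"\x89PNG fake logo bytes"
    # Simulate copying 50 slides that each reference the same logo.
    names = {store.get_or_add(logo) for _ in range(50)}
    print(len(store), names)
```

Because the key is derived from the bytes rather than the source partname, the same store works whether the duplicate arrives via `add_picture` or via a slide copied from another package.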

## Effort: L

Cross-cutting, requires careful re-plumbing of `OpcPackage` parsing. Recommend benchmark-first delivery: ship the bench suite as Phase A so improvements are measurable.
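
As a starting point for the Phase A bench suite, a threshold check could look roughly like the sketch below. `measure` and `within_budget` are hypothetical helpers. Note one assumption in particular: `tracemalloc` only tracks allocations made through Python's allocator, so it is a stand-in here and not a measurement of peak RSS; the real suite would need a process-level measurement to match the acceptance criteria.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def within_budget(baseline, candidate, ratio):
    """True if candidate is at most `ratio` times the recorded baseline."""
    return candidate <= baseline * ratio


if __name__ == "__main__":
    # Stand-in workload; the real benchmark would open the 200-slide fixture.
    _, elapsed, peak = measure(lambda: [bytes(1024) for _ in range(1000)])
    print(f"elapsed={elapsed:.4f}s peak={peak} bytes")
    # Example thresholds from the acceptance criteria: 30% time, 50% memory.
    print(within_budget(10.0, 2.5, 0.30), within_budget(100.0, 60.0, 0.50))
```

Recording the baseline numbers before any lazy-loading work starts is what makes the ≤30% and ≤50% targets checkable later.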

