
Improve polygonize performance: single-pass tracing, JIT merge helpers, batch shapely #1008

@brendancol

Description


Author of Proposal: @brendan

Reason or Problem

_follow traces every polygon boundary twice -- once to count vertices, then again to fill the coordinate array. This doubles boundary-tracing cost, which is already the dominant phase for rasters with many small regions.

The dask chunk-merge path runs _point_in_ring, _simplify_ring, and _signed_ring_area as plain Python loops. On chunked rasters with many boundary-crossing polygons, these show up in profiles.

_to_geopandas builds Polygon objects one at a time. Shapely 2.0 has a batch polygons() function (compiled C) that does this faster.

Proposal

Design:

No API changes -- these are all internal.

Rewrite _follow to trace in a single pass using a dynamically-grown numpy buffer. Start at 64 points, double the capacity when full, trim after the loop. The O(log n) reallocations (amortized O(n) copying) cost far less than a full second traversal of the boundary.

Add @ngjit to _point_in_ring, _simplify_ring, and _signed_ring_area. These are numeric loops over numpy arrays with no Python objects, so numba compiles them without changes.

In _to_geopandas, batch-construct hole-free polygons with shapely.polygons(). Polygons with holes keep using the scalar constructor. Shapely < 2.0 gets the current code path via a version check.

Usage:

No user-facing changes. Existing polygonize() calls get faster.

Value:

The tracing rewrite speeds up all four backends. The JIT merge helpers mainly help dask, where boundary merging can consume a significant fraction of total time. The batch shapely path helps geopandas output.

Stakeholders and Impacts

Internal only. The existing benchmark suite (benchmarks/benchmarks/polygonize.py) covers before/after measurement.

Drawbacks

The single-pass buffer growth in _follow is harder to read than the current two-pass logic. The shapely batch path needs shapely >= 2.0 (released Dec 2022), gated by a runtime version check.

Alternatives

Vertex counts could be pre-computed from region perimeters, but that is itself another pass over the data. Numba typed lists instead of buffer doubling would have worse per-append overhead.

Unresolved Questions

None.

Metadata


Labels

    enhancement (New feature or request), performance (PR touches performance-sensitive code)
