Optimize SAR BP float-float bin computation by tbensonatl · Pull Request #1183 · NVIDIA/MatX

tbensonatl · 2026-05-16T18:07:37Z

SAR BP kernel optimizations providing ~7% uplift for fltflt path

Apply two optimizations to the SAR BP kernel:

    - For the fltflt path, use a fused ComputeBinToPixelFloatFloat() function to compute the range bin for a pixel-pulse pair. This fused version reduces the total number of float-float normalizations while handling the possibility that all .hi components cancel to zero for the antenna and pixel position differences. Those cases will likely be relaxed for a future MCP-centered variant of the SAR backprojector where an antenna position at the mocomp point would not be realistic.
    - For all paths, convert the floating-point range bin floor to int32_t instead of index_t. This removes operations from the FP64 pipeline if index_t is 64-bit. We already implicitly required that the number of range bins <= 2^24 for correctness of the floorf(bin) calculations, but now that is explicitly documented and verified via a run-time assertion.

Refactor the SAR BP example so that it passes fltflt values for the antenna positions
and range-to-mcp values when using the fltflt precision. This prevents the kernel
from having to perform the double->fltflt conversions during its pulse block preamble.
The refactor creates a new runner function so that it can be templatized on the
position and rtm types.

Update the example viewer script to use a percentile-based contrast stretch.
Replace the example image with one generated from the newer script. This approach
generates images where the bright scatterers no longer dominate the dynamic range,
resulting in much brighter midtones.

Apply two optimizations to the SAR BP kernel: - For the fltflt path, use a looser version of fltflt_sub() to compute the component-wise pixel-to-antenna distances. By not fully normalizing the values, we save some computation at very low accuracy impact for typical SAR scenarios. - For all paths, convert the floating-point range bin floor to int32_t instead of index_t. This removes operations from the FP64 pipeline if index_t is 64-bit. We already implicitly required that the number of range bins <= 2^24 for correctness off the floorf(bin) calculations, but now that is explicitly documented and verified via a run-time assertion. Refactor the SAR BP example so that it passes fltflt values for the antenna positions and range-to-mcp values when using the fltflt precision. This prevents the kernel from having to perform the double->fltflt conversions during its pulse block preamble. The refactor creates a new runner function so that it can be templatized on the position and rtm types. Update the example viewer script to use a percentile-based contrast stretch. Replace the example image with one generated from the newer script. This approach generates images where the bright scatterers no longer dominate the dynamic range, resulting in much brighter midtones. Signed-off-by: Thomas Benson <tbenson@nvidia.com>

an antenna position and pixel cancel. Change to a fully fused SAR BP specialized range-to-bin fltflt function. This removes normalization when possible and exploits other known invariants of the geometry. Overall, the impact of the PR is a 10% uplift in fltflt performance on RTX PRO 6000. Also loosen the 2^24 range bin restriction for ComputeType==Double as it does not apply there. Signed-off-by: Thomas Benson <tbenson@nvidia.com>

… rsqrtf(). This reduces our gain from >10% to 7%. In the future, we will add a fast-path for cases where e.g. we know that the antenna z is far above the pixel, in which case this extra normalization would not be required. Signed-off-by: Thomas Benson <tbenson@nvidia.com>

copy-pr-bot · 2026-05-16T18:07:41Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-16T18:22:22Z

Greptile Summary

This PR delivers two focused optimizations to the SAR backprojection kernel and refactors the example driver to support them. The fltflt bin-computation path is accelerated by ~7% via a new fused ComputeBinToPixelFloatFloat helper that avoids materializing a canonical float-float range R, saving the trailing fast_two_sum normalization inside sqrt_fast. All compute paths benefit from narrowing the range-bin floor index from index_t (64-bit on most targets) to int32_t, removing dead FP64 pipeline operations.

New fused helper ComputeBinToPixelFloatFloat in sar_bp.cuh folds coordinate-diff, squared-norm, Newton-Raphson sqrt, and fma_approx into one device function, with explicit lo×lo term retention to handle fp32 cancellation when the antenna is sub-ULP-close to a pixel.
int32_t bin index replaces index_t throughout the kernel's inner loop; runtime assertions in sar_bp.h enforce the new 2²⁴ cap for fp32 paths and the INT32_MAX cap for all paths, verified by two new TYPED_TEST cases.
Example refactored into a templated run_bp_device helper that accepts either tensor<double3> or tensor<fltflt> position tensors; the fltflt host buffers are converted in-place before upload using memcpy-based type-punning (fixing the strict-aliasing UB flagged in prior review).

Confidence Score: 5/5

The changes are targeted optimizations to a well-tested kernel with new runtime guards and tests; no regressions were identified.

The fused ComputeBinToPixelFloatFloat function handles all documented corner cases (fp32 cancellation, sum_hi=0) via the renormalization step, and the math is internally consistent. The int32_t narrowing of bin indices is protected by new launch-time assertions verified by two dedicated tests. The example refactoring correctly uses memcpy-based type-punning for the in-place fltflt conversion, the previously reported h_image leak on file-open failure is fixed, and the Python viewer's percentile validation is correct.

No files require special attention. The most complex change (ComputeBinToPixelFloatFloat in sar_bp.cuh) is well-commented and mathematically sound.

Important Files Changed

Filename	Overview
include/matx/kernels/sar_bp.cuh	Replaces ComputeRangeToPixelFloatFloat with the fused ComputeBinToPixelFloatFloat (fuses 4 computation stages, retains lo×lo terms for cancellation corner-case, adds renormalize before sqrt). Narrows bin_floor_int and num_range_bins from index_t to int32_t throughout the inner loop.
include/matx/transforms/sar_bp.h	Adds two new launch-time range-bin limit assertions: INT32_MAX for all paths and 2^24 for fp32 paths, with clear error messages. Logic is correct and well-documented.
examples/sarbp/sarbp.cu	Introduces BpRunCtx aggregate and templated run_bp_device to share code between double and fltflt paths. In-place fltflt conversion uses memcpy-based type-punning (correct, standards-compliant). Previously reported h_image leak on file-open failure is now fixed.
test/00_transform/SarBp.cu	Two new boundary tests (RangeBinsLimitFp32Path, RangeBinsLimitDoublePath) correctly verify the 2^24 inclusive upper bound for fp32 paths and the exemption for Double compute type; use generator tensors to avoid large allocations.
examples/sarbp/view_sarbp_image.py	Adds percentile-based contrast stretch as the new default; --dynamic-range becomes an override. Input validation (0 <= LO < HI <= 100) is present and correct.
include/matx/operators/sar_bp.h	Documentation-only addition: documents the 2^24 fp32 bin-count constraint and the int32_t indexing limit for all compute types.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ComputeBinToPixelFloatFloat(apx,apy,apz, px,py,pz, dr_inv, mcp_partial)"]
    A --> B["Stage 1: Loose coordinate differences\ndx = fltflt_two_sum(px, -apx.hi)\ndx.lo -= apx.lo  (same for dy, dz)"]
    B --> C["Stage 2: Squared-norm with lo×lo retention\npx2 = two_prod_fma(dx.hi, dx.hi)\nsum_lo += 2·dx.hi·dx.lo + dx.lo² (×3 dims)"]
    C --> D["Renormalize: fltflt_fast_two_sum(sum_hi, sum_lo)\n(handles sum_hi=0 cancellation corner-case;\nsum_lo carries full squared distance)"]
    D --> E["Stage 3: Newton-Raphson sqrt (no trailing two_sum)\nxn = rsqrt(sum_sq.hi)\nyn ≈ sqrt(sum_sq.hi)\nr_lo = Newton correction"]
    E --> F["Stage 4: Fused bin = fltflt_fma_approx({yn,r_lo}, dr_inv, mcp_partial)\n≡ (R - mcp)·dr_inv + bin_offset"]
    F --> G["Return fltflt bin"]

    style D fill:#fffde7,stroke:#f9a825
    style A fill:#e3f2fd,stroke:#1565c0
    style G fill:#e8f5e9,stroke:#2e7d32

_{Reviews (2): Last reviewed commit: "Address greptile feedback for SAR BP exa..." | Re-trigger Greptile}

Fix host-pinned memory allocation leak during error path and use std::memcpy for double->fltflt in-place conversions to avoid undefined behavior under C++ strict aliasing rules. Signed-off-by: Thomas Benson <tbenson@nvidia.com>

tbensonatl · 2026-05-16T18:37:45Z

/build

tbensonatl added 3 commits May 15, 2026 17:33

tbensonatl self-assigned this May 16, 2026

greptile-apps Bot reviewed May 16, 2026

View reviewed changes

Comment thread examples/sarbp/sarbp.cu

Comment thread examples/sarbp/sarbp.cu

Address greptile feedback for SAR BP example

2feb5bc

Fix host-pinned memory allocation leak during error path and use std::memcpy for double->fltflt in-place conversions to avoid undefined behavior under C++ strict aliasing rules. Signed-off-by: Thomas Benson <tbenson@nvidia.com>

tbensonatl requested a review from cliffburdick May 16, 2026 18:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize SAR BP float-float bin computation#1183

Optimize SAR BP float-float bin computation#1183
tbensonatl wants to merge 4 commits into
mainfrom
tbenson/sarbp-fltflt-simplify-coord-difference

tbensonatl commented May 16, 2026

Uh oh!

copy-pr-bot Bot commented May 16, 2026

Uh oh!

greptile-apps Bot commented May 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

tbensonatl commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tbensonatl commented May 16, 2026

Uh oh!

copy-pr-bot Bot commented May 16, 2026

Uh oh!

greptile-apps Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

tbensonatl commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented May 16, 2026 •

edited

Loading