Skip to content

Optimize SAR BP float-float bin computation#1183

Open
tbensonatl wants to merge 4 commits into
mainfrom
tbenson/sarbp-fltflt-simplify-coord-difference
Open

Optimize SAR BP float-float bin computation#1183
tbensonatl wants to merge 4 commits into
mainfrom
tbenson/sarbp-fltflt-simplify-coord-difference

Conversation

@tbensonatl
Copy link
Copy Markdown
Collaborator

SAR BP kernel optimizations providing ~7% uplift for fltflt path

Apply two optimizations to the SAR BP kernel:

    - For the fltflt path, use a fused ComputeBinToPixelFloatFloat() function to compute the range bin for a pixel-pulse pair. This fused version reduces the total number of float-float normalizations while handling the possibility that all .hi components cancel to zero for the antenna and pixel position differences. Those cases will likely be relaxed for a future MCP-centered variant of the SAR backprojector where an antenna position at the mocomp point would not be realistic.
    - For all paths, convert the floating-point range bin floor to int32_t instead of index_t. This removes operations from the FP64 pipeline if index_t is 64-bit. We already implicitly required that the number of range bins <= 2^24 for correctness of the floorf(bin) calculations, but now that is explicitly documented and verified via a run-time assertion.

Refactor the SAR BP example so that it passes fltflt values for the antenna positions
and range-to-mcp values when using the fltflt precision. This prevents the kernel
from having to perform the double->fltflt conversions during its pulse block preamble.
The refactor creates a new runner function so that it can be templatized on the
position and rtm types.

Update the example viewer script to use a percentile-based contrast stretch.
Replace the example image with one generated from the newer script. This approach
generates images where the bright scatterers no longer dominate the dynamic range,
resulting in much brighter midtones.

Apply two optimizations to the SAR BP kernel:

    - For the fltflt path, use a looser version of fltflt_sub() to compute the
      component-wise pixel-to-antenna distances. By not fully normalizing the values,
      we save some computation at very low accuracy impact for typical SAR scenarios.
    - For all paths, convert the floating-point range bin floor to int32_t instead of
      index_t. This removes operations from the FP64 pipeline if index_t is 64-bit.
      We already implicitly required that the number of range bins <= 2^24 for correctness
      off the floorf(bin) calculations, but now that is explicitly documented and
      verified via a run-time assertion.

Refactor the SAR BP example so that it passes fltflt values for the antenna positions
and range-to-mcp values when using the fltflt precision. This prevents the kernel
from having to perform the double->fltflt conversions during its pulse block preamble.
The refactor creates a new runner function so that it can be templatized on the
position and rtm types.

Update the example viewer script to use a percentile-based contrast stretch.
Replace the example image with one generated from the newer script. This approach
generates images where the bright scatterers no longer dominate the dynamic range,
resulting in much brighter midtones.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
an antenna position and pixel cancel. Change to a fully fused SAR BP specialized
range-to-bin fltflt function. This removes normalization when possible and
exploits other known invariants of the geometry. Overall, the impact of the PR
is a 10% uplift in fltflt performance on RTX PRO 6000.

Also loosen the 2^24 range bin restriction for ComputeType==Double as it does
not apply there.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
… rsqrtf(). This

reduces our gain from >10% to 7%. In the future, we will add a fast-path for cases
where e.g. we know that the antenna z is far above the pixel, in which case this
extra normalization would not be required.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl tbensonatl self-assigned this May 16, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 16, 2026

Greptile Summary

This PR delivers two focused optimizations to the SAR backprojection kernel and refactors the example driver to support them. The fltflt bin-computation path is accelerated by ~7% via a new fused ComputeBinToPixelFloatFloat helper that avoids materializing a canonical float-float range R, saving the trailing fast_two_sum normalization inside sqrt_fast. All compute paths benefit from narrowing the range-bin floor index from index_t (64-bit on most targets) to int32_t, removing dead FP64 pipeline operations.

  • New fused helper ComputeBinToPixelFloatFloat in sar_bp.cuh folds coordinate-diff, squared-norm, Newton-Raphson sqrt, and fma_approx into one device function, with explicit lo×lo term retention to handle fp32 cancellation when the antenna is sub-ULP-close to a pixel.
  • int32_t bin index replaces index_t throughout the kernel's inner loop; runtime assertions in sar_bp.h enforce the new 2²⁴ cap for fp32 paths and the INT32_MAX cap for all paths, verified by two new TYPED_TEST cases.
  • Example refactored into a templated run_bp_device helper that accepts either tensor<double3> or tensor<fltflt> position tensors; the fltflt host buffers are converted in-place before upload using memcpy-based type-punning (fixing the strict-aliasing UB flagged in prior review).

Confidence Score: 5/5

The changes are targeted optimizations to a well-tested kernel with new runtime guards and tests; no regressions were identified.

The fused ComputeBinToPixelFloatFloat function handles all documented corner cases (fp32 cancellation, sum_hi=0) via the renormalization step, and the math is internally consistent. The int32_t narrowing of bin indices is protected by new launch-time assertions verified by two dedicated tests. The example refactoring correctly uses memcpy-based type-punning for the in-place fltflt conversion, the previously reported h_image leak on file-open failure is fixed, and the Python viewer's percentile validation is correct.

No files require special attention. The most complex change (ComputeBinToPixelFloatFloat in sar_bp.cuh) is well-commented and mathematically sound.

Important Files Changed

Filename Overview
include/matx/kernels/sar_bp.cuh Replaces ComputeRangeToPixelFloatFloat with the fused ComputeBinToPixelFloatFloat (fuses 4 computation stages, retains lo×lo terms for cancellation corner-case, adds renormalize before sqrt). Narrows bin_floor_int and num_range_bins from index_t to int32_t throughout the inner loop.
include/matx/transforms/sar_bp.h Adds two new launch-time range-bin limit assertions: INT32_MAX for all paths and 2^24 for fp32 paths, with clear error messages. Logic is correct and well-documented.
examples/sarbp/sarbp.cu Introduces BpRunCtx aggregate and templated run_bp_device to share code between double and fltflt paths. In-place fltflt conversion uses memcpy-based type-punning (correct, standards-compliant). Previously reported h_image leak on file-open failure is now fixed.
test/00_transform/SarBp.cu Two new boundary tests (RangeBinsLimitFp32Path, RangeBinsLimitDoublePath) correctly verify the 2^24 inclusive upper bound for fp32 paths and the exemption for Double compute type; use generator tensors to avoid large allocations.
examples/sarbp/view_sarbp_image.py Adds percentile-based contrast stretch as the new default; --dynamic-range becomes an override. Input validation (0 <= LO < HI <= 100) is present and correct.
include/matx/operators/sar_bp.h Documentation-only addition: documents the 2^24 fp32 bin-count constraint and the int32_t indexing limit for all compute types.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ComputeBinToPixelFloatFloat(apx,apy,apz, px,py,pz, dr_inv, mcp_partial)"]
    A --> B["Stage 1: Loose coordinate differences\ndx = fltflt_two_sum(px, -apx.hi)\ndx.lo -= apx.lo  (same for dy, dz)"]
    B --> C["Stage 2: Squared-norm with lo×lo retention\npx2 = two_prod_fma(dx.hi, dx.hi)\nsum_lo += 2·dx.hi·dx.lo + dx.lo² (×3 dims)"]
    C --> D["Renormalize: fltflt_fast_two_sum(sum_hi, sum_lo)\n(handles sum_hi=0 cancellation corner-case;\nsum_lo carries full squared distance)"]
    D --> E["Stage 3: Newton-Raphson sqrt (no trailing two_sum)\nxn = rsqrt(sum_sq.hi)\nyn ≈ sqrt(sum_sq.hi)\nr_lo = Newton correction"]
    E --> F["Stage 4: Fused bin = fltflt_fma_approx({yn,r_lo}, dr_inv, mcp_partial)\n≡ (R - mcp)·dr_inv + bin_offset"]
    F --> G["Return fltflt bin"]

    style D fill:#fffde7,stroke:#f9a825
    style A fill:#e3f2fd,stroke:#1565c0
    style G fill:#e8f5e9,stroke:#2e7d32
Loading

Reviews (2): Last reviewed commit: "Address greptile feedback for SAR BP exa..." | Re-trigger Greptile

Comment thread examples/sarbp/sarbp.cu
Comment thread examples/sarbp/sarbp.cu
Fix host-pinned memory allocation leak during error path and use std::memcpy
for double->fltflt in-place conversions to avoid undefined behavior under
C++ strict aliasing rules.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl
Copy link
Copy Markdown
Collaborator Author

/build

@tbensonatl tbensonatl requested a review from cliffburdick May 16, 2026 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant