Optimize SAR BP float-float bin computation#1183
Conversation
Apply two optimizations to the SAR BP kernel:
- For the fltflt path, use a looser version of fltflt_sub() to compute the
component-wise pixel-to-antenna distances. By not fully normalizing the values,
we save some computation at very low accuracy impact for typical SAR scenarios.
- For all paths, convert the floating-point range bin floor to int32_t instead of
index_t. This removes operations from the FP64 pipeline if index_t is 64-bit.
We already implicitly required that the number of range bins <= 2^24 for correctness
off the floorf(bin) calculations, but now that is explicitly documented and
verified via a run-time assertion.
Refactor the SAR BP example so that it passes fltflt values for the antenna positions
and range-to-mcp values when using the fltflt precision. This prevents the kernel
from having to perform the double->fltflt conversions during its pulse block preamble.
The refactor creates a new runner function so that it can be templatized on the
position and rtm types.
Update the example viewer script to use a percentile-based contrast stretch.
Replace the example image with one generated from the newer script. This approach
generates images where the bright scatterers no longer dominate the dynamic range,
resulting in much brighter midtones.
Signed-off-by: Thomas Benson <tbenson@nvidia.com>
an antenna position and pixel cancel. Change to a fully fused SAR BP specialized range-to-bin fltflt function. This removes normalization when possible and exploits other known invariants of the geometry. Overall, the impact of the PR is a 10% uplift in fltflt performance on RTX PRO 6000. Also loosen the 2^24 range bin restriction for ComputeType==Double as it does not apply there. Signed-off-by: Thomas Benson <tbenson@nvidia.com>
… rsqrtf(). This reduces our gain from >10% to 7%. In the future, we will add a fast-path for cases where e.g. we know that the antenna z is far above the pixel, in which case this extra normalization would not be required. Signed-off-by: Thomas Benson <tbenson@nvidia.com>
Greptile SummaryThis PR delivers two focused optimizations to the SAR backprojection kernel and refactors the example driver to support them. The fltflt bin-computation path is accelerated by ~7% via a new fused
Confidence Score: 5/5The changes are targeted optimizations to a well-tested kernel with new runtime guards and tests; no regressions were identified. The fused ComputeBinToPixelFloatFloat function handles all documented corner cases (fp32 cancellation, sum_hi=0) via the renormalization step, and the math is internally consistent. The int32_t narrowing of bin indices is protected by new launch-time assertions verified by two dedicated tests. The example refactoring correctly uses memcpy-based type-punning for the in-place fltflt conversion, the previously reported h_image leak on file-open failure is fixed, and the Python viewer's percentile validation is correct. No files require special attention. The most complex change (ComputeBinToPixelFloatFloat in sar_bp.cuh) is well-commented and mathematically sound. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["ComputeBinToPixelFloatFloat(apx,apy,apz, px,py,pz, dr_inv, mcp_partial)"]
A --> B["Stage 1: Loose coordinate differences\ndx = fltflt_two_sum(px, -apx.hi)\ndx.lo -= apx.lo (same for dy, dz)"]
B --> C["Stage 2: Squared-norm with lo×lo retention\npx2 = two_prod_fma(dx.hi, dx.hi)\nsum_lo += 2·dx.hi·dx.lo + dx.lo² (×3 dims)"]
C --> D["Renormalize: fltflt_fast_two_sum(sum_hi, sum_lo)\n(handles sum_hi=0 cancellation corner-case;\nsum_lo carries full squared distance)"]
D --> E["Stage 3: Newton-Raphson sqrt (no trailing two_sum)\nxn = rsqrt(sum_sq.hi)\nyn ≈ sqrt(sum_sq.hi)\nr_lo = Newton correction"]
E --> F["Stage 4: Fused bin = fltflt_fma_approx({yn,r_lo}, dr_inv, mcp_partial)\n≡ (R - mcp)·dr_inv + bin_offset"]
F --> G["Return fltflt bin"]
style D fill:#fffde7,stroke:#f9a825
style A fill:#e3f2fd,stroke:#1565c0
style G fill:#e8f5e9,stroke:#2e7d32
Reviews (2): Last reviewed commit: "Address greptile feedback for SAR BP exa..." | Re-trigger Greptile |
Fix host-pinned memory allocation leak during error path and use std::memcpy for double->fltflt in-place conversions to avoid undefined behavior under C++ strict aliasing rules. Signed-off-by: Thomas Benson <tbenson@nvidia.com>
|
/build |
SAR BP kernel optimizations providing ~7% uplift for fltflt path
Apply two optimizations to the SAR BP kernel:
Refactor the SAR BP example so that it passes fltflt values for the antenna positions
and range-to-mcp values when using the fltflt precision. This prevents the kernel
from having to perform the double->fltflt conversions during its pulse block preamble.
The refactor creates a new runner function so that it can be templatized on the
position and rtm types.
Update the example viewer script to use a percentile-based contrast stretch.
Replace the example image with one generated from the newer script. This approach
generates images where the bright scatterers no longer dominate the dynamic range,
resulting in much brighter midtones.