Skip to content

fix: flamegraph regression#14

Closed
not-matthias wants to merge 4 commits into
masterfrom
cod-2714-investigate-pytest-flamegraph-regression-3
Closed

fix: flamegraph regression#14
not-matthias wants to merge 4 commits into
masterfrom
cod-2714-investigate-pytest-flamegraph-regression-3

Conversation

@not-matthias
Copy link
Copy Markdown
Member

  • test(callgrind): C reproducer for cascading underflow obj-skip leak
  • fix(callgrind): aggregate (cxt==0) and underflow leaks under a sentinel cxt
  • fix(callgrind): check obj-skip on every BB entry, not only jk_Call
  • fix(callgrind): drop BBCCs whose top context fn is skip-flagged

Triggers the call-stack-underflow leak channel observed in the Python
case (28 underflow events / run, almost all libpython interpreter
frames). Mechanism:

- Lib runs recursive skipme_recurse(N) with instrumentation OFF, so
  callgrind never sees the calls and its csp stays at 0.
- At the deepest frame (n==0), CALLGRIND_START_INSTRUMENTATION fires.
- Each RET on the way back hits csp == 0, triggers handleUnderflow,
  resets cxt to 0, and force-pushes the fn we're returning into.
- Because that fn is in the skipped lib, it leaks as a top-level fn=
  block in the dump — N times for an N-deep recursion.

With depth=5 the diagnostic logs show 1 (cxt==0) push + 6 underflow
resets (5x skipme_recurse + 1x skipme_run), and the .out has
fn=skipme_run and fn=skipme_recurse as top-level blocks.
…el cxt

When setup_bbcc's (cxt==0) clause or handleUnderflow would force-push
a skipped fn into the current context, push a synthetic (skipped) fn
instead. The skipped fn keeps its costs (routed normally through the
sentinel cxt) but never surfaces as its own fn= block — the dump
shows a single ob=??? fl=(callgrind-internal) fn=(skipped) block
aggregating all leaked frames.

The sentinel itself has skip=False so the (cxt==0 && skip) substitution
doesn't recurse on it. Created lazily on first need via a singleton
in fn.c, attached to the anonymous '???' obj.

Verified against both C reproducers (runtime_obj_skip_c and
runtime_obj_skip_underflow): no skipme_* fn= blocks appear, totals
are preserved. Verified against a non-skipped-attribution test that
main / do_main_work still emit normally; the sentinel only engages
on the leak paths.
The obj-skip check was gated on jmpkind == jk_Call. When a function in
a skipped object was entered via jk_Jump or fall-through (interpreter
dispatch, tail calls, perf trampoline, JIT), the skip flag never
latched and the function leaked into the dump as its own fn= block.

Also instrument the cxt==0 forced push_cxt path with a diagnostic line
so we can measure the residual leak when a skipped fn is forced into
a top-level context after an instrumentation start or call-stack
underflow.
When the (cxt == 0) clause in setup_bbcc force-pushes a skipped fn —
e.g. the first BB after CALLGRIND_START_INSTRUMENTATION lives in a
skipped object — the BBCC ends up with cxt->fn[0]->skip == True.
Without filtering, those BBCCs emit a top-level fn= block and the
skipped fn leaks into the dump.

Filter them out at dump time in print_bbccs_of_thread, right before
print_fn_pos would emit the ob=/fl=/fn= header. The call edges from
non-skipped callers into skipped fns (cfn=) are unaffected because
they're emitted from the caller's BBCC, whose context is not skipped.

Also broadens the runtime_obj_skip_c post-check to grep for any
fn=skipme*, since the actual leaked fn in the repro is skipme_run
(the one calling START_INSTRUMENTATION), not skipme_func.
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 29, 2026

Merging this PR will improve performance by 13.8%

⚡ 3 improved benchmarks
✅ 53 untouched benchmarks
⏩ 4 skipped benchmarks1

Performance Changes

Benchmark BASE HEAD Efficiency
test_valgrind[valgrind.codspeed, stress-ng --cpu 1 --cpu-ops 10, inline] 5.3 s 4.7 s +13.42%
test_valgrind[valgrind.codspeed, python3 testdata/test.py, no-inline] 4.4 s 3.7 s +17.98%
test_valgrind[valgrind-3.25.1, python3 testdata/test.py, full-no-inline] 6 s 5.5 s +10.14%

Tip

Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.


Comparing cod-2714-investigate-pytest-flamegraph-regression-3 (cd1d3d4) with master (4a64bb2)

Open in CodSpeed

Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant