Skip to content

fix: flamegraph regressions#13

Closed
not-matthias wants to merge 6 commits into
masterfrom
cod-2714-investigate-pytest-flamegraph-regression-2
Closed

fix: flamegraph regressions#13
not-matthias wants to merge 6 commits into
masterfrom
cod-2714-investigate-pytest-flamegraph-regression-2

Conversation

@not-matthias
Copy link
Copy Markdown
Member

  • fix(callgrind): drop BBCCs whose top context fn is skip-flagged
  • test(callgrind): add C reproducer for runtime obj-skip cxt==0 leak
  • chore(callgrind): log underflow resets and instrument_state transitions
  • fix(callgrind): discard cost and drop BBCCs from underflow events

When the (cxt == 0) clause in setup_bbcc force-pushes a skipped fn —
e.g. the first BB after CALLGRIND_START_INSTRUMENTATION lives in a
skipped object — the BBCC ends up with cxt->fn[0]->skip == True.
Without filtering, those BBCCs emit a top-level fn= block and the
skipped fn leaks into the dump.

Filter them out at dump time in print_bbccs_of_thread, right before
print_fn_pos would emit the ob=/fl=/fn= header. The call edges from
non-skipped callers into skipped fns (cfn=) are unaffected because
they're emitted from the caller's BBCC, whose context is not skipped.

Also broadens the runtime_obj_skip_c post-check to grep for any
fn=skipme*, since the actual leaked fn in the repro is skipme_run
(the one calling START_INSTRUMENTATION), not skipme_func.
Minimal C test that triggers the leak path where a function in a
skipped object becomes a top-level fn= block in the dump.

The trigger: the lib calls CALLGRIND_START_INSTRUMENTATION from
inside one of its own functions, so the first BB callgrind sees
post-start lives in the skipped object. With cxt == 0 at that point,
setup_bbcc force-pushes the skipped fn as the new top context and it
leaks into the output.

Currently RED: post-check fails because fn=skipme_run appears in the
output despite skipme_run's containing .so being on the skip list.
Two diagnostic logs to attribute every (cxt==0) push_cxt event to its
root cause:

1. handleUnderflow: prints the BB address, fn name, obj, and skip flag
   for the fn that's about to be force-pushed after a call-stack
   underflow. Tells us whether leaks are driven by signals, longjmp,
   or shadow-stack quality.

2. set_instrument_state: logs every ON/OFF transition with the cxt
   and fn-stack depth at that moment. Distinguishes 'fresh start with
   no cxt' from 'stop/start cycle that preserved cxt'.

Together with the existing 'push_cxt FORCED' log, these let us bucket
the leaked frames in real Python runs by their root cause.
When a RET happens with no matching CALL on Callgrind's shadow stack
(typical when instrumentation starts mid-run), handleUnderflow
synthesizes a phantom caller frame at return-addr - 1, attributes
all cost since the last dump to it via copy_cost(lastdump_cost), and
pushes a fake call_entry. The resulting frame is fabricated:

  - the caller fn is a guess based on an address that may not even
    be the start of a real BB,
  - the attributed cost is whatever happened to accumulate since the
    last dump, not anything actually spent in that fn,
  - the fn may resolve to an --obj-skip target (e.g. CPython's eval
    frame) which we explicitly do not want in the output.

Discard cost and drop the phantom from the dump:

  - flag the synthesized source_bbcc with from_underflow=True,
  - zero the call_entry enter_cost instead of copying lastdump_cost,
  - leave source_bbcc->ecounter_sum at 0 (no execution counted),
  - extend the print_bbccs_of_thread filter (added in the
    skip-fn-drop fix) to also skip from_underflow BBCCs.

The structural side-effects of handleUnderflow (cxt push, call
stack push, active-count adjustment) are kept intact so subsequent
RETs continue to balance correctly.
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 28, 2026

Merging this PR will improve performance by 14.39%

⚡ 3 improved benchmarks
✅ 57 untouched benchmarks

Performance Changes

Benchmark BASE HEAD Efficiency
test_valgrind[valgrind.codspeed, python3 testdata/test.py, no-inline] 4.4 s 3.7 s +18.73%
test_valgrind[valgrind.codspeed, stress-ng --cpu 1 --cpu-ops 10, inline] 5.3 s 4.6 s +14.21%
test_valgrind[valgrind-3.25.1, python3 testdata/test.py, full-no-inline] 6 s 5.5 s +10.37%

Tip

Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.


Comparing cod-2714-investigate-pytest-flamegraph-regression-2 (7b1886a) with master (4a64bb2)

Open in CodSpeed

The --obj-skip -> fn->skip propagation was only happening in setup_bbcc
on a jk_Call (when callgrind observes a CALL into the fn). When
instrumentation turns on mid-function (e.g. CALLGRIND_START_INSTRUMENTATION
fired from inside a skipped object, or an underflow synthesizes a new
top context), setup_bbcc takes the cxt==0 push path and force-pushes
the fn without ever consulting --obj-skip — so fn->skip stays False
and the skipped object leaks into the dump.

Extract the obj-skip resolution into resolve_obj_skip() and call it
on both paths. obj_skip_checked still short-circuits the lookup so the
hot path cost is unchanged.

This makes the BBCC drop filter from "drop BBCCs whose top context fn
is skip-flagged" actually catch the runtime_obj_skip_c repro: skipme_*
now gets fn->skip set on the force-push and is filtered at dump time.
dladdr((void*)&skipme_run) from the main binary returns the main
binary's path, not the lib's: &skipme_run there is the PLT stub
address in the main binary's text. So CALLGRIND_ADD_OBJ_SKIP was
registering the wrong object and the test never exercised the
runtime obj-skip leak it was supposed to repro.

Move the path resolution into the lib via skipme_self_path(), which
calls dladdr on its own function pointer. Add -ldl to the lib's
link line.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant