Skip to content

fix: flamegraph regressions#12

Closed
not-matthias wants to merge 20 commits into
masterfrom
cod-2714-investigate-pytest-flamegraph-regression
Closed

fix: flamegraph regressions#12
not-matthias wants to merge 20 commits into
masterfrom
cod-2714-investigate-pytest-flamegraph-regression

Conversation

@not-matthias
Copy link
Copy Markdown
Member

No description provided.

Adds CLG_(dump_python_fn_summary) and calls it from dump_profile. For
each fn_node whose obj name contains 'python', prints the fn name, obj
path, skip flag, and obj_skip_checked flag, with a final tally. Lets us
see, per dump, whether interpreter fns ever had their skip check run.
Fires once per fn_node, before any call edge is built. Lets us see the
exact obj path string Callgrind stores for a function (e.g. PyEval) so
we can compare it against the obj_skip list independently of any
strcmp result.
The obj-skip check was gated on jmpkind == jk_Call. When a function in
a skipped object was entered via jk_Jump or fall-through (interpreter
dispatch, tail calls, perf trampoline, JIT), the skip flag never
latched and the function leaked into the dump as its own fn= block.

Also instrument the cxt==0 forced push_cxt path with a diagnostic line
so we can measure the residual leak when a skipped fn is forced into
a top-level context after an instrumentation start or call-stack
underflow.
Minimal C test that triggers the leak path where a function in a
skipped object becomes a top-level fn= block in the dump.

The trigger: the lib calls CALLGRIND_START_INSTRUMENTATION from
inside one of its own functions, so the first BB callgrind sees
post-start lives in the skipped object. With cxt == 0 at that point,
setup_bbcc force-pushes the skipped fn as the new top context and it
leaks into the output.

Currently RED: post-check fails because fn=skipme_run appears in the
output despite skipme_run's containing .so being on the skip list.
When the (cxt == 0) clause in setup_bbcc force-pushes a skipped fn —
e.g. the first BB after CALLGRIND_START_INSTRUMENTATION lives in a
skipped object — the BBCC ends up with cxt->fn[0]->skip == True.
Without filtering, those BBCCs emit a top-level fn= block and the
skipped fn leaks into the dump.

Filter them out at dump time in print_bbccs_of_thread, right before
print_fn_pos would emit the ob=/fl=/fn= header. The call edges from
non-skipped callers into skipped fns (cfn=) are unaffected because
they're emitted from the caller's BBCC, whose context is not skipped.

Also broadens the runtime_obj_skip_c post-check to grep for any
fn=skipme*, since the actual leaked fn in the repro is skipme_run
(the one calling START_INSTRUMENTATION), not skipme_func.
Two diagnostic logs to attribute every (cxt==0) push_cxt event to its
root cause:

1. handleUnderflow: prints the BB address, fn name, obj, and skip flag
   for the fn that's about to be force-pushed after a call-stack
   underflow. Tells us whether leaks are driven by signals, longjmp,
   or shadow-stack quality.

2. set_instrument_state: logs every ON/OFF transition with the cxt
   and fn-stack depth at that moment. Distinguishes 'fresh start with
   no cxt' from 'stop/start cycle that preserved cxt'.

Together with the existing 'push_cxt FORCED' log, these let us bucket
the leaked frames in real Python runs by their root cause.
Triggers the call-stack-underflow leak channel observed in the Python
case (28 underflow events / run, almost all libpython interpreter
frames). Mechanism:

- Lib runs recursive skipme_recurse(N) with instrumentation OFF, so
  callgrind never sees the calls and its csp stays at 0.
- At the deepest frame (n==0), CALLGRIND_START_INSTRUMENTATION fires.
- Each RET on the way back hits csp == 0, triggers handleUnderflow,
  resets cxt to 0, and force-pushes the fn we're returning into.
- Because that fn is in the skipped lib, it leaks as a top-level fn=
  block in the dump — N times for an N-deep recursion.

With depth=5 the diagnostic logs show 1 (cxt==0) push + 6 underflow
resets (5x skipme_recurse + 1x skipme_run), and the .out has
fn=skipme_run and fn=skipme_recurse as top-level blocks.
…el cxt

When setup_bbcc's (cxt==0) clause or handleUnderflow would force-push
a skipped fn into the current context, push a synthetic (skipped) fn
instead. The skipped fn keeps its costs (routed normally through the
sentinel cxt) but never surfaces as its own fn= block — the dump
shows a single ob=??? fl=(callgrind-internal) fn=(skipped) block
aggregating all leaked frames.

The sentinel itself has skip=False so the (cxt==0 && skip) substitution
doesn't recurse on it. Created lazily on first need via a singleton
in fn.c, attached to the anonymous '???' obj.

Verified against both C reproducers (runtime_obj_skip_c and
runtime_obj_skip_underflow): no skipme_* fn= blocks appear, totals
are preserved. Verified against a non-skipped-attribution test that
main / do_main_work still emit normally; the sentinel only engages
on the leak paths.
When CALLGRIND_START_INSTRUMENTATION fires mid-stack (typical for
pytest-codspeed: Python reaches the macro several libpython frames
deep), callgrind's csp is 0 but the real stack is non-empty. Every
subsequent client `ret` peels a frame callgrind never saw the matching
`call` for, trips handleUnderflow, and leaks the returned-into fn as a
top-level fn= block — polluting the flamegraph with phantom roots like
_PyEval_EvalFrameDefault, PyObject_Vectorcall, etc.

Reconstruct the shadow stack from VG_(get_StackTrace) on the OFF->ON
transition. For each native frame: push a (jcc=0, skip-style)
call_entry with the captured SP and ret_addr=caller_ip+1. For
non-skipped caller frames, synthesize a zero-instruction BBCC tagged
with that frame's cxt so obj-skip's `nonskipped` mechanism has a
target to fold skipped-subtree costs into.

Anonymous IPs (Python JIT regions, CRT glue) resolve via the existing
`???` obj path in get_obj_node, so no special trimming is needed.

Un-static new_recursion and insert_bbcc_into_hash (renamed to
CLG_-prefixed) so callstack.c can reuse them.
…-checks

- runtime_obj_skip_c.vgtest: post-check now greps for any fn=skipme_*.
  The previous check only looked for skipme_func, which obj-skip
  folded into the caller so the test passed even with the bug live.
- runtime_obj_skip_underflow: wire .vgtest/.post.exp/.stderr.exp +
  Makefile.am entries (was only checked-in as bare .c sources).
- runtime_obj_skip_py314.vgtest: same shim driver as the 3.13 test,
  exercised with python3.14 (tail-call interpreter). Skipped by
  prereq if python3.14 isn't on PATH.
- filter_stderr: drop diagnostic logging lines added during this
  investigation (instrument_state, obj_skip HIT/miss, underflow reset,
  reconstruct_call_stack, python fn summary, fn=...obj=... summary)
  so .stderr.exp matching stays stable.

All four reproducers go RED before the fix, GREEN after.
@not-matthias not-matthias force-pushed the cod-2714-investigate-pytest-flamegraph-regression branch from 75ab97d to dfe44cc Compare May 27, 2026 15:57
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 27, 2026

Merging this PR will not alter performance

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 54 untouched benchmarks
⏩ 4 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark BASE HEAD Efficiency
test_valgrind[valgrind.codspeed, python3 testdata/test.py, no-inline] 4.4 s 3.9 s +12.71%
test_valgrind[valgrind.codspeed, stress-ng --cpu 1 --cpu-ops 10, no-inline] 4.3 s 5 s -13.49%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing cod-2714-investigate-pytest-flamegraph-regression (72e5f60) with master (4a64bb2)

Open in CodSpeed

Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

…eeded

Move -l:runtime_obj_skip_*_lib.so from LDFLAGS to LDADD so the shared
lib appears after the .o files on the link line. On Ubuntu 24.04, ld
defaults to --as-needed and was dropping the lib before skipme_run was
referenced, causing an undefined-reference link error in CI.
dash's 'command -v' returns 127 when the program is missing, which
vg_regtest treats as a fatal abort (it only accepts 0=run or 1=skip).
Append '|| exit 1' so the prereq cleanly skips on hosts without
python3.14 (e.g. Ubuntu 22.04 CI runners).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes runtime obj-skip flamegraph regressions in callgrind by improving skip detection beyond direct calls, reconstructing callgrind’s shadow call stack when instrumentation starts mid-stack, and adding regression coverage for C and Python reproducers.

Changes:

  • Applies obj-skip checks on every BB entry instead of only jk_Call.
  • Adds native stack reconstruction on CALLGRIND_START_INSTRUMENTATION.
  • Adds regression tests for skipped-object leaks, underflow behavior, and Python 3.14.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
callgrind/bbcc.c Extends obj-skip latching and exposes BBCC helpers for reconstruction.
callgrind/callstack.c Adds native stack reconstruction for START instrumentation.
callgrind/clo.c Adds diagnostic output when registering skipped objects.
callgrind/dump.c Adds Python function diagnostic summary on profile dumps.
callgrind/fn.c Adds function lookup/count helpers and diagnostic logging.
callgrind/global.h Declares new helper APIs used across callgrind files.
callgrind/main.c Calls reconstruction after START instrumentation.
callgrind/tests/Makefile.am Builds and distributes new runtime obj-skip regression tests.
callgrind/tests/filter_stderr Filters new diagnostic output from test stderr.
callgrind/tests/runtime_obj_skip_c.c Adds minimal C driver for skipped-object START regression.
callgrind/tests/runtime_obj_skip_c_lib.c Adds skipped shared library fixture for C regression.
callgrind/tests/runtime_obj_skip_c.vgtest Adds C runtime obj-skip regression test.
callgrind/tests/runtime_obj_skip_c.stderr.exp Adds expected stderr for C regression test.
callgrind/tests/runtime_obj_skip_c.post.exp Adds expected post output for C regression test.
callgrind/tests/runtime_obj_skip_underflow.c Adds C driver for underflow-channel regression.
callgrind/tests/runtime_obj_skip_underflow_lib.c Adds recursive skipped-library fixture.
callgrind/tests/runtime_obj_skip_underflow.vgtest Adds underflow runtime obj-skip regression test.
callgrind/tests/runtime_obj_skip_underflow.stderr.exp Adds expected stderr for underflow test.
callgrind/tests/runtime_obj_skip_underflow.post.exp Adds expected post output for underflow test.
callgrind/tests/runtime_obj_skip_py314.vgtest Adds Python 3.14-specific runtime obj-skip regression test.
callgrind/tests/runtime_obj_skip_py314.stderr.exp Adds expected stderr for Python 3.14 test.
callgrind/tests/runtime_obj_skip_py314.post.exp Adds expected post output for Python 3.14 test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread callgrind/callstack.c
Comment on lines +518 to +520
call_entry* ce = &cs->entry[cs->sp];
ce->jcc = 0;
ce->sp = sps[frame];
Comment thread callgrind/callstack.c Outdated
Comment on lines +504 to +506
CLG_(insert_bbcc_into_hash)(b);
}
caller_bbcc = b;
Comment thread callgrind/dump.c

print_bbccs(trigger, only_current_thread);

CLG_(dump_python_fn_summary)();
Comment thread callgrind/bbcc.c
Comment on lines +753 to +763
if (!skip && CLG_(clo).objs_to_skip_count > 0) {
VG_(message)(Vg_UserMsg,
"obj_skip miss: fn='%s' obj='%s' (len=%lu, %d entries) jmpkind=%d\n",
node->name, obj_name,
VG_(strlen)(obj_name), CLG_(clo).objs_to_skip_count,
(int)jmpkind);
for (int i=0; i<CLG_(clo).objs_to_skip_count; i++)
VG_(message)(Vg_UserMsg, " vs [%d] strcmp=%d (len=%lu) '%s'\n",
i, cmp_results[i],
VG_(strlen)(CLG_(clo).objs_to_skip[i]),
CLG_(clo).objs_to_skip[i]);
Comment thread callgrind/fn.c
Comment on lines +515 to +517
VG_(message)(Vg_UserMsg, "new_fn_node: fn='%s' obj='%s'\n",
fn->name,
(file && file->obj && file->obj->name) ? file->obj->name : "(null)");
Comment thread callgrind/bbcc.c
if (!skip && !node->obj_skip_checked){
HChar* obj_name = node->file->obj->name;
// VG_(printf)(" %s\n", obj_name);
Int cmp_results[CLG_(clo).objs_to_skip_count];
The native-stack seed in CLG_(reconstruct_call_stack_from_native) was
pushing every frame, including frame 0 (the function that fired
CALLGRIND_START_INSTRUMENTATION). Seeded call_entries have jcc=0, and
pop_call_stack only restores cxt when jcc!=0 — so on frame 0's `ret`
the function stayed stuck on top of the cxt chain and phantom-parented
every subsequent call from the real caller.

Visible in the CodSpeed flamegraph as callgrind_start_instrumentation
appearing as the parent of the benchmark body.

Stop the seed loop at frame 1; the trailing epilogue of frame 0 is
harmlessly attributed to the caller's cxt.

Add callgrind/tests/phantom_root as a minimal red/green reproducer:
a start_and_return() function fires the macro and returns, then main
calls leaf(). Post-check asserts leaf's only caller is main.
The previous fix only stopped phantom-parenting from frame 0. The bug
is structural: every seeded call_entry has jcc=0, and pop_call_stack
only restores cxt when jcc!=0. With a multi-frame wrapper chain (e.g.
Node.js: macro -> C export -> N-API trampoline -> user JS), each ret
on the way out of the chain fails to pop the cxt, leaving the deepest
un-returned wrapper stuck on top.

Fix it where the asymmetry lives: in pop_call_stack, also restore cxt
when the entry has a non-zero saved cxt. push_cxt always populates
lower_entry->cxt, while real skip-entries (push_call_stack(skip=True)
without a prior push_cxt) leave it at 0 — so the new branch fires
only for seeded entries.

With cxt now popping cleanly through the whole seeded chain, the
frame-0 carve-out is no longer needed; the seed loop walks all
native frames again.

Extend phantom_root to a 3-deep wrapper chain (wrapper_outer ->
wrapper_middle -> wrapper_inner fires START and unwinds) so the
multi-frame case is locked under test.
The previous fix correctly restored cxt on the way back out of named
wrapper frames, but Node.js / Python JIT scenarios have an anonymous
frame between the codspeed C wrappers and user code (a V8 trampoline
or JIT-generated trampoline, mmap'd in an anonymous region with no
DebugInfo). CLG_(get_fn_node_for_addr) resolves that frame to
fn->name == "???". The seed used to push_cxt that "???" fn onto the
chain, but the JIT frame never RETs via C-ABI before user code runs,
so the "???" entry stayed stuck on top of cxt and phantom-parented
every user fn — visible in a CodSpeed flamegraph as a "???" or
0x{addr} row sitting between the benchmark label and the actual user
function, holding the bulk of the inclusive cost.

Skip push_cxt and BBCC synthesis for these frames. Still push a bare
call_entry so SP-based unwind works for the underflow case; ce->cxt
stays 0, which the pop_call_stack else-branch already treats as
"no-op cxt restore".

Verified against a Node.js benchmark under the codspeed runner: the
"???" root previously holding 99.99% of inclusive cost is gone; user
JS frames now appear directly under the benchmark.

The phantom_root regression test (named-wrapper chain) still passes.
An anonymous-frame C reproducer (mmap'd thunk between main and the
wrapper) was attempted but hit an unrelated (below main) attribution
issue and was dropped — the Node.js manual reproducer covers this
case end-to-end.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants