fix: flamegraph regressions#12
Conversation
Adds CLG_(dump_python_fn_summary) and calls it from dump_profile. For each fn_node whose obj name contains 'python', prints the fn name, obj path, skip flag, and obj_skip_checked flag, with a final tally. Lets us see, per dump, whether interpreter fns ever had their skip check run.
Fires once per fn_node, before any call edge is built. Lets us see the exact obj path string Callgrind stores for a function (e.g. PyEval) so we can compare it against the obj_skip list independently of any strcmp result.
The obj-skip check was gated on jmpkind == jk_Call. When a function in a skipped object was entered via jk_Jump or fall-through (interpreter dispatch, tail calls, perf trampoline, JIT), the skip flag never latched and the function leaked into the dump as its own fn= block. Also instrument the cxt==0 forced push_cxt path with a diagnostic line so we can measure the residual leak when a skipped fn is forced into a top-level context after an instrumentation start or call-stack underflow.
Minimal C test that triggers the leak path where a function in a skipped object becomes a top-level fn= block in the dump. The trigger: the lib calls CALLGRIND_START_INSTRUMENTATION from inside one of its own functions, so the first BB callgrind sees post-start lives in the skipped object. With cxt == 0 at that point, setup_bbcc force-pushes the skipped fn as the new top context and it leaks into the output. Currently RED: post-check fails because fn=skipme_run appears in the output despite skipme_run's containing .so being on the skip list.
When the (cxt == 0) clause in setup_bbcc force-pushes a skipped fn — e.g. the first BB after CALLGRIND_START_INSTRUMENTATION lives in a skipped object — the BBCC ends up with cxt->fn[0]->skip == True. Without filtering, those BBCCs emit a top-level fn= block and the skipped fn leaks into the dump. Filter them out at dump time in print_bbccs_of_thread, right before print_fn_pos would emit the ob=/fl=/fn= header. The call edges from non-skipped callers into skipped fns (cfn=) are unaffected because they're emitted from the caller's BBCC, whose context is not skipped. Also broadens the runtime_obj_skip_c post-check to grep for any fn=skipme*, since the actual leaked fn in the repro is skipme_run (the one calling START_INSTRUMENTATION), not skipme_func.
This reverts commit bcfb4c9.
Two diagnostic logs to attribute every (cxt==0) push_cxt event to its root cause: 1. handleUnderflow: prints the BB address, fn name, obj, and skip flag for the fn that's about to be force-pushed after a call-stack underflow. Tells us whether leaks are driven by signals, longjmp, or shadow-stack quality. 2. set_instrument_state: logs every ON/OFF transition with the cxt and fn-stack depth at that moment. Distinguishes 'fresh start with no cxt' from 'stop/start cycle that preserved cxt'. Together with the existing 'push_cxt FORCED' log, these let us bucket the leaked frames in real Python runs by their root cause.
Triggers the call-stack-underflow leak channel observed in the Python case (28 underflow events / run, almost all libpython interpreter frames). Mechanism: - Lib runs recursive skipme_recurse(N) with instrumentation OFF, so callgrind never sees the calls and its csp stays at 0. - At the deepest frame (n==0), CALLGRIND_START_INSTRUMENTATION fires. - Each RET on the way back hits csp == 0, triggers handleUnderflow, resets cxt to 0, and force-pushes the fn we're returning into. - Because that fn is in the skipped lib, it leaks as a top-level fn= block in the dump — N times for an N-deep recursion. With depth=5 the diagnostic logs show 1 (cxt==0) push + 6 underflow resets (5x skipme_recurse + 1x skipme_run), and the .out has fn=skipme_run and fn=skipme_recurse as top-level blocks.
…el cxt When setup_bbcc's (cxt==0) clause or handleUnderflow would force-push a skipped fn into the current context, push a synthetic (skipped) fn instead. The skipped fn keeps its costs (routed normally through the sentinel cxt) but never surfaces as its own fn= block — the dump shows a single ob=??? fl=(callgrind-internal) fn=(skipped) block aggregating all leaked frames. The sentinel itself has skip=False so the (cxt==0 && skip) substitution doesn't recurse on it. Created lazily on first need via a singleton in fn.c, attached to the anonymous '???' obj. Verified against both C reproducers (runtime_obj_skip_c and runtime_obj_skip_underflow): no skipme_* fn= blocks appear, totals are preserved. Verified against a non-skipped-attribution test that main / do_main_work still emit normally; the sentinel only engages on the leak paths.
…a sentinel cxt" This reverts commit 4a7c1d5.
When CALLGRIND_START_INSTRUMENTATION fires mid-stack (typical for pytest-codspeed: Python reaches the macro several libpython frames deep), callgrind's csp is 0 but the real stack is non-empty. Every subsequent client `ret` peels a frame callgrind never saw the matching `call` for, trips handleUnderflow, and leaks the returned-into fn as a top-level fn= block — polluting the flamegraph with phantom roots like _PyEval_EvalFrameDefault, PyObject_Vectorcall, etc. Reconstruct the shadow stack from VG_(get_StackTrace) on the OFF->ON transition. For each native frame: push a (jcc=0, skip-style) call_entry with the captured SP and ret_addr=caller_ip+1. For non-skipped caller frames, synthesize a zero-instruction BBCC tagged with that frame's cxt so obj-skip's `nonskipped` mechanism has a target to fold skipped-subtree costs into. Anonymous IPs (Python JIT regions, CRT glue) resolve via the existing `???` obj path in get_obj_node, so no special trimming is needed. Un-static new_recursion and insert_bbcc_into_hash (renamed to CLG_-prefixed) so callstack.c can reuse them.
…-checks - runtime_obj_skip_c.vgtest: post-check now greps for any fn=skipme_*. The previous check only looked for skipme_func, which obj-skip folded into the caller so the test passed even with the bug live. - runtime_obj_skip_underflow: wire .vgtest/.post.exp/.stderr.exp + Makefile.am entries (was only checked-in as bare .c sources). - runtime_obj_skip_py314.vgtest: same shim driver as the 3.13 test, exercised with python3.14 (tail-call interpreter). Skipped by prereq if python3.14 isn't on PATH. - filter_stderr: drop diagnostic logging lines added during this investigation (instrument_state, obj_skip HIT/miss, underflow reset, reconstruct_call_stack, python fn summary, fn=...obj=... summary) so .stderr.exp matching stays stable. All four reproducers go RED before the fix, GREEN after.
75ab97d to
dfe44cc
Compare
Merging this PR will not alter performance
Warning Please fix the performance issues or acknowledge them on CodSpeed. Performance Changes
Tip Investigate this regression by commenting Comparing Footnotes
|
…eeded Move -l:runtime_obj_skip_*_lib.so from LDFLAGS to LDADD so the shared lib appears after the .o files on the link line. On Ubuntu 24.04, ld defaults to --as-needed and was dropping the lib before skipme_run was referenced, causing an undefined-reference link error in CI.
dash's 'command -v' returns 127 when the program is missing, which vg_regtest treats as a fatal abort (it only accepts 0=run or 1=skip). Append '|| exit 1' so the prereq cleanly skips on hosts without python3.14 (e.g. Ubuntu 22.04 CI runners).
There was a problem hiding this comment.
Pull request overview
This PR fixes runtime obj-skip flamegraph regressions in callgrind by improving skip detection beyond direct calls, reconstructing callgrind’s shadow call stack when instrumentation starts mid-stack, and adding regression coverage for C and Python reproducers.
Changes:
- Applies
obj-skipchecks on every BB entry instead of onlyjk_Call. - Adds native stack reconstruction on
CALLGRIND_START_INSTRUMENTATION. - Adds regression tests for skipped-object leaks, underflow behavior, and Python 3.14.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
callgrind/bbcc.c |
Extends obj-skip latching and exposes BBCC helpers for reconstruction. |
callgrind/callstack.c |
Adds native stack reconstruction for START instrumentation. |
callgrind/clo.c |
Adds diagnostic output when registering skipped objects. |
callgrind/dump.c |
Adds Python function diagnostic summary on profile dumps. |
callgrind/fn.c |
Adds function lookup/count helpers and diagnostic logging. |
callgrind/global.h |
Declares new helper APIs used across callgrind files. |
callgrind/main.c |
Calls reconstruction after START instrumentation. |
callgrind/tests/Makefile.am |
Builds and distributes new runtime obj-skip regression tests. |
callgrind/tests/filter_stderr |
Filters new diagnostic output from test stderr. |
callgrind/tests/runtime_obj_skip_c.c |
Adds minimal C driver for skipped-object START regression. |
callgrind/tests/runtime_obj_skip_c_lib.c |
Adds skipped shared library fixture for C regression. |
callgrind/tests/runtime_obj_skip_c.vgtest |
Adds C runtime obj-skip regression test. |
callgrind/tests/runtime_obj_skip_c.stderr.exp |
Adds expected stderr for C regression test. |
callgrind/tests/runtime_obj_skip_c.post.exp |
Adds expected post output for C regression test. |
callgrind/tests/runtime_obj_skip_underflow.c |
Adds C driver for underflow-channel regression. |
callgrind/tests/runtime_obj_skip_underflow_lib.c |
Adds recursive skipped-library fixture. |
callgrind/tests/runtime_obj_skip_underflow.vgtest |
Adds underflow runtime obj-skip regression test. |
callgrind/tests/runtime_obj_skip_underflow.stderr.exp |
Adds expected stderr for underflow test. |
callgrind/tests/runtime_obj_skip_underflow.post.exp |
Adds expected post output for underflow test. |
callgrind/tests/runtime_obj_skip_py314.vgtest |
Adds Python 3.14-specific runtime obj-skip regression test. |
callgrind/tests/runtime_obj_skip_py314.stderr.exp |
Adds expected stderr for Python 3.14 test. |
callgrind/tests/runtime_obj_skip_py314.post.exp |
Adds expected post output for Python 3.14 test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| call_entry* ce = &cs->entry[cs->sp]; | ||
| ce->jcc = 0; | ||
| ce->sp = sps[frame]; |
| CLG_(insert_bbcc_into_hash)(b); | ||
| } | ||
| caller_bbcc = b; |
|
|
||
| print_bbccs(trigger, only_current_thread); | ||
|
|
||
| CLG_(dump_python_fn_summary)(); |
| if (!skip && CLG_(clo).objs_to_skip_count > 0) { | ||
| VG_(message)(Vg_UserMsg, | ||
| "obj_skip miss: fn='%s' obj='%s' (len=%lu, %d entries) jmpkind=%d\n", | ||
| node->name, obj_name, | ||
| VG_(strlen)(obj_name), CLG_(clo).objs_to_skip_count, | ||
| (int)jmpkind); | ||
| for (int i=0; i<CLG_(clo).objs_to_skip_count; i++) | ||
| VG_(message)(Vg_UserMsg, " vs [%d] strcmp=%d (len=%lu) '%s'\n", | ||
| i, cmp_results[i], | ||
| VG_(strlen)(CLG_(clo).objs_to_skip[i]), | ||
| CLG_(clo).objs_to_skip[i]); |
| VG_(message)(Vg_UserMsg, "new_fn_node: fn='%s' obj='%s'\n", | ||
| fn->name, | ||
| (file && file->obj && file->obj->name) ? file->obj->name : "(null)"); |
| if (!skip && !node->obj_skip_checked){ | ||
| HChar* obj_name = node->file->obj->name; | ||
| // VG_(printf)(" %s\n", obj_name); | ||
| Int cmp_results[CLG_(clo).objs_to_skip_count]; |
The native-stack seed in CLG_(reconstruct_call_stack_from_native) was pushing every frame, including frame 0 (the function that fired CALLGRIND_START_INSTRUMENTATION). Seeded call_entries have jcc=0, and pop_call_stack only restores cxt when jcc!=0 — so on frame 0's `ret` the function stayed stuck on top of the cxt chain and phantom-parented every subsequent call from the real caller. Visible in the CodSpeed flamegraph as callgrind_start_instrumentation appearing as the parent of the benchmark body. Stop the seed loop at frame 1; the trailing epilogue of frame 0 is harmlessly attributed to the caller's cxt. Add callgrind/tests/phantom_root as a minimal red/green reproducer: a start_and_return() function fires the macro and returns, then main calls leaf(). Post-check asserts leaf's only caller is main.
The previous fix only stopped phantom-parenting from frame 0. The bug is structural: every seeded call_entry has jcc=0, and pop_call_stack only restores cxt when jcc!=0. With a multi-frame wrapper chain (e.g. Node.js: macro -> C export -> N-API trampoline -> user JS), each ret on the way out of the chain fails to pop the cxt, leaving the deepest un-returned wrapper stuck on top. Fix it where the asymmetry lives: in pop_call_stack, also restore cxt when the entry has a non-zero saved cxt. push_cxt always populates lower_entry->cxt, while real skip-entries (push_call_stack(skip=True) without a prior push_cxt) leave it at 0 — so the new branch fires only for seeded entries. With cxt now popping cleanly through the whole seeded chain, the frame-0 carve-out is no longer needed; the seed loop walks all native frames again. Extend phantom_root to a 3-deep wrapper chain (wrapper_outer -> wrapper_middle -> wrapper_inner fires START and unwinds) so the multi-frame case is locked under test.
The previous fix correctly restored cxt on the way back out of named
wrapper frames, but Node.js / Python JIT scenarios have an anonymous
frame between the codspeed C wrappers and user code (a V8 trampoline
or JIT-generated trampoline, mmap'd in an anonymous region with no
DebugInfo). CLG_(get_fn_node_for_addr) resolves that frame to
fn->name == "???". The seed used to push_cxt that "???" fn onto the
chain, but the JIT frame never RETs via C-ABI before user code runs,
so the "???" entry stayed stuck on top of cxt and phantom-parented
every user fn — visible in a CodSpeed flamegraph as a "???" or
0x{addr} row sitting between the benchmark label and the actual user
function, holding the bulk of the inclusive cost.
Skip push_cxt and BBCC synthesis for these frames. Still push a bare
call_entry so SP-based unwind works for the underflow case; ce->cxt
stays 0, which the pop_call_stack else-branch already treats as
"no-op cxt restore".
Verified against a Node.js benchmark under the codspeed runner: the
"???" root previously holding 99.99% of inclusive cost is gone; user
JS frames now appear directly under the benchmark.
The phantom_root regression test (named-wrapper chain) still passes.
An anonymous-frame C reproducer (mmap'd thunk between main and the
wrapper) was attempted but hit an unrelated (below main) attribution
issue and was dropped — the Node.js manual reproducer covers this
case end-to-end.
No description provided.