Skip to content

fix(storage): ingest-sarif upsert no longer clobbers a finding's host node into a placeholder#167

Merged
theagenticguy merged 3 commits into
mainfrom
fix/ingest-sarif-clobbers-finding-host-node
May 29, 2026
Merged

fix(storage): ingest-sarif upsert no longer clobbers a finding's host node into a placeholder#167
theagenticguy merged 3 commits into
mainfrom
fix/ingest-sarif-clobbers-finding-host-node

Conversation

@theagenticguy
Copy link
Copy Markdown
Owner

Summary

Data-corruption fix. Any function, method, or class that contains a scanner finding was silently deleted from the code graph and replaced by a <placeholder> Route node. Cross-module context/impact then treated the host symbol as an unreferenced island.

This is the root cause of the field report's Issue 1 (headline): get_bedrock_client showed up orphaned into <external>/<unresolved> nodes, and blast-radius didn't cross module boundaries.

Root cause (found by execution, not inspection)

Two earlier hypotheses were refuted by repro before this one was confirmed:

  • The report guessed a SCIP FQN-vs-filepath alias mismatch — wrong; scip-python emits one symbol string for the def and every ref.
  • A multi-agent analysis guessed the lbug node COPY drops the row via empty-string sentinel struct-typing — wrong; encodeNodeCol passes all strings through identically. A 2-node and a full-584-node bulkLoad both persisted the node fine.

The actual mechanism:

  1. codehub analyze does a replace-mode bulkLoad of all real nodes. The host Function persists correctly here (verified).
  2. Then scan + ingest-sarif does a second bulkLoad(findingGraph, { mode: "upsert" }) where findingGraph holds only Finding nodes + FOUND_IN edges.
  3. A Semgrep finding at client.py:165 sits inside get_bedrock_client, so ingest adds Finding → Function:…get_bedrock_client. That Function id is not in the upsert batch.
  4. synthesizePlaceholderNodes minted a kind:Route placeholder for the missing endpoint, and upsert mergeNodes (DETACH DELETE + re-insert) destroyed the real Function.

Confirmed on disk: id Function:…get_bedrock_client stored kind=Route, name=<the id>, lines=null — the placeholder shape. The @cache decorator and src/ layout were red herrings; the discriminator is "host of a scanner finding", so this corrupted every indexed repo.

Fix

synthesizePlaceholderNodes now takes an optional alreadyPersisted id set. In upsert mode, #bulkLoadOnce computes it via filterExistingNodeIds(edgeEndpointIds(edges)) (a thin wrapper over the existing listNodes({ ids }) finder) and skips synthesis for any edge endpoint that already exists in the store. Genuine orphans (unresolved FETCHES placeholders) still get synthesized so the edge COPY's PK constraint holds. Replace mode is unchanged (store was just truncated → undefined).

Test plan

  • End-to-end on the subject repo: after re-analyze, context get_bedrock_client resolves the Function with 11 inbound (incl. cross-module mcp_server.py:nova_web_grounding) + 5 outbound; impact tiers it CRITICAL, 124 reachable, crossing the module boundary. Before: orphaned, island.
  • 2 regression tests in graphdb-roundtrip.test.ts: (1) upsert with a FOUND_IN edge to an existing node keeps it a Function (not clobbered into a Route placeholder); (2) a genuinely-orphan FETCHES endpoint still synthesizes a placeholder.
  • @opencodehub/storage: 161/161 pass (1 pre-existing vector-backend skip).
  • tsc --noEmit workspace clean; biome check clean.

Follow-ups (separate PRs, from the same field report)

Issues 2–6 (vulture .venv exclude, MCP-only→CLI wrappers, sql docs, status retrieval mode, doctor bandit[sarif] probe) plus the confirmed Python import-parser bugs (multi-line parenthesized imports dropped; src-layout dotted-import resolution) are queued.

… node into a placeholder

A function/method/class that contains a scanner finding silently vanished
from the graph, replaced by a <placeholder> Route node. Cross-module
context/impact then treated the host as an unreferenced island — the
field-report Issue 1 symptom (get_bedrock_client orphaned, blast-radius
broken across module boundaries).

Root cause (found by execution, not inspection — two prior hypotheses, a
SCIP alias mismatch and an lbug COPY string-coercion drop, were both
refuted by repro):

  codehub analyze does a replace-mode bulkLoad of all real nodes (the host
  Function persists fine). Then scan + ingest-sarif does a SECOND bulkLoad
  in UPSERT mode with a graph holding ONLY Finding nodes + FOUND_IN edges.
  A finding inside get_bedrock_client adds Finding -> Function:...get_bedrock_client.
  That Function id is absent from the upsert batch, so
  synthesizePlaceholderNodes minted a kind:Route placeholder for it, and
  upsert mergeNodes (DETACH DELETE + re-insert) DESTROYED the real Function.
  Confirmed on disk: id Function:...get_bedrock_client stored kind=Route,
  name=<the id>, lines=null. The decorator and src-layout were red herrings;
  the discriminator is 'host of a scanner finding', so this corrupted every
  indexed repo, not just this one.

Fix: synthesizePlaceholderNodes takes an alreadyPersisted id set. In upsert
mode #bulkLoadOnce computes it via filterExistingNodeIds(edgeEndpointIds(edges))
(reuses listNodes({ids})) and skips synthesis for endpoints that already
exist in the store. Genuine orphans (unresolved FETCHES) still get a
placeholder so the edge COPY's PK constraint holds.

Verified end-to-end: after re-analyze, context get_bedrock_client shows 11
inbound (incl mcp_server.nova_web_grounding) + 5 outbound; impact tiers it
CRITICAL crossing the module boundary. Two regression tests in
graphdb-roundtrip.test.ts: (1) upsert edge to existing node keeps it a
Function, (2) genuine orphan endpoint still synthesizes. Storage 161/161
pass (1 pre-existing vector skip), tsc + biome clean.
@theagenticguy theagenticguy enabled auto-merge (squash) May 29, 2026 21:08
@theagenticguy theagenticguy merged commit 9e99936 into main May 29, 2026
32 of 34 checks passed
@theagenticguy theagenticguy deleted the fix/ingest-sarif-clobbers-finding-host-node branch May 29, 2026 21:28
theagenticguy added a commit that referenced this pull request May 29, 2026
…-module edges) (#170)

## Summary

Two bugs in Python import handling left cross-module `CALLS`/`IMPORTS`
edges unresolved, so a multi-file Python package read as disconnected
islands in the graph. Both are part of the field-report's **Issue 1**
(Bugs A + B) and are the foundation cross-module `context`/`impact`
depend on.

### Bug A — multi-line parenthesized imports silently dropped
`extractPyImports` was line-based. A multi-line `from m import (\n a,\n
b,\n)` matched the from-regex on the **first line only** with `rest =
"("` → 0 names → the whole import discarded. black/ruff wrap every long
import list this way, so most real modules lost their imports entirely.

**Fix:** `joinLogicalLines()` collapses physical lines into logical
lines across an open paren (depth count) or a trailing backslash, before
the per-line regex runs.

### Bug B — src-layout dotted absolute imports stub as `<external>`
`resolveImportTarget` only handled `./`, `../`, `/` specifiers. A dotted
**absolute** import (`pkg.client` → `src/pkg/client.py`) never resolved,
so it was emitted as a `CodeElement:<external>` stub instead of linking
the real file.

**Fix:** `resolveDottedAbsoluteImport()` converts dots→slashes and
probes the module at the repo root and under detected src-layout roots
(`discoverSourceRoots`), gated to `namespace` import-semantics languages
(Python). A dotted specifier that resolves to no in-repo file is still
treated as third-party (external stub) — unchanged.

## Test plan

- [x] **End-to-end on `ngs-research-agent`:** zero `<external>` stubs
for `ngs_research_agent.client`; real file→file IMPORTS edges
(`mcp_server.py → client.py`, `test_mcp_server.py → client.py`) now
land.
- [x] **Bug A regression** (`python.test.ts`): multi-line parenthesized
+ backslash-continued + single-line-parens import extraction.
- [x] **Bug B regression** (`parse.test.ts`): a src-layout dotted import
produces a file IMPORTS edge and **no** `<external>` stub.
- [x] `@opencodehub/ingestion` 604/604; `tsc` + `biome` clean.

## Relationship to PR #167
#167 fixed the `ingest-sarif` node-clobber that hid the host Function
node. This PR fixes the *import resolution* so cross-module edges
between real nodes actually bind. Together they restore trustworthy
cross-module blast-radius for Python.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant