11 changes: 6 additions & 5 deletions README.md
@@ -361,7 +361,7 @@ For `reprocess`, the pipeline runs `cocoindex` with `cwd` set to the bundle dire

## 6. Graph layer

A deterministic property graph derived from tree-sitter Java parsing lives next to the LanceDB tables under the index directory (default `${JAVA_CODEBASE_RAG_INDEX_DIR:-./.java-codebase-rag}/code_graph.kuzu`). Current ontology version: **14** (see [`docs/EDGE-NAVIGATION.md`](./docs/EDGE-NAVIGATION.md) for edge shapes).
A deterministic property graph derived from tree-sitter Java parsing lives next to the LanceDB tables under the index directory (default `${JAVA_CODEBASE_RAG_INDEX_DIR:-./.java-codebase-rag}/code_graph.kuzu`). Current ontology version: **15** (see [`docs/EDGE-NAVIGATION.md`](./docs/EDGE-NAVIGATION.md) for MCP-traversable edge shapes).

### Node kinds

@@ -370,10 +370,11 @@ A deterministic property graph derived from tree-sitter Java parsing lives next
| `Symbol` | `package`, `file`, `class`, `interface`, `enum`, `record`, `annotation`, `method`, `constructor` |
| `Route` | HTTP endpoint or async listener (one row per declared route) |
| `Client` | Outbound HTTP / messaging call site |
| `UnresolvedCallSite` | Receiver-failure call site (`chained_receiver`, `phantom_unresolved_receiver`) — not a `Symbol`; ids use the `ucs:` prefix |

Unresolved targets become **phantom** nodes (`resolved=false`, FQN guessed from imports / `java.lang`).
Known-receiver-external JDK / Spring / Lombok callees stay on **`CALLS`** as phantom **method** symbols (`resolved=false`). Receiver-failure sites (unresolved receiver or chained receiver) are **`UnresolvedCallSite`** nodes linked by **`UNRESOLVED_AT`** (not in `EDGE_SCHEMA`; use `describe(method_id).unresolved_call_sites`, `neighbors(..., include_unresolved=True)`, or `java-codebase-rag unresolved-calls`).

### Edge types (10)
### Edge types (MCP-traversable)

| Edge | Direction | Meaning |
|---|---|---|
@@ -388,7 +389,7 @@ Unresolved targets become **phantom** nodes (`resolved=false`, FQN guessed from
| `HTTP_CALLS` | client → route | Cross-service HTTP call (caller-side Client to target Route). |
| `ASYNC_CALLS` | producer → route | Cross-service async (Kafka, Rabbit, JMS, …). |

JDK / Spring / Lombok callees are represented as **phantom** method symbols at index time. Caller/callee traversals default to `exclude_external=true` so those edges are filtered by FQN prefix without dropping them from the graph.
Caller/callee traversals (**`find_callers`** / **`find_callees`**) default to `exclude_external=true`, so library FQN prefixes are filtered from results without dropping the edges from the graph.
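
As a rough illustration of FQN-prefix filtering at query time, the sketch below hides rows whose other endpoint looks external while leaving the stored edge list untouched. The prefix tuple, the `fqn` key, and the function name are assumptions made for this example; they are not taken from the project.

```python
# Hypothetical prefix list; the project's real list is not shown in this diff.
EXTERNAL_PREFIXES = ("java.", "javax.", "org.springframework.", "lombok.")


def visible_edges(edges: list[dict], exclude_external: bool = True) -> list[dict]:
    """Return the rows a traversal would show; the input list is never mutated."""
    if not exclude_external:
        return list(edges)
    # str.startswith accepts a tuple, so one call checks every prefix.
    return [e for e in edges if not e["fqn"].startswith(EXTERNAL_PREFIXES)]


stored = [
    {"fqn": "com.acme.orders.OrderService.place"},
    {"fqn": "java.lang.String.trim"},
    {"fqn": "org.springframework.web.client.RestTemplate.getForObject"},
]
filtered = visible_edges(stored)
unfiltered = visible_edges(stored, exclude_external=False)
```

Filtering on read rather than at build time keeps the external edges available to callers that opt out of the default.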

### Call-graph notes

@@ -426,7 +427,7 @@ Resolution order for `microservice`:

Current ontology version is **15**. Any index built before this version must be rebuilt via `cocoindex update ... --full-reprocess -f` or a full `java-codebase-rag reprocess` (no selective flags) so vectors and graph stay aligned. Until re-indexed, the server defensively JSON-decodes string-form list columns so queries do not crash, but filters like `array_contains` will not work.

Ontology **15** (CALLS-NOISE PR-1) adds `CALLS.callee_declaring_role`, `GraphMeta.pass3_unresolved_phantom_receiver` / `pass3_unresolved_chained`, and **supertype-walk dedup** at build time: duplicate interface + concrete candidates at the same call site collapse to one `CALLS` row (row counts per method may drop after re-index, not only a new column). PR-2 adds `edge_filter` on `neighbors`; PR-3 moves true receiver-failure rows off `CALLS`.
Ontology **15** (CALLS-NOISE) adds `CALLS.callee_declaring_role`, `GraphMeta.pass3_unresolved_phantom_receiver` / `pass3_unresolved_chained`, and **supertype-walk dedup** at build time. PR-2 adds `edge_filter` on `neighbors`. **PR-3 (breaking):** receiver-failure sites (`chained_receiver`, unresolved-receiver `phantom`) are no longer `CALLS` rows — they live on `UnresolvedCallSite` + `UNRESOLVED_AT`. Default `neighbors(..., ['CALLS'])` returns fewer rows; use `include_unresolved=True` for a source-ordered interleaved transcript (`row_kind`), `describe(method_id).unresolved_call_sites` (capped), or `java-codebase-rag unresolved-calls list|stats`. Known-receiver-external JDK rows stay on `CALLS` with `resolved=false`.
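
To make the "source-ordered interleaved transcript" concrete, here is a minimal sketch that merges resolved `CALLS` rows and unresolved call sites by `(call_site_line, call_site_byte)`. The `Row` type, the `"calls"` label, and the sample targets are assumptions for illustration; only `row_kind`, `unresolved_call_site`, and the two ordering fields come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    row_kind: str        # "calls" (assumed label) or "unresolved_call_site"
    call_site_line: int
    call_site_byte: int
    target: str          # callee id for CALLS, callee_simple for unresolved sites


def interleave(calls: list[Row], unresolved: list[Row]) -> list[Row]:
    """Merge both streams into one transcript ordered by (line, byte)."""
    return sorted(calls + unresolved, key=lambda r: (r.call_site_line, r.call_site_byte))


calls = [Row("calls", 10, 200, "OrderRepo.save"), Row("calls", 14, 320, "String.trim")]
unresolved = [Row("unresolved_call_site", 12, 260, "build")]
transcript = interleave(calls, unresolved)
kinds = [r.row_kind for r in transcript]
```

A consumer can then branch on `row_kind` to tell a real edge from a site the resolver gave up on.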

Ontology **14** introduces `EDGE_SCHEMA` in `java_ontology.py` as the canonical edge navigation schema (see `docs/EDGE-NAVIGATION.md`). **`HTTP_CALLS` is `Client → Route`** (SCHEMA-V2 PR-B). **`ASYNC_CALLS` is `Producer → Route`** with `DECLARES_PRODUCER` (SCHEMA-V2 PR-C). Run one full reprocess after upgrading through the SCHEMA-V2 sequence (or when you need the v14 ontology gate).

125 changes: 95 additions & 30 deletions build_ast_graph.py
@@ -183,6 +183,18 @@ class CallsRow:
    callee_declaring_role: str = "OTHER"


@dataclass
class UnresolvedCallSiteRow:
    id: str
    caller_id: str
    call_site_line: int
    call_site_byte: int
    arg_count: int
    callee_simple: str
    receiver_expr: str
    reason: str


@dataclass
class DeclaresRow:
    src_id: str
@@ -363,6 +375,7 @@ class GraphTables:
    implements_rows: list[EdgeRow] = field(default_factory=list)
    injects_rows: list[InjectsRow] = field(default_factory=list)
    calls_rows: list[CallsRow] = field(default_factory=list)
    unresolved_call_site_rows: list[UnresolvedCallSiteRow] = field(default_factory=list)
    declares_rows: list[DeclaresRow] = field(default_factory=list)
    routes_rows: list[RouteRow] = field(default_factory=list)
    exposes_rows: list[ExposesRow] = field(default_factory=list)
@@ -1209,6 +1222,34 @@ def _collapse_supertype_duplicates(
return [concrete]


def _unresolved_call_site_id(caller_id: str, call: CallSite) -> str:
    return f"ucs:{caller_id}:{call.line}:{call.byte}"


def _emit_unresolved_call_site(
    tables: GraphTables,
    stats: CallResolutionStats,
    *,
    caller_id: str,
    call: CallSite,
    reason: str,
) -> None:
    tables.unresolved_call_site_rows.append(UnresolvedCallSiteRow(
        id=_unresolved_call_site_id(caller_id, call),
        caller_id=caller_id,
        call_site_line=call.line,
        call_site_byte=call.byte,
        arg_count=call.arg_count,
        callee_simple=call.callee_simple,
        receiver_expr=call.receiver_expr or "",
        reason=reason,
    ))
    if reason == "chained_receiver":
        stats.phantom_chained += 1
    else:
        stats.phantom_other += 1


def _emit_call_edge(
    tables: GraphTables,
    stats: CallResolutionStats,
@@ -1235,14 +1276,7 @@ def _emit_call_edge(
    ))
    stats.total += 1
    stats.by_strategy[strategy] += 1
    if strategy == "chained_receiver":
        stats.phantom_chained += 1
    elif strategy == "phantom":
        # Only count as phantom_other when the receiver itself was unresolvable.
        # High-confidence edges with phantom callees (resolved=False, strategy!=phantom)
        # are not noise; they are known external calls with good receiver resolution.
        stats.phantom_other += 1
    if not resolved and strategy != "chained_receiver":
    if not resolved:
        stats.callee_unresolved += 1


@@ -1268,26 +1302,17 @@ def _resolve_and_emit_call(
    recv_type, strat, conf = _resolve_receiver_type(call, scope=scope, member=member, ast=ast, tables=tables)

    if strat == "chained_receiver":
        # Chained-receiver phantoms have no microservice attribution, so they cannot violate cross-service CALLS invariants.
        pid = _phantom_method_id(
            tables, receiver_fqn=None, receiver_expr=call.receiver_expr,
            callee=call.callee_simple, arg_count=call.arg_count,
        )
        _emit_call_edge(
            tables, stats, src_id=member.node_id, dst_id=pid, call=call,
            confidence=0.0, strategy="chained_receiver", resolved=False,
        _emit_unresolved_call_site(
            tables, stats, caller_id=member.node_id, call=call, reason="chained_receiver",
        )
        return

    if recv_type is None:
        # Unresolved-receiver phantoms also carry empty microservice attribution.
        pid = _phantom_method_id(
            tables, receiver_fqn=None, receiver_expr=call.receiver_expr,
            callee=call.callee_simple, arg_count=call.arg_count,
        )
        _emit_call_edge(
            tables, stats, src_id=member.node_id, dst_id=pid, call=call,
            confidence=0.0, strategy="phantom", resolved=False,
        _emit_unresolved_call_site(
            tables, stats,
            caller_id=member.node_id,
            call=call,
            reason="phantom_unresolved_receiver",
        )
        return

@@ -1413,16 +1438,18 @@ def pass3_calls(tables: GraphTables, asts: dict[str, JavaFileAst], *, verbose: b
            _process_file_calls(file_ast, rel_path, tables, stats)
        except Exception as e:
            log.error("Call extraction failed for %s: %s", rel_path, e)
    pct_chained = 100.0 * stats.phantom_chained / max(1, stats.total)
    pct_callee_unres = 100.0 * stats.callee_unresolved / max(1, stats.total)
    pct_phantom_recv = 100.0 * stats.phantom_other / max(1, stats.total)
    denom_calls = max(1, stats.total)
    denom_sites = max(1, stats.total + stats.phantom_chained + stats.phantom_other)
    pct_chained = 100.0 * stats.phantom_chained / denom_sites
    pct_callee_unres = 100.0 * stats.callee_unresolved / denom_calls
    pct_phantom_recv = 100.0 * stats.phantom_other / denom_sites
    tables.pass3_skipped_cross_service = int(stats.skipped_cross_service)
    tables.pass3_unresolved_phantom_receiver = int(stats.phantom_other)
    tables.pass3_unresolved_chained = int(stats.phantom_chained)
    msg = (
        f"Call resolution: {stats.total} sites, {stats.phantom_chained} chained phantoms "
        f"({pct_chained:.1f}%), {stats.callee_unresolved} unresolved callee "
        f"({pct_callee_unres:.1f}%), {stats.phantom_other} phantom receiver "
        f"Call resolution: {stats.total} CALLS rows, {stats.phantom_chained} chained unresolved "
        f"({pct_chained:.1f}%), {stats.callee_unresolved} unresolved callee on CALLS "
        f"({pct_callee_unres:.1f}%), {stats.phantom_other} phantom-receiver unresolved "
        f"({pct_phantom_recv:.1f}%), {stats.skipped_cross_service} skipped cross-service, "
        f"strategies: {dict(stats.by_strategy)}"
    )
@@ -2406,6 +2433,13 @@ def _micro_factor(member: MemberEntry | None) -> float:
    "confidence DOUBLE, strategy STRING, source STRING, resolved BOOLEAN, "
    "callee_declaring_role STRING)"
)
_SCHEMA_UNRESOLVED_CALL_SITE = (
    "CREATE NODE TABLE UnresolvedCallSite("
    "id STRING, caller_id STRING, call_site_line INT64, call_site_byte INT64, "
    "arg_count INT64, callee_simple STRING, receiver_expr STRING, reason STRING, "
    "PRIMARY KEY(id))"
)
_SCHEMA_UNRESOLVED_AT = "CREATE REL TABLE UNRESOLVED_AT(FROM Symbol TO UnresolvedCallSite)"
_SCHEMA_EXPOSES = (
    "CREATE REL TABLE EXPOSES(FROM Symbol TO Route, "
    "confidence DOUBLE, strategy STRING)"
Expand Down Expand Up @@ -2437,12 +2471,14 @@ def _drop_all(conn: kuzu.Connection) -> None:
"DROP TABLE IF EXISTS HTTP_CALLS",
"DROP TABLE IF EXISTS ASYNC_CALLS",
"DROP TABLE IF EXISTS EXPOSES",
"DROP TABLE IF EXISTS UNRESOLVED_AT",
"DROP TABLE IF EXISTS EXTENDS",
"DROP TABLE IF EXISTS IMPLEMENTS",
"DROP TABLE IF EXISTS INJECTS",
"DROP TABLE IF EXISTS CALLS",
"DROP TABLE IF EXISTS OVERRIDES",
"DROP TABLE IF EXISTS DECLARES",
"DROP TABLE IF EXISTS UnresolvedCallSite",
"DROP TABLE IF EXISTS Symbol",
"DROP TABLE IF EXISTS Route",
"DROP TABLE IF EXISTS Client",
@@ -2458,6 +2494,7 @@ def _drop_all(conn: kuzu.Connection) -> None:
def _create_schema(conn: kuzu.Connection) -> None:
    for stmt in (
        _SCHEMA_NODE,
        _SCHEMA_UNRESOLVED_CALL_SITE,
        _SCHEMA_ROUTE,
        _SCHEMA_CLIENT,
        _SCHEMA_PRODUCER,
@@ -2468,6 +2505,7 @@ def _create_schema(conn: kuzu.Connection) -> None:
        _SCHEMA_DECLARES,
        _SCHEMA_OVERRIDES,
        _SCHEMA_CALLS,
        _SCHEMA_UNRESOLVED_AT,
        _SCHEMA_EXPOSES,
        _SCHEMA_DECLARES_CLIENT,
        _SCHEMA_DECLARES_PRODUCER,
@@ -2743,6 +2781,33 @@ def _write_edges(conn: kuzu.Connection, tables: GraphTables) -> None:
),
})

    _CREATE_UNRESOLVED = (
        "CREATE (:UnresolvedCallSite {"
        "id: $id, caller_id: $caller_id, call_site_line: $line, call_site_byte: $byte, "
        "arg_count: $argc, callee_simple: $callee, receiver_expr: $recv, reason: $reason"
        "})"
    )
    _CREATE_UNRESOLVED_AT = (
        "MATCH (a:Symbol {id: $caller}), (u:UnresolvedCallSite {id: $ucs}) "
        "CREATE (a)-[:UNRESOLVED_AT]->(u)"
    )
    seen_ucs: set[str] = set()
    for row in tables.unresolved_call_site_rows:
        if row.id in seen_ucs:
            continue
        seen_ucs.add(row.id)
        conn.execute(_CREATE_UNRESOLVED, {
            "id": row.id,
            "caller_id": row.caller_id,
            "line": row.call_site_line,
            "byte": row.call_site_byte,
            "argc": row.arg_count,
            "callee": row.callee_simple,
            "recv": row.receiver_expr,
            "reason": row.reason,
        })
        conn.execute(_CREATE_UNRESOLVED_AT, {"caller": row.caller_id, "ucs": row.id})


def _write_routes_and_exposes(conn: kuzu.Connection, tables: GraphTables) -> None:
    for row in tables.routes_rows:
4 changes: 2 additions & 2 deletions docs/AGENT-GUIDE.md
@@ -213,13 +213,13 @@ Identifier lookup; three statuses above. Args: `identifier`, optional `hint_kind

One hop. Args: `ids` (string or array), **`direction`**, **`edge_types`**, `limit` (default 25), `offset`, optional `filter` on the other node, optional **`edge_filter`** (`edge_types` must be exactly `['CALLS']` — no composed dot-keys or second stored label; fail-loud otherwise).

**Multiple origin ids:** each id loads the full CALLS stream (or generic hop) in list order; `offset`/`limit` apply to the **concatenated** edge list (`ids[0]` edges first, then `ids[1]`, …), not global source order across origins — a large first origin can leave no rows for later ids within the same page. High fan-out methods are slow; prefer one id per call or a smaller `limit`.
**Multiple origin ids:** each id loads the full CALLS stream (or generic hop) in list order; `offset`/`limit` apply to the **concatenated** edge list (`ids[0]` edges first, then `ids[1]`, …), not global source order across origins — a large first origin can leave no rows for later ids within the same page. High fan-out methods are slow; prefer one id per call or a smaller `limit`. **Hints:** `TPL_NEIGHBORS_CALLS_HIGH_FANOUT` / `TPL_NEIGHBORS_CALLS_HAS_UNRESOLVED` fire only for a **single** origin id (multi-origin CALLS skips those nudges).
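
The pagination behaviour described above can be sketched as follows. This is a toy model, not the server's code: `page_edges`, the dictionary of edges, and the string edge labels are all invented for the example; only the default `limit` of 25 and the concatenate-then-slice order come from the text.

```python
def page_edges(
    edges_by_origin: dict[str, list[str]],
    ids: list[str],
    offset: int = 0,
    limit: int = 25,
) -> list[str]:
    """Concatenate each origin's edges in `ids` order, then slice one page."""
    concatenated: list[str] = []
    for origin in ids:
        concatenated.extend(edges_by_origin.get(origin, []))
    return concatenated[offset:offset + limit]


edges = {
    "m1": [f"m1->c{i}" for i in range(30)],  # high fan-out first origin
    "m2": ["m2->x", "m2->y"],
}
first_page = page_edges(edges, ["m1", "m2"])
second_page = page_edges(edges, ["m1", "m2"], offset=25)
```

In this model the first page is filled entirely by `m1`, and `m2` only shows up once the offset moves past the first origin's rows, which is why one id per call is the safer pattern.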

Returns **edges** with `attrs` (`confidence`, `strategy`, `match`, … on cross-service edges) and **`other`** node.

**Cross-service edges** (`HTTP_CALLS`, `ASYNC_CALLS`): read `attrs.confidence` and `attrs.match` — low confidence or `unresolved`/`phantom`/`ambiguous` means treat as a resolver signal, not ground truth.

**`CALLS` edges:** source-ordered (`call_site_line`, `call_site_byte`). `attrs.resolved=false` or low `attrs.confidence` may be JDK/external or unresolved static sites — still a lower bound, not exhaustive runtime behaviour. **`filter` + `edge_filter` together** load the ordered CALLS stream then apply callee `NodeFilter` in Python — expect higher latency on hot methods than `edge_filter` alone. Optional **`edge_filter`** projects before pagination: `min_confidence`; `include_strategies` / `exclude_strategies` (mutually exclusive); `callee_declaring_role`, `callee_declaring_roles`, `exclude_callee_declaring_roles` (`["OTHER"]` also drops known-external rows). **`filter.role` filters the neighbor method (usually `OTHER`), not the callee stereotype** — use `edge_filter.callee_declaring_role` for repository/service hops. **`exclude_external` applies to `find_callers` / `find_callees` only** (FQN-prefix); trim JDK noise on CALLS via `edge_filter`. Accessor noise: role excludes help; getter/setter heuristics in [`propose/AGENT-SKILLS-AND-COMMANDS-PROPOSE.md`](../propose/AGENT-SKILLS-AND-COMMANDS-PROPOSE.md) `/mini-map`.
**`CALLS` edges:** source-ordered (`call_site_line`, `call_site_byte`). After ontology 15 PR-3, true receiver-failure sites are **not** on `CALLS` — they are `UnresolvedCallSite` nodes (`reason`: `chained_receiver` or `phantom_unresolved_receiver`; ids use the `ucs:` prefix, `other.kind=unresolved_call_site` — **not** describable via `describe(id=…)`). `UNRESOLVED_AT` is graph storage only (not in `EDGE_SCHEMA` / `neighbors` edge_types). `attrs.resolved=false` on remaining `CALLS` rows means known-receiver-external (JDK/Spring) callees, not receiver failure. **`include_unresolved=True`** (CALLS + `direction=out` only) interleaves unresolved sites with resolved `CALLS` (`row_kind` discriminator); **mutually exclusive with `edge_filter`**. **`dedup_calls=True`** collapses identical `(origin, callee)` `CALLS` to one row with `call_site_lines`. **`filter` + `edge_filter` together** load the ordered CALLS stream then apply callee `NodeFilter` in Python — expect higher latency on hot methods than `edge_filter` alone. Optional **`edge_filter`** projects before pagination: `min_confidence`; `include_strategies` / `exclude_strategies` (mutually exclusive); `callee_declaring_role`, `callee_declaring_roles`, `exclude_callee_declaring_roles` (`["OTHER"]` also drops known-external rows). **`filter.role` filters the neighbor method (usually `OTHER`), not the callee stereotype** — use `edge_filter.callee_declaring_role` for repository/service hops. **`exclude_external` applies to `find_callers` / `find_callees` only** (FQN-prefix); trim JDK noise on `neighbors` CALLS via `edge_filter`. Accessor noise: role excludes help; getter/setter heuristics in [`propose/AGENT-SKILLS-AND-COMMANDS-PROPOSE.md`](../propose/AGENT-SKILLS-AND-COMMANDS-PROPOSE.md) `/mini-map`.
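
A minimal sketch of what `dedup_calls=True` is described as doing: collapsing rows that share `(origin, callee)` into one row that carries every line in `call_site_lines`. The dictionary row shape and the function name are assumptions for this example, and the real implementation may keep additional attributes.

```python
def dedup_calls(rows: list[dict]) -> list[dict]:
    """Collapse rows with the same (origin, callee) pair, keeping first-seen order."""
    collapsed: dict[tuple[str, str], dict] = {}
    for row in rows:
        key = (row["origin"], row["callee"])
        if key not in collapsed:
            collapsed[key] = {
                "origin": row["origin"],
                "callee": row["callee"],
                "call_site_lines": [],
            }
        collapsed[key]["call_site_lines"].append(row["call_site_line"])
    return list(collapsed.values())


rows = [
    {"origin": "m1", "callee": "Repo.save", "call_site_line": 10},
    {"origin": "m1", "callee": "Log.info", "call_site_line": 11},
    {"origin": "m1", "callee": "Repo.save", "call_site_line": 20},
]
deduped = dedup_calls(rows)
```

Because the input is already source-ordered, each `call_site_lines` list comes out in ascending order without an extra sort.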

### Ontology glossary

17 changes: 17 additions & 0 deletions docs/EDGE-NAVIGATION.md
@@ -252,3 +252,20 @@
- `member_subject`: neighbors(['{id}'],'out',['DECLARES_PRODUCER']) then neighbors(producer_ids,'out',['ASYNC_CALLS'])
- `route_subject`: neighbors(['{id}'],'in',['ASYNC_CALLS']) then neighbors(producer_ids,'in',['DECLARES_PRODUCER']) for declaring method
- `alien_subject`: ASYNC_CALLS connects Producer→Route; use DECLARES_PRODUCER from a method Symbol, or neighbors(producer_id,'out',['ASYNC_CALLS']) from a Producer id


## Graph storage (not MCP `neighbors` edge_types)

### `UnresolvedCallSite` + `UNRESOLVED_AT` (ontology 15 / CALLS-NOISE PR-3)

Receiver-failure call sites (`chained_receiver`, `phantom_unresolved_receiver`) are **not** `CALLS` rows. They are `UnresolvedCallSite` nodes (`id` prefix `ucs:`) linked from the caller method Symbol via `UNRESOLVED_AT`.

| Surface | How to read them |
| --- | --- |
| `describe(method_id)` | `record.data.unresolved_call_sites` (capped at 5) + footer when more exist |
| `neighbors(..., ['CALLS'], include_unresolved=True)` | Interleaved transcript; `row_kind='unresolved_call_site'`; `other.kind=unresolved_call_site` |
| CLI | `java-codebase-rag unresolved-calls list|stats` |

- **Not** in `EDGE_SCHEMA` — do not pass `UNRESOLVED_AT` to `neighbors(edge_types=…)`.
- **`describe(ucs:…)`** is invalid (fail-loud); describe the **caller method** instead.
- Fresh graphs: `CALLS.strategy` no longer includes `phantom` or `chained_receiver` for receiver failure (those literals remain on HTTP/ASYNC `match` and brownfield resolver sets).
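
The two `describe` behaviours above (a cap of 5 with a footer, and fail-loud rejection of `ucs:` ids) could look roughly like this. Both helper names, the footer wording, and the use of `ValueError` are assumptions for illustration; the source only states the cap, the footer's existence, and that the call is invalid.

```python
UCS_PREFIX = "ucs:"
DESCRIBE_CAP = 5  # cap stated in the table above


def summarize_unresolved(sites: list[str], cap: int = DESCRIBE_CAP) -> dict:
    """Return at most `cap` sites plus a footer when some were left out."""
    shown = sites[:cap]
    hidden = len(sites) - len(shown)
    footer = f"{hidden} more unresolved call sites not shown" if hidden > 0 else None
    return {"unresolved_call_sites": shown, "footer": footer}


def check_describable(node_id: str) -> None:
    """Fail loudly for UnresolvedCallSite ids; describe the caller method instead."""
    if node_id.startswith(UCS_PREFIX):
        raise ValueError(f"{node_id!r} is an UnresolvedCallSite id and cannot be described")


summary = summarize_unresolved([f"ucs:caller:{line}:0" for line in range(7)])
```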