Merged
2 changes: 1 addition & 1 deletion .cursor/rules/project-overview.mdc
@@ -19,7 +19,7 @@ when needed.
## Where to look

- `README.md` — feature surface, env vars, ranking, capabilities,
MCP tools (`search` / `find` / `describe` / `neighbors`), `java-codebase-rag` CLI,
MCP tools (`search` / `find` / `describe` / `neighbors` / `resolve`), `java-codebase-rag` CLI,
"Re-index required" callouts. The current
`ontology_version` is **12** (`@CodebaseHttpClient` rename + shared `CodebaseHttpMethod` enum;
inbound `@CodebaseHttpRoute` replaces same-method built-in HTTP rows; still
2 changes: 1 addition & 1 deletion AGENTS.md
@@ -8,7 +8,7 @@ for tools that don't read `.cursor/rules/`.
## Where to look

- `README.md` — feature surface, env vars, ranking, capabilities,
MCP tool list (now `search` / `find` / `describe` / `neighbors`),
MCP tool list (`search` / `find` / `describe` / `neighbors` / `resolve`),
CLI ops (`java-codebase-rag --help`), and "Re-index required" callouts.
**`ontology_version` is currently 12** (HTTP brownfield rename + `CodebaseHttpMethod` enum + inbound HTTP layer-C replace; see README graph section).
- [`docs/JAVA-CODEBASE-RAG-CLI.md`](./docs/JAVA-CODEBASE-RAG-CLI.md) — operator guide for the `java-codebase-rag` CLI (`init` / `increment` / `reprocess` / `erase`, `meta`, `tables`, `diagnose-ignore`, `analyze-pr`; hidden `refresh` alias → `reprocess` — see that doc).
7 changes: 4 additions & 3 deletions README.md
@@ -2,12 +2,12 @@

A graph-native code intelligence layer for Java microservice estates, exposed to LLM agents via the **Model Context Protocol (MCP)**.

The system extracts a deterministic property graph from Java source (tree-sitter), stores it in **Kuzu** (graph) alongside a **LanceDB** vector index (chunks), and exposes a deliberately small MCP surface — **four tools**: `search`, `find`, `describe`, `neighbors` — that collapse onto three primitive agent operations: **locate**, **inspect**, **walk**.
The system extracts a deterministic property graph from Java source (tree-sitter), stores it in **Kuzu** (graph) alongside a **LanceDB** vector index (chunks), and exposes a deliberately small MCP surface — **five tools**: `search`, `find`, `describe`, `neighbors`, `resolve` — that collapse onto three primitive agent operations: **locate**, **inspect**, **walk**.

> **What this MCP is:** a **GPS for code navigation**, not a reasoning engine.
> Agents use a simple loop:
>
> 1. **Locate** entry nodes (`search` / `find`)
> 1. **Locate** entry nodes (`search` / `find`, or identifier-shaped **`resolve`**)
> 2. **Inspect** what a node is (`describe`)
> 3. **Walk** one hop at a time (`neighbors`) until enough evidence is gathered
>
@@ -229,7 +229,7 @@ Edit `claude_desktop_config.json` (macOS: `~/Library/Application Support/Claude/

### Driving the MCP from an agent

- **[`docs/AGENT-GUIDE.md`](./docs/AGENT-GUIDE.md)** — copy-paste into `QWEN.md` / `CLAUDE.md` / `AGENTS.md`. Covers the four MCP tools, the shared `NodeFilter`, the edge-type taxonomy, required `neighbors` arguments, the ontology glossary (currently **v12**), the recovery playbook, and slash-style aliases.
- **[`docs/AGENT-GUIDE.md`](./docs/AGENT-GUIDE.md)** — copy-paste into `QWEN.md` / `CLAUDE.md` / `AGENTS.md`. Covers the five MCP tools, the shared `NodeFilter`, the edge-type taxonomy, required `neighbors` arguments, the ontology glossary (currently **v12**), the recovery playbook, and slash-style aliases.
- **[`docs/skills/java-codebase-explore.md`](./docs/skills/java-codebase-explore.md)** — exploration **strategy** (missions, fallbacks, anti-capabilities, stopping rules); AGENT-GUIDE remains the **operating manual** for tool shapes and recovery.
- **[`docs/MANUAL-VERIFICATION-CHECKLIST.md`](./docs/MANUAL-VERIFICATION-CHECKLIST.md)** — 7-phase agent-driven verification you run after indexing your real project. Each item has a copy-paste prompt and calibration data from `tests/bank-chat-system`.
- **[`automation/cursor_propose_only/README.md`](./automation/cursor_propose_only/README.md)** — optional proposal orchestration workflow (single-command autopilot, planning bundles, and automated execution/review loops).
@@ -243,6 +243,7 @@ Edit `claude_desktop_config.json` (macOS: `~/Library/Application Support/Claude/
| `search` | Locate nodes by NL/code text. | `query: str`, `table: str="java"`, `hybrid: bool=False`, `limit: int=5`, `offset: int=0`, `path_contains: str \| None`, `filter: NodeFilter \| str \| None` | `{"query":"join operator flow","limit":5}` |
| `find` | Locate nodes by structured filter. | `kind: "symbol"\|"route"\|"client"`, `filter: NodeFilter \| str`, `limit: int=25`, `offset: int=0` | `{"kind":"symbol","filter":{"role":"CONTROLLER"}}` |
| `describe` | Full record + edge counts for one node. For **type** symbols, `edge_summary` may include composed dot-keys (`DECLARES.DECLARES_CLIENT`, `DECLARES.EXPOSES`); for **method** symbols it may include override-axis virtual keys (`OVERRIDDEN_BY`, `OVERRIDDEN_BY.DECLARES_CLIENT`, `OVERRIDDEN_BY.EXPOSES`, `OVERRIDES`). See [`docs/AGENT-GUIDE.md`](./docs/AGENT-GUIDE.md) (`describe`). | `id: str` | `{"id":"sym:com.bank.chat.core.api.ChatController#joinOperator(JoinOperatorRequest)"}` |
| `resolve` | Identifier-shaped node lookup (symbol / route / client). Returns `status` `one`, `many`, or `none`; prefer it over `describe(fqn=…)` when an FQN may collide. See [`docs/AGENT-GUIDE.md`](./docs/AGENT-GUIDE.md) (`resolve`). | `identifier: str`, `hint_kind: "symbol"\|"route"\|"client" \| null` | `{"identifier":"com.bank.chat.core.api.ChatController","hint_kind":"symbol"}` |
| `neighbors` | One-hop walk. **Required**: `direction` and `edge_types`. | `ids: str \| list[str]`, `direction: "in"\|"out"`, `edge_types: list[str]`, `limit: int=25`, `offset: int=0`, `filter: NodeFilter \| str \| None` | `{"ids":"route:chat-core:POST:/chat/joinOperator","direction":"in","edge_types":["HTTP_CALLS","ASYNC_CALLS"]}` |

**`NodeFilter` notes:**
35 changes: 23 additions & 12 deletions docs/AGENT-GUIDE.md
@@ -3,7 +3,7 @@
> **How to use this file.** Copy the block between the `<!-- BEGIN/END
> java-codebase-rag MCP guide -->` markers below into your project's `QWEN.md`,
> `CLAUDE.md`, `AGENTS.md`, or equivalent. The block is self-contained:
> **four** MCP navigation tools, one shared **`NodeFilter`**, edge-type
> **five** MCP navigation tools, one shared **`NodeFilter`**, edge-type
> taxonomy, a forced reasoning preamble, a decision tree, a recovery
> playbook, and slash-style prompt aliases. Update by re-pulling from this
> repo when the ontology bumps.
@@ -31,7 +31,7 @@ This MCP indexes Java enterprise projects into two stores:
- **LanceDB** — vector + optional hybrid (FTS + vector) search over Java / SQL / YAML chunks.
- **Kuzu graph** — exact structure: **node kinds** `Symbol`, `Route`, `Client` and **nine edge types** (see *Edge taxonomy* below).

**MCP surface (navigation only):** `search`, `find`, `describe`, `neighbors`.
**MCP surface (navigation only):** `search`, `find`, `describe`, `neighbors`, `resolve`.

**Operator / diagnostics (not MCP):** use the **`java-codebase-rag`** CLI — lifecycle (`init`, `increment`, `reprocess`, `erase`) plus `meta`, `tables`, `diagnose-ignore`, `analyze-pr`. Rebuilds are slow; the coding agent should not pretend it can reindex via MCP. For lifecycle commands, subprocess progress is written to **stderr** (use **`--quiet`** to suppress it); **stdout** is only the structured result payload.

@@ -65,7 +65,7 @@ When a method carries **`@CodebaseHttpRoute`** or **`@CodebaseHttpClient`** (inc

**Workflow (GPS model):**

1. **Locate** — `search` (natural language / fragment) or `find` (structured `NodeFilter`).
1. **Locate** — `resolve` when you hold an identifier-shaped string; `search` (natural language / fragment) or `find` (structured `NodeFilter`) for discovery.
2. **Inspect** — `describe(id)` to see the full record and `edge_summary` (per stored edge label `in`/`out` counts, plus optional composed dot-keys for type Symbols — see `describe` below).
3. **Walk** — `neighbors` in a loop with explicit **`direction`** and **`edge_types`** until you have enough evidence. Multi-hop “trace” and “impact” are **your** reasoning, not a separate tool.
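
The three workflow steps above can be sketched as a single pass in Python. This is a minimal sketch under stated assumptions, not the server's client API: `call_tool` is a hypothetical transport callable that sends a tool name plus JSON arguments to the MCP server, and the response keys `node["id"]` and `edges` are assumptions about the payload shape.

```python
from typing import Any, Callable

# Hypothetical transport: (tool_name, arguments) -> decoded JSON response.
ToolCall = Callable[[str, dict[str, Any]], dict[str, Any]]


def locate_inspect_walk(
    call_tool: ToolCall,
    identifier: str,
    direction: str,
    edge_types: list[str],
) -> dict[str, Any]:
    """Run one locate -> inspect -> walk pass for a single identifier."""
    # 1. Locate: resolve handles identifier-shaped strings.
    located = call_tool("resolve", {"identifier": identifier})
    if located.get("status") != "one":
        # "many" or "none": the caller must disambiguate or fall back to search.
        return {"status": located.get("status"), "record": None, "edges": []}

    node_id = located["node"]["id"]  # assumed key layout

    # 2. Inspect: full record plus edge_summary.
    record = call_tool("describe", {"id": node_id})

    # 3. Walk: exactly one hop; direction and edge_types are always explicit.
    hop = call_tool(
        "neighbors",
        {"ids": node_id, "direction": direction, "edge_types": edge_types},
    )
    return {"status": "one", "record": record, "edges": hop.get("edges", [])}
```

Multi-hop tracing stays with the caller: feed the ids from the returned edges back into another `neighbors` call, one hop at a time.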

@@ -75,7 +75,7 @@ Before every MCP tool call, output **one short line**:

```
Q-class: <semantic | structured | inspect | walk>
Pick: <search|find|describe|neighbors> Why: <≤8 words>
Pick: <search|find|describe|neighbors|resolve> Why: <≤8 words>
```

Then check *Argument shapes* (real JSON arrays/objects, required `neighbors` fields). If the call returns nothing useful, do not thrash — use the **Recovery playbook**.
@@ -128,7 +128,7 @@ When reading or comparing symbols, method identity uses **FQN + signature**:
- Simple type names in parentheses (`String`, `List`), generics erased (`List<String>` → `List`).
- No spaces after commas. No-arg: `()`. Constructor: `#<init>(...)`.

Use `search` to recover the stored symbol id / FQN if you only have a simple name.
Use `resolve` when the simple name (or FQN fragment) is identifier-shaped; use `search` for fuzzy ranked discovery if you need chunk context before you have a stable id.

#### D. `neighbors` — required arguments

@@ -161,9 +161,13 @@ The same `http_method` key filters HTTP verbs on **routes** (server-side declare
- **`search.query` is not a DSL** — treat it as opaque text scored against the index. Structured predicates belong in `find`.
- **`neighbors` filters neighbor rows by kind** — the first neighbor whose kind rejects the filter fails the whole call (no per-row silent skip).

### Identifier resolution (pre-`resolve`)
### Identifier resolution

For identifier-shaped lookups without a stable graph id or exact symbol FQN, use **`search(query=…)`** for ranked candidates, then **`describe(id=…)`** (or `describe(fqn=…)` when you have an exact FQN) on each promising row until you confirm the right node. A dedicated **`resolve`** tool is planned separately; until it ships, this multi-call pattern is the supported fallback.
Use **`resolve(identifier=…, hint_kind=…)`** for identifier-shaped inputs: canonical ids (`sym:` / `route:` / `client:`), symbol FQN or suffix, HTTP `METHOD /path`, route path template, client `target_service`, or `target_service` + path prefix pair. Omit `hint_kind` to match across all three node kinds when the string is enough to scope generators.

Branch on **`status`** in the response: **`one`** → `describe(id=…)` on the returned `node`; **`many`** → inspect `candidates` (each has a closed **`reason`**, **`score`** for display only, and a **`NodeRef`** including `microservice` when the row has one), pick one, then `describe(id=…)`; **`none`** (well-formed miss) → fall back to **`search(query=…)`** for natural-language or fuzzy discovery. Malformed empty / whitespace identifiers return `success=false` first.

Prefer **`resolve` → `describe(id=…)`** over **`describe(fqn=…)`** when an FQN might collide: `describe(fqn=…)` still returns the first graph row and a hint, but `resolve` makes ambiguity explicit.
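
The branching on `status` described above could look like the following Python sketch. It assumes the response is a decoded JSON object with a top-level `success` flag, and that each candidate carries its `NodeRef` under a `node` key; both layouts are assumptions for illustration. The helper name `next_step` and the `"choose"` action are hypothetical.

```python
from typing import Any, Optional


def next_step(
    response: dict[str, Any],
    identifier: str,
    microservice: Optional[str] = None,
) -> tuple[str, dict[str, Any]]:
    """Map a resolve response to the next action as (action, payload)."""
    if not response.get("success", True):
        raise ValueError("resolve rejected the identifier (empty or whitespace?)")

    status = response.get("status")
    if status == "one":
        return "describe", {"id": response["node"]["id"]}

    if status == "many":
        candidates = response["candidates"]
        if microservice is not None:
            # Narrow by NodeRef.microservice when the caller knows the service.
            candidates = [
                c for c in candidates
                if c["node"].get("microservice") == microservice
            ]
        if len(candidates) == 1:
            return "describe", {"id": candidates[0]["node"]["id"]}
        # Still ambiguous: hand candidates back; score is display-only, so no re-ranking.
        return "choose", {"candidates": candidates}

    if status == "none":
        # Well-formed miss: fall back to fuzzy discovery.
        return "search", {"query": identifier}

    raise ValueError(f"unexpected resolve status: {status!r}")
```

Returning `"choose"` instead of silently taking the first candidate keeps the ambiguity explicit, which is the reason to prefer `resolve` over `describe(fqn=…)` in the first place.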

**`source_layer` vs `role`:** On **Client** nodes, `source_layer` records which brownfield or built-in layer produced the client declaration (`builtin`, `layer_a_meta`, `layer_b_ann`, `layer_c_source`, `layer_b_fqn`, …). On **Symbol** nodes, `role` is the inferred architectural stereotype (`CONTROLLER`, `SERVICE`, `REPO`, …). They answer different questions; names stay distinct.

@@ -175,12 +179,13 @@ Exact allowed values for roles, capabilities, client kinds, etc. live in `java_o

| User asks… | First step | Typical follow-up |
| ---------- | ---------- | ----------------- |
| Identifier-shaped string (FQN, `sym:` / `route:` / `client:` id, route path, client target, …) | `resolve` (optional `hint_kind`) | `describe` → `neighbors` |
| Fuzzy / NL “where is X” | `search` | `describe` → `neighbors` |
| All controllers in service S | `find(kind="symbol", filter={"microservice":S,"role":"CONTROLLER"})` | `neighbors` for `CALLS` / `EXPOSES` |
| List interfaces in service S | `find(kind="symbol", filter={"microservice":S,"symbol_kind":"interface"})` | `neighbors` / `describe` |
| List HTTP or Kafka entry points | `find(kind="route", filter={...})` | `describe` |
| List Feign / HTTP clients | `find(kind="client", filter={...})` | `neighbors(..., out, ["HTTP_CALLS"])` if needed |
| Who calls method M? | Resolve id via `search` or `find` | `neighbors(ids=sym_id, direction="in", edge_types=["CALLS"])` |
| Who calls method M? | Stable symbol id via `resolve`, `find`, or `search` | `neighbors(ids=sym_id, direction="in", edge_types=["CALLS"])` |
| What does M call? | Same | `neighbors(..., direction="out", edge_types=["CALLS"])` |
| Who hits this route? | `find(kind="route", ...)` or route id from logs | `neighbors(ids=route_id, direction="in", edge_types=["HTTP_CALLS","ASYNC_CALLS","EXPOSES"])` |
| Handler for a route | Have `route_id` | `neighbors(ids=route_id, direction="in", edge_types=["EXPOSES"])` |
@@ -193,10 +198,10 @@ Exact allowed values for roles, capabilities, client kinds, etc. live in `java_o

**Rules of thumb:**

1. **Graph beats vector for exact structural questions** — do not `search` for “who calls `Foo#bar`” if you can use `find` + `neighbors(in, [CALLS])`.
1. **Graph beats vector for exact structural questions** — do not `search` for “who calls `Foo#bar`” if you can use `resolve` / `find` + `neighbors(in, [CALLS])`.
2. **Vector beats graph for fuzzy discovery** — `search` first, then pivot to `describe` / `neighbors`.

### Tool reference — four tools
### Tool reference — five tools

#### `search`

@@ -214,7 +219,7 @@ Exact allowed values for roles, capabilities, client kinds, etc. live in `java_o
#### `describe`

- **Purpose:** Full node payload + `edge_summary`: `in` / `out` counts **per stored graph edge label** (what exists as edges in Kuzu). For **type** Symbols only (`class`, `interface`, `enum`, `record`, `annotation`), the same map may also include **describe-time composed** dot-keys — summaries of member edges, not stored labels — see the next bullets (`DECLARES.DECLARES_CLIENT`, `DECLARES.EXPOSES`); those keys are **not** valid in `neighbors(edge_types=…)`. For **method** Symbols, the map may include **override-axis** virtual keys (`OVERRIDDEN_BY`, `OVERRIDDEN_BY.DECLARES_CLIENT`, `OVERRIDDEN_BY.EXPOSES`, `OVERRIDES`); see **Override-axis keys (method Symbols)** below — also not `EdgeType` literals.
- **Args:** `id` (symbol, route, or client id) or **`fqn`** (exact symbol FQN when you do not have the graph id). When both are set, `id` wins. For ambiguous identifiers without an exact id/FQN, see **Identifier resolution (pre-`resolve`)** above.
- **Args:** `id` (symbol, route, or client id) or **`fqn`** (exact symbol FQN when you do not have the graph id). When both are set, `id` wins. For identifier-shaped inputs and FQN collision handling, see **Identifier resolution** above.

**Composed `edge_summary` keys (type Symbols).** Keys use dot notation: `<parent_relation>.<projected_relation>`. Two are emitted today:

@@ -237,6 +242,12 @@ Static methods suppress the entire override-axis rollup. Constructors do not rec

These keys are **not** valid `EdgeType` literals — `neighbors(edge_types=["OVERRIDDEN_BY"])` fails at the Pydantic boundary. Use them as hop affordances only.
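
One way to avoid that validation failure is to filter `edge_summary` keys before building a `neighbors` call. The sketch below assumes only that `edge_summary` is a mapping keyed by label. The helper name is hypothetical, and the filter rule (drop dotted keys and the two bare override-axis keys) is inferred from the key names listed in this guide, not taken from `java_ontology.py`.

```python
from typing import Any

# Bare override-axis keys named in the guide; they are not EdgeType literals.
OVERRIDE_AXIS_KEYS = {"OVERRIDDEN_BY", "OVERRIDES"}


def walkable_edge_types(edge_summary: dict[str, Any]) -> list[str]:
    """Return the edge_summary keys that look safe for neighbors(edge_types=...)."""
    return sorted(
        key
        for key in edge_summary
        # Dotted keys are describe-time rollups, never stored edge labels.
        if "." not in key and key not in OVERRIDE_AXIS_KEYS
    )
```

Treat the result as a candidate list only; the case-sensitive `EdgeType` values in `java_ontology.py` remain the source of truth.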

#### `resolve`

- **Purpose:** Identifier-shaped lookup across symbols, routes, and clients; returns `status` **`one`**, **`many`**, or **`none`** with optional `node` / `candidates` (see **Identifier resolution**).
- **Args:** `identifier` (string), optional `hint_kind` (`symbol` | `route` | `client`) to constrain generators.
- **Tip:** On **`many`**, use per-candidate `NodeRef.id` and `reason`; follow with **`describe(id=…)`**. On **`none`**, use **`search`** for fuzzy discovery.
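
A small client-side guard can catch malformed `resolve` arguments before they reach the server. This is a hedged sketch: `build_resolve_args` is a hypothetical helper, and trimming surrounding whitespace is a client-side choice made here, not documented server behaviour.

```python
from typing import Any, Optional

# The three hint_kind values listed in the Args bullet above.
HINT_KINDS = ("symbol", "route", "client")


def build_resolve_args(identifier: str, hint_kind: Optional[str] = None) -> dict[str, Any]:
    """Build JSON arguments for a resolve call, rejecting malformed input early."""
    cleaned = identifier.strip()
    if not cleaned:
        # The guide says empty / whitespace identifiers come back as success=false.
        raise ValueError("identifier must not be empty or whitespace-only")
    if hint_kind is not None and hint_kind not in HINT_KINDS:
        raise ValueError(f"hint_kind must be one of {HINT_KINDS} or None")

    args: dict[str, Any] = {"identifier": cleaned}
    if hint_kind is not None:
        # Omit the key entirely to match across all three node kinds.
        args["hint_kind"] = hint_kind
    return args
```

For example, `build_resolve_args("com.bank.chat.core.api.ChatController", "symbol")` produces the same argument object as the README example for `resolve`.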

#### `neighbors`

- **Purpose:** One hop over explicit edge types; returns **edges** with attributes (`confidence`, `strategy`, `match`, …) and the **`other`** node.
@@ -264,7 +275,7 @@ Source of truth: `java_ontology.py`. Strings are case-sensitive.
| ------- | ------------ | --- |
| `neighbors` validation error | Missing `direction` or `edge_types` | Add both explicitly |
| Empty `neighbors` | Wrong edge type for the node kind, or wrong direction | Check `describe.edge_summary`; `EXPOSES` is Symbol↔Route — direction matters |
| Cannot find symbol | Wrong id or stale index | `search` with distinctive string; verify `java-codebase-rag meta` (CLI) |
| Cannot find symbol | Wrong id or stale index | `resolve` / `search` with distinctive string; verify `java-codebase-rag meta` (CLI) |
| `find` returns too much | Over-broad filter | Add `microservice`, `fqn_prefix`, `path_prefix`, etc. |
| Route not found | Path mismatch | Use `path_prefix` on `find(kind="route", …)`; check README brownfield routes |
| Need ontology / rebuild / PR analysis | Wrong layer | Use **`java-codebase-rag`** CLI, not MCP |
2 changes: 1 addition & 1 deletion docs/JAVA-CODEBASE-RAG-CLI.md
@@ -1,6 +1,6 @@
# `java-codebase-rag` CLI — operator guide

The **`java-codebase-rag`** command is the **operator surface** for this bundle: index lifecycle (`init` / `increment` / `reprocess` / `erase`), graph and Lance health (`meta`, `tables`), ignore diagnostics, and PR diff analysis. It is **not** the MCP tool surface (that is `search` / `find` / `describe` / `neighbors` only). For agents driving the MCP server, see [`AGENT-GUIDE.md`](./AGENT-GUIDE.md).
The **`java-codebase-rag`** command is the **operator surface** for this bundle: index lifecycle (`init` / `increment` / `reprocess` / `erase`), graph and Lance health (`meta`, `tables`), ignore diagnostics, and PR diff analysis. It is **not** the MCP navigation surface (that is `search` / `find` / `describe` / `neighbors` / `resolve` on the MCP server — this CLI is lifecycle and introspection only). For agents driving the MCP server, see [`AGENT-GUIDE.md`](./AGENT-GUIDE.md).

## Install and discovery

4 changes: 2 additions & 2 deletions docs/MANUAL-VERIFICATION-CHECKLIST.md
@@ -4,8 +4,8 @@ Use this **after** you've read `README.md` + `CODEBASE_REQUIREMENTS.md`,
[`docs/AGENT-GUIDE.md`](./AGENT-GUIDE.md), applied any brownfield annotations,
and built the index against your real project. The checklist mixes **shell**
checks (`java-codebase-rag` CLI for graph health and Lance tables) with **MCP**
checks (`search` / `find` / `describe` / `neighbors` — the only navigation
tools).
checks (`search` / `find` / `describe` / `neighbors` / `resolve` — the MCP
navigation tools).

Each item has:

5 changes: 4 additions & 1 deletion docs/skills/java-codebase-explore.md
@@ -45,6 +45,8 @@ On any new estate, **enumerate before you search-hunt**:
2. **`find(kind="client", …)`** — list outbound clients (Feign, `RestTemplate`, Kafka, etc.) and targets.
3. Optionally **`find(kind="symbol", filter={"role":"CONTROLLER"})`** (or equivalent `NodeFilter`) to anchor web entrypoints.

**Identifier-shaped** strings (FQN, `sym:` / `route:` / `client:` id, route path, client target): start with **`resolve`**, then **`describe(id=…)`**. Use **`search`** / **`find`** for discovery when you do not have a concrete identifier yet — not as the primary chain for identifier disambiguation.

You cannot reason reliably about cross-service behaviour until these surfaces exist in your working mental model (or you have consciously fallen back to non-MCP discovery).

## Mission catalogue
@@ -218,8 +220,9 @@ disagreement as evidence of staleness, not as a contradiction.

## Cheat sheet (inline reference)

Four MCP tools:
Five MCP tools:

- `resolve(identifier, hint_kind)` — identifier-shaped lookup (`one` / `many` / `none`).
- `search(query, table, hybrid, limit, filter)` — fuzzy locate.
- `find(kind, filter, limit)` — structured listing; `filter` is required.
- `describe(id)` — full node + `edge_summary`.