docs: agent guide + manual verification checklist (calibrated on bank-chat-system) by HumanBean17 · Pull Request #35 · HumanBean17/java-codebase-rag

HumanBean17 · 2026-05-06T09:47:48Z

Summary

This PR adds two docs that make this MCP usable on real enterprise Java projects, especially with weaker LLMs (Qwen Code, Sonnet-mini, etc.) that previously got confused when selecting and invoking tools. They build on the work done in PRs #25–#33.

`docs/AGENT-GUIDE.md` (387 lines)

A block to copy-paste into QWEN.md / CLAUDE.md (delimited with markers), containing:

- Forced reasoning preamble: every tool call gets a one-line Q-class: + Pick: + Why: prefix. Cheap scaffolding that dramatically improves tool selection on mid-sized models.
- Decision tree: user intent → first tool → typical follow-up. Headlines: graph beats vector for exact questions; vector beats graph for fuzzy ones.
- Full reference for all 22 MCP tools, grouped Search · Symbols · Routes · Calls · Roles/Caps · Behavioural · Index mgmt. Each tool: 1-line purpose, required args, common mistakes (⚠), 1 minimal example.
- Ontology glossary (v9): verbatim enum values for roles, capabilities, VALID_ROUTE_FRAMEWORKS, VALID_ROUTE_KINDS, VALID_CLIENT_KINDS, VALID_CALL_MATCHES, sourced from java_ontology.py.
- Recovery playbook: 7 common failure modes (zero callers, missing route, wrong role, no cross-service edges, …) mapped to concrete fixes.
- Slash-style aliases: /who-calls, /calls-from, /route, /handler, /who-hits, /why-no-route, /role-of, /impact, /cross-service, /flow, /diff-risk, /health. These are prompt templates, not real commands, but they reliably steer weak models.
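The forced reasoning preamble lends itself to a mechanical check. The following is a minimal sketch, not the guide's own tooling: it assumes the three labels appear on one line in the order Q-class:, Pick:, Why:, optionally separated by `|`, `;` or `+`. The separator set, the function name `parse_preamble`, and the sample sentence are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed layout: "Q-class: <text> | Pick: <tool> | Why: <text>".
# The separators (| ; +) are a guess; only the three labels come from the PR text.
PREAMBLE = re.compile(
    r"^Q-class:\s*(?P<q_class>.+?)\s*[|;+]?\s*"
    r"Pick:\s*(?P<pick>\S+)\s*[|;+]?\s*"
    r"Why:\s*(?P<why>.+)$"
)


def parse_preamble(line: str) -> Optional[dict]:
    """Return the three preamble fields, or None if the line is not a preamble."""
    match = PREAMBLE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = "Q-class: exact call graph | Pick: find_callers | Why: method is known, callers are not"
    print(parse_preamble(sample))
    print(parse_preamble("find_callers please"))
```

A harness could reject or re-prompt any tool call whose preceding line fails this check, which keeps the scaffolding honest without touching the MCP server.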

`docs/MANUAL-VERIFICATION-CHECKLIST.md` (636 lines)

Agent-driven manual test plan to run after indexing your real project. 7 phases, ~24 items total:

| Phase | Items | Focus |
| --- | --- | --- |
| 1. Index health | 4 | `ontology_version=9`, parse error rate, symbol counts, LanceDB tables |
| 2. Roles & capabilities | 5 | Controllers, services, Feign clients, listeners/producers, OTHER fraction |
| 3. Routes | 4 | Framework distribution, controllers expose routes, paths/methods non-empty, Kafka topics |
| 4. Call graph | 3 | `find_callers` matches IDE "Find Usages", end-to-end chain via `find_callees`, phantom rate |
| 5. Cross-service edges | 3 | `HTTP_CALLS` resolution, `ASYNC_CALLS` producer↔listener, `cross_service_resolution` flag |
| 6. Semantic search | 2 | Concept queries return relevant chunks, `auto_hybrid` for identifiers |
| 7. Brownfield overrides | 3 | `@CodebaseRole` flips role, `@CodebaseRoute` registers route, `@CodebaseClient` creates outbound edge |

Each item has:

- ☐ checkbox
- Copy-paste verification prompt (the agent calls MCP tools and reports back)
- Expected (calibration) column with concrete numbers from tests/bank-chat-system
- If failing → fix pointer

Plus pre-flight (build the graph), per-phase "red flags" sections, and post-completion guidance.

`README.md` (+11 lines)

Added "Driving this MCP from an agent" callout right under the existing "Tuning for your codebase" callout, linking both new docs.

Calibration source

tests/bank-chat-system indexed with master @ d62b48c (post PR-H1, ontology version 9):

ontology_version: 9
counts: 84 files, 92 types, 474 members, 17 routes, 793 calls, 2 HTTP_CALLS, 5 ASYNC_CALLS
parse_errors: 0
routes_by_framework: {spring_mvc: 9, kafka: 2}
roles: CONTROLLER=5, SERVICE=7, COMPONENT=22, ENTITY=10, DTO=9, CONFIG=4, OTHER=43
capabilities: MESSAGE_LISTENER=2, MESSAGE_PRODUCER=2, SCHEDULED_TASK=1
microservices: chat-core (55 types), chat-assign (31 types)

Tests

290 passed, 4 skipped in 52.56s — no code change, baseline unchanged.

Future work (not in this PR)

- Once usage patterns shake out from real enterprise project rollouts, split AGENT-GUIDE.md into per-skill files (skills/codebase-search.md, skills/route-investigation.md, …) that an orchestrator pulls in on demand. The monolith is the right starting point because it is easier to copy-paste into a single QWEN.md.
- Consider auto-generating the tool reference section from server.py introspection (the name=, description=, and Field() blocks) so it never drifts from the actual tool surface.
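To make the second idea concrete, here is a minimal sketch of the rendering half only. It assumes the tool metadata has already been pulled out of server.py into plain records; how to do that extraction is left open. The `ToolSpec` record, `render_tool_reference`, and the placeholder descriptions are hypothetical and do not reflect the actual server.py structure.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical record for one MCP tool; not the shape used by server.py."""
    name: str
    description: str
    required_args: list = field(default_factory=list)


def render_tool_reference(tools) -> str:
    """Render one markdown bullet per tool, sorted by name for stable diffs."""
    lines = []
    for tool in sorted(tools, key=lambda t: t.name):
        args = ", ".join(f"`{arg}`" for arg in tool.required_args) or "none"
        lines.append(f"- `{tool.name}`: {tool.description} Required args: {args}.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Descriptions are placeholders, not the real tool descriptions.
    specs = [
        ToolSpec("find_callees", "Placeholder description.", ["fqn_or_signature"]),
        ToolSpec("codebase_search", "Placeholder description."),
    ]
    print(render_tool_reference(specs))
```

Sorting by name keeps the generated section stable between runs, so a docs diff only appears when the tool surface actually changes.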

…-chat-system)

Two new docs targeting real-world rollout on enterprise Java projects:

docs/AGENT-GUIDE.md: copy-paste-into-QWEN.md/CLAUDE.md block.
- Forced reasoning preamble (Q-class + Pick + Why on every call)
- Decision tree mapping user intent to first tool
- Full reference for all 22 MCP tools with required args, common mistakes, minimal examples
- Ontology glossary (v9) with verbatim enum values for roles, capabilities, route_framework, route_kind, client_kind, call_match
- Recovery playbook (7 common failure modes -> fix)
- Slash-style prompt aliases (/who-calls, /flow, /impact, /diff-risk, …)

Engineered for weak / mid models that otherwise pick the wrong tool (e.g. semantic search for exact-call-graph questions, simple names where FQNs are required).

docs/MANUAL-VERIFICATION-CHECKLIST.md: agent-driven manual test plan.
- 7 phases: index health, roles & capabilities, routes, call graph, cross-service edges, semantic search, brownfield overrides
- Each item: ☐ checkbox · copy-paste verification prompt · expected output (calibrated on tests/bank-chat-system, ontology v9) · fix
- Pre-flight build instructions and post-completion guidance
- Calibration source pinned: master @ d62b48c, 84 files / 92 types / 17 routes / 793 calls / 2 HTTP_CALLS / 5 ASYNC_CALLS

README.md: added 'Driving this MCP from an agent' callout linking both docs.

Test baseline unchanged: 290 passed, 4 skipped.

Per real-world feedback: weak models burn calls on argument-shape mistakes more than tool-selection mistakes. Two specific failures observed in the wild:

1. Stringified arrays. Agent passes "[\"DTO\",\"ENTITY\"]" instead of ["DTO","ENTITY"] for exclude_roles (and similar list params), tripping FastMCP/Pydantic validation.
2. Overloaded method needles. Agent passes Foo#bar() for an overloaded method that has Foo#bar(String) too, and gets back only the no-arg match, interpreting empty/short results as 'method not used'.

New section 'Argument shapes — what the parser actually wants' inserted between the forced reasoning preamble and the decision tree, covering:

§A. JSON, not stringified JSON. Right vs wrong table for every list and primitive param. One-line rule: if the schema says list[str], send a JSON array; if it says str, send a JSON string.

§B. Method needles. Exact FQN format spec (simple type names only, generics erased, no spaces, <init> for ctors, dot-separated nested types). Verbatim examples copied from the bank-chat-system index. Three needle shapes ranked by precision. Explicit overload-recovery recipe (drop parens to list overloads, or use type FQN to fan out via DECLARES, or recover the FQN via codebase_search).

§C. Path templates. The normalised servlet form vs the raw annotation value. With a fallback recipe (list_routes path_prefix → copy path).

Also:
- Forced reasoning preamble now nudges 'sanity-check arguments before issuing the call'.
- find_callers / find_callees tool reference: corrected the example (was com.foo.Bar#baz(java.lang.String), which is wrong because parens take simple type names) and cross-linked to §B.
- Recovery playbook: 3 new rows (overload mismatch, stringified JSON, raw path_template).
- Slash aliases: /who-calls and /calls-from now explicitly require fqn-with-sig and document the simple-name fallback.

Test baseline unchanged: 290 passed, 4 skipped.

HumanBean17 · 2026-05-06T10:16:24Z

Follow-up: argument-shape failures (`exclude_roles` stringified, overloaded method needles)

Real-world feedback after the first attempt on a real Java project surfaced two specific failure modes that the guide didn't call out clearly:

1. Stringified arrays

Agent passes "[\"DTO\",\"ENTITY\"]" instead of ["DTO","ENTITY"] for exclude_roles. FastMCP/Pydantic rejects it. Weak models over-quote defensively.

2. Overloaded method needles

Agent calls find_callees({"fqn_or_signature":"Foo#bar()"}) for an overloaded method where the real call site is Foo#bar(String). Gets the no-arg variant only (or empty). Interprets the empty result as "method not used" and moves on. Not a bug — the resolver is doing its job — but the guide didn't explain how to recover.

Fix in commit `1be95e1`

New section "Argument shapes — what the parser actually wants" between the forced reasoning preamble and the decision tree, with three subsections:

§A. JSON, not stringified JSON

Right-vs-wrong table for every problem param:

| Param | ✅ Right | ❌ Wrong |
| --- | --- | --- |
| `exclude_roles` | `["DTO","ENTITY","CONFIG","OTHER"]` | `"[\"DTO\",\"ENTITY\",\"CONFIG\",\"OTHER\"]"` |
| `edge_types` | `["EXTENDS","IMPLEMENTS"]` | `"EXTENDS,IMPLEMENTS"` |
| `confirm` / `limit` / `min_confidence` | `true` / `20` / `0.9` | `"true"` / `"20"` / `"0.9"` |
| string enums | `"CONTROLLER"` | `["CONTROLLER"]` (single value, not list) |

Plus the one-line rule: if the schema says list[str], send a JSON array; if it says str, send a JSON string.
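The rule above can be turned into a pre-flight check that an agent harness runs before issuing a call. This is a minimal stdlib sketch, not how FastMCP or Pydantic validate: the `diagnose` helper and the schema labels it accepts (`"list[str]"`, `"str"`, `"bool"`, `"int"`, `"float"`) are assumptions made for illustration.

```python
import json


def _matches(value, schema_type: str) -> bool:
    """True if a decoded JSON value has the shape the (assumed) schema label names."""
    if schema_type == "list[str]":
        return isinstance(value, list) and all(isinstance(v, str) for v in value)
    if schema_type == "str":
        return isinstance(value, str)
    if schema_type == "bool":
        return isinstance(value, bool)
    if schema_type == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if schema_type == "float":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    raise ValueError(f"unsupported schema type: {schema_type}")


def diagnose(value, schema_type: str) -> str:
    """Classify an argument as ok, stringified JSON, or simply the wrong shape."""
    if _matches(value, schema_type):
        return "ok"
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            decoded = None
        if decoded is not None and _matches(decoded, schema_type):
            return "stringified JSON: send the decoded value instead"
    return f"wrong shape: expected {schema_type}"


if __name__ == "__main__":
    print(diagnose(["DTO", "ENTITY"], "list[str]"))
    print(diagnose('["DTO","ENTITY"]', "list[str]"))
    print(diagnose("EXTENDS,IMPLEMENTS", "list[str]"))
```

Separating "stringified" from "wrong shape" matters for recovery: the first can be fixed by decoding once, while the second (such as a comma-joined string) needs the agent to rebuild the argument.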

§B. Method needles — FQN + signature, with simple type names

Exact format spec verified against the live bank-chat-system index:

<package>.<Type>[.<NestedType>]#<methodName>(<SimpleType1>,<SimpleType2>,…)

Key rules surfaced explicitly:

- Simple type names only (`String`, not `java.lang.String`)
- Generics erased (`List<String>` → `List`)
- No spaces after commas
- `<init>` for constructors
- Nested types dot-separated under the outer type
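The rules above are mechanical enough to apply in code before a needle is sent. The sketch below is an illustration under assumptions, not the indexer's own normaliser: `normalize_needle` is a hypothetical helper, it only rewrites the parameter list, and it deliberately refuses varargs because the PR does not say how those are stored.

```python
def _erase_generics(text: str) -> str:
    """Drop everything inside <...>, including nested type arguments."""
    out, depth = [], 0
    for ch in text:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def normalize_needle(needle: str) -> str:
    """Rewrite the parameter list to simple names, no generics, no spaces."""
    head, paren, rest = needle.partition("(")
    if not paren:
        return needle.strip()  # no signature given, nothing to normalise
    params = rest.rsplit(")", 1)[0]
    if "..." in params:
        # Assumption: varargs storage is unknown, so this sketch does not guess.
        raise ValueError("varargs are not covered by this sketch")
    # Only the parameter list is touched, so "<init>" in the head survives.
    params = _erase_generics(params)
    simple = [p.strip().rsplit(".", 1)[-1] for p in params.split(",") if p.strip()]
    return f"{head.strip()}({','.join(simple)})"


if __name__ == "__main__":
    print(normalize_needle("com.foo.Bar#baz(java.lang.String)"))
    print(normalize_needle("com.foo.Bar#baz(java.util.Map<String, java.util.List<Integer>>, int)"))
    print(normalize_needle("com.foo.Bar#<init>()"))
```

Erasing generics before splitting on commas is the important ordering: a type such as `Map<String, List<Integer>>` contains a comma that would otherwise be mistaken for a parameter boundary.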

Verbatim calibration examples lifted from the fixture's actual stored FQNs.

Overload-recovery recipe — three options when you don't know the exact signature:

1. Drop the parens entirely → simple-name lookup matches all overloads → pick the right one and re-query with full FQN+sig
2. Pass the type FQN → fans out to every method via DECLARES
3. codebase_search with auto_hybrid=true → recover the exact stored FQN
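The three options can be written down as an ordered list of fallback calls to try when a needle with a signature returns nothing. This is a sketch under stated assumptions: `overload_fallbacks` is a hypothetical helper, "drop the parens" is read as stripping the `(…)` suffix, the type FQN is assumed to go to the same tool as the original call, and the `query` argument name for codebase_search is a guess (only `fqn_or_signature` and `auto_hybrid` appear in this PR).

```python
def overload_fallbacks(needle: str, tool: str = "find_callees") -> list:
    """Return (tool_name, arguments) pairs to try, in the order listed above."""
    head = needle.split("(", 1)[0].strip()      # needle without the (...) suffix
    type_fqn, _, method = head.partition("#")   # split "pkg.Type#method"
    return [
        # 1. No parens: should match every overload of the method.
        (tool, {"fqn_or_signature": head}),
        # 2. Type FQN only: fans out to every declared method.
        (tool, {"fqn_or_signature": type_fqn}),
        # 3. Search for the stored FQN; "query" is an assumed parameter name.
        ("codebase_search", {"query": method, "auto_hybrid": True}),
    ]


if __name__ == "__main__":
    for name, args in overload_fallbacks("com.foo.Bar#baz(String)"):
        print(name, args)
```

Note that `auto_hybrid` is passed as a real boolean rather than the string `"true"`, in line with §A.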

Plus a "How to find the FQN you need" subsection (how to extract FQNs from codebase_search/list_by_role/find_implementors results) and a warning to never pass phantom-row needles like ?HashMap<>#<init>(0).

§C. Path templates — the normalised servlet form

Right-vs-wrong table for get_route_by_path / find_route_callers path_template arg, including the concatenation rule for class-level @RequestMapping + method-level @GetMapping.
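As an illustration of the concatenation rule, the sketch below joins a class-level prefix with a method-level path. It assumes the normalised form has exactly one leading slash, no trailing slash, and a single slash at the join; the PR does not spell the normalisation out, so treat `join_path_template` and the example paths as hypothetical.

```python
def join_path_template(class_prefix: str, method_path: str) -> str:
    """Join class-level and method-level mapping values into one template.

    Assumed normal form: one leading slash, no trailing slash,
    exactly one slash between the two parts.
    """
    parts = [p.strip("/") for p in (class_prefix, method_path) if p and p.strip("/")]
    return "/" + "/".join(parts)


if __name__ == "__main__":
    # Made-up paths, not taken from the bank-chat-system fixture.
    print(join_path_template("/api/chats/", "/{id}"))
    print(join_path_template("", "health"))
```

If the joined value still does not match, the fallback the commit message describes (list routes by path prefix, then copy the stored path) avoids guessing at the normal form altogether.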

Other touch-ups

- Forced reasoning preamble now nudges "sanity-check arguments before issuing the call", since most weak-model failures here are wrong-argument-shape, not wrong-tool-choice.
- find_callers / find_callees tool reference: fixed an incorrect example (was com.foo.Bar#baz(java.lang.String), which is wrong because parens take simple types only) and cross-linked to §B.
- Recovery playbook: 3 new rows mapping these failure symptoms straight to fixes.
- Slash aliases: /who-calls and /calls-from now explicitly say fqn-with-sig and document the simple-name fallback inline.

Diff

docs/AGENT-GUIDE.md | 134 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 127 insertions(+), 7 deletions(-)

Test baseline unchanged: 290 passed, 4 skipped.

Calibration source for the FQN examples: tests/bank-chat-system index built on master @ d62b48c.

HumanBean17 added 2 commits May 6, 2026 09:20

HumanBean17 merged commit ea353c6 into master May 6, 2026

HumanBean17 mentioned this pull request May 6, 2026

propose: v2 brownfield annotations — split @CodebaseRoute, enum-typed kinds #36

Merged

HumanBean17 deleted the docs/agent-guide-and-verification branch May 10, 2026 21:18

