docs: agent guide + manual verification checklist (calibrated on bank-chat-system) #35

Merged
HumanBean17 merged 2 commits into master from docs/agent-guide-and-verification on May 6, 2026

@HumanBean17
Owner

Summary

Two new docs to make this MCP usable on real enterprise Java projects, especially with weaker LLMs (Qwen Code, Sonnet-mini, etc.) that previously got confused when selecting and invoking tools. They build on the work done in PRs #25 through #33.

docs/AGENT-GUIDE.md (387 lines)

Copy-paste-into-QWEN.md / CLAUDE.md block (delimited with <!-- BEGIN/END user-rag MCP guide --> markers) containing:

  • Forced reasoning preamble — every tool call gets a one-line Q-class: + Pick: + Why: prefix. Cheap scaffolding that dramatically improves tool selection on mid-sized models.
  • Decision tree — user intent → first tool → typical follow-up. Headlines: graph beats vector for exact questions; vector beats graph for fuzzy ones.
  • Full reference for all 22 MCP tools — grouped Search · Symbols · Routes · Calls · Roles/Caps · Behavioural · Index mgmt. Each tool: 1-line purpose, required args, common mistakes (⚠), 1 minimal example.
  • Ontology glossary (v9) — verbatim enum values for roles, capabilities, VALID_ROUTE_FRAMEWORKS, VALID_ROUTE_KINDS, VALID_CLIENT_KINDS, VALID_CALL_MATCHES, sourced from java_ontology.py.
  • Recovery playbook — 7 common failure modes (zero callers, missing route, wrong role, no cross-service edges, …) mapped to concrete fixes.
  • Slash-style aliases: /who-calls, /calls-from, /route, /handler, /who-hits, /why-no-route, /role-of, /impact, /cross-service, /flow, /diff-risk, /health. These are prompt templates, not real commands, but they reliably steer weak models.
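
The forced reasoning preamble can be checked mechanically before a tool call is accepted. The following is a minimal sketch, assuming a one-line layout with " | " separators; only the field names (Q-class, Pick, Why) come from the guide, while the separator and the helper names `format_preamble` and `parse_preamble` are illustrative.

```python
import re
from typing import Optional

# Assumed layout: three "Key: value" fields on one line, separated by " | ".
# Only the field names come from the guide; the separator is an assumption.
PREAMBLE_RE = re.compile(
    r"^Q-class: (?P<q_class>.+?) \| Pick: (?P<pick>\S+) \| Why: (?P<why>.+)$"
)


def format_preamble(q_class: str, pick: str, why: str) -> str:
    """Build the one-line reasoning prefix emitted before a tool call."""
    return f"Q-class: {q_class} | Pick: {pick} | Why: {why}"


def parse_preamble(line: str) -> Optional[dict]:
    """Return the three fields, or None if the line is not a well-formed preamble."""
    match = PREAMBLE_RE.match(line.strip())
    return match.groupdict() if match else None


line = format_preamble(
    "exact call graph", "find_callers", "needs graph edges, not similarity"
)
print(line)
print(parse_preamble(line))
```

A wrapper around the agent could reject any tool call whose preceding line fails `parse_preamble`, which forces the model to name the tool it picked before invoking it.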

docs/MANUAL-VERIFICATION-CHECKLIST.md (636 lines)

Agent-driven manual test plan to run after indexing your real project. 7 phases, ~24 items total:

| Phase | Items | Focus |
|---|---|---|
| 1. Index health | 4 | ontology_version=9, parse error rate, symbol counts, LanceDB tables |
| 2. Roles & capabilities | 5 | Controllers, services, Feign clients, listeners/producers, OTHER fraction |
| 3. Routes | 4 | Framework distribution, controllers expose routes, paths/methods non-empty, Kafka topics |
| 4. Call graph | 3 | find_callers matches IDE "Find Usages", end-to-end chain via find_callees, phantom rate |
| 5. Cross-service edges | 3 | HTTP_CALLS resolution, ASYNC_CALLS producer↔listener, cross_service_resolution flag |
| 6. Semantic search | 2 | Concept queries return relevant chunks, auto_hybrid for identifiers |
| 7. Brownfield overrides | 3 | @CodebaseRole flips role, @CodebaseRoute registers route, @CodebaseClient creates outbound edge |

Each item has:

  • ☐ checkbox
  • Copy-paste verification prompt (the agent calls MCP tools and reports back)
  • Expected (calibration) column with concrete numbers from tests/bank-chat-system
  • If failing → fix pointer

Plus pre-flight (build the graph), per-phase "red flags" sections, and post-completion guidance.

README.md (+11 lines)

Added "Driving this MCP from an agent" callout right under the existing "Tuning for your codebase" callout, linking both new docs.

Calibration source

tests/bank-chat-system indexed with master @ d62b48c (post PR-H1, ontology version 9):

ontology_version: 9
counts: 84 files, 92 types, 474 members, 17 routes, 793 calls, 2 HTTP_CALLS, 5 ASYNC_CALLS
parse_errors: 0
routes_by_framework: {spring_mvc: 9, kafka: 2}
roles: CONTROLLER=5, SERVICE=7, COMPONENT=22, ENTITY=10, DTO=9, CONFIG=4, OTHER=43
capabilities: MESSAGE_LISTENER=2, MESSAGE_PRODUCER=2, SCHEDULED_TASK=1
microservices: chat-core (55 types), chat-assign (31 types)
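
The calibration numbers above lend themselves to a quick automated sanity check. This is a sketch only: the dict layout and the helpers `other_fraction` and `check_index_health` are assumptions for illustration, not the output format of any MCP tool.

```python
# Values copied from the calibration listing; the dict shape is an assumption.
stats = {
    "ontology_version": 9,
    "parse_errors": 0,
    "roles": {
        "CONTROLLER": 5, "SERVICE": 7, "COMPONENT": 22, "ENTITY": 10,
        "DTO": 9, "CONFIG": 4, "OTHER": 43,
    },
}


def other_fraction(roles: dict) -> float:
    """Share of role-tagged types that fell through to OTHER."""
    total = sum(roles.values())
    return roles.get("OTHER", 0) / total if total else 0.0


def check_index_health(index_stats: dict, expected_version: int = 9) -> list:
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if index_stats["ontology_version"] != expected_version:
        problems.append(f"ontology_version is {index_stats['ontology_version']}")
    if index_stats["parse_errors"] != 0:
        problems.append(f"{index_stats['parse_errors']} parse errors")
    return problems


print(check_index_health(stats))
print(f"OTHER fraction: {other_fraction(stats['roles']):.2f}")
```

On the fixture the health check returns no problems. The OTHER fraction is the number to watch when a real project is indexed, since a much higher value than the fixture's suggests roles are not being recognised.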

Tests

290 passed, 4 skipped in 52.56s — no code change, baseline unchanged.

Future work (not in this PR)

  • Once usage patterns shake out from real enterprise project rollouts, split AGENT-GUIDE.md into per-skill files (skills/codebase-search.md, skills/route-investigation.md, …) that an orchestrator pulls in on demand. The monolith is the right starting point — easier to copy-paste into a single QWEN.md.
  • Consider auto-generating the tool reference section from server.py introspection (the name=, description=, and Field() blocks) so it never drifts from the actual tool surface.
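
The auto-generation idea could look roughly like the sketch below. It assumes tools are plain Python functions with docstrings and uses only `inspect` from the standard library; how server.py actually registers tools (the name=, description=, and Field() blocks) is not reproduced here, and the `find_callers` signature shown is a hypothetical stand-in.

```python
import inspect


def find_callers(fqn_or_signature: str, limit: int = 20) -> list:
    """List call sites that invoke the given method."""
    return []  # hypothetical stand-in; the real tool queries the graph


def render_tool_reference(tool) -> str:
    """Render one tool as a short reference entry: purpose, required and optional args."""
    params = inspect.signature(tool).parameters
    empty = inspect.Parameter.empty
    required = [name for name, p in params.items() if p.default is empty]
    optional = [f"{name}={p.default!r}" for name, p in params.items() if p.default is not empty]
    doc = inspect.getdoc(tool) or ""
    purpose = doc.splitlines()[0] if doc else "(no description)"
    return "\n".join([
        f"{tool.__name__}: {purpose}",
        f"  required: {', '.join(required) or 'none'}",
        f"  optional: {', '.join(optional) or 'none'}",
    ])


print(render_tool_reference(find_callers))
```

Generating the section from the function signatures means a renamed or added argument shows up in the guide on the next build rather than after someone notices the drift.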

…-chat-system)

Two new docs targeting real-world rollout on enterprise Java projects:

docs/AGENT-GUIDE.md — copy-paste-into-QWEN.md/CLAUDE.md block.
  - Forced reasoning preamble (Q-class + Pick + Why on every call)
  - Decision tree mapping user intent to first tool
  - Full reference for all 22 MCP tools with required args, common
    mistakes, minimal examples
  - Ontology glossary (v9) with verbatim enum values for roles,
    capabilities, route_framework, route_kind, client_kind, call_match
  - Recovery playbook (7 common failure modes -> fix)
  - Slash-style prompt aliases (/who-calls, /flow, /impact, /diff-risk, …)

  Engineered for weak / mid models that otherwise pick the wrong tool
  (e.g. semantic search for exact-call-graph questions, simple names
  where FQNs are required).

docs/MANUAL-VERIFICATION-CHECKLIST.md — agent-driven manual test plan.
  - 7 phases: index health, roles & capabilities, routes, call graph,
    cross-service edges, semantic search, brownfield overrides
  - Each item: ☐ checkbox · copy-paste verification prompt · expected
    output (calibrated on tests/bank-chat-system, ontology v9) · fix
  - Pre-flight build instructions and post-completion guidance
  - Calibration source pinned: master @ d62b48c, 84 files / 92 types /
    17 routes / 793 calls / 2 HTTP_CALLS / 5 ASYNC_CALLS

README.md — added 'Driving this MCP from an agent' callout linking
both docs.

Test baseline unchanged: 290 passed, 4 skipped.

Per real-world feedback: weak models burn calls on argument-shape
mistakes more than tool-selection mistakes. Two specific failures
observed in the wild:

1. Stringified arrays. Agent passes "[\"DTO\",\"ENTITY\"]" instead of
   ["DTO","ENTITY"] for exclude_roles (and similar list params),
   tripping FastMCP/Pydantic validation.

2. Overloaded method needles. Agent passes Foo#bar() for an overloaded
   method that has Foo#bar(String) too, and gets back only the no-arg
   match — interpreting empty/short results as 'method not used'.

New section 'Argument shapes — what the parser actually wants' inserted
between the forced reasoning preamble and the decision tree, covering:

  §A. JSON, not stringified JSON. Right vs wrong table for every list
      and primitive param. One-line rule: if the schema says list[str],
      send a JSON array; if it says str, send a JSON string.

  §B. Method needles. Exact FQN format spec (simple type names only,
      generics erased, no spaces, <init> for ctors, dot-separated nested
      types). Verbatim examples copied from the bank-chat-system index.
      Three needle shapes ranked by precision. Explicit overload-recovery
      recipe (drop parens to list overloads, or use type FQN to fan out
      via DECLARES, or recover the FQN via codebase_search).

  §C. Path templates. The normalised servlet form vs the raw annotation
      value. With a fallback recipe (list_routes path_prefix → copy path).

Also:
- Forced reasoning preamble now nudges 'sanity-check arguments before
  issuing the call'.
- find_callers / find_callees tool reference: corrected the example
  (was com.foo.Bar#baz(java.lang.String) — wrong, parens take simple
  type names) and cross-linked to §B.
- Recovery playbook: 3 new rows (overload mismatch, stringified JSON,
  raw path_template).
- Slash aliases: /who-calls and /calls-from now explicitly require
  fqn-with-sig and document the simple-name fallback.

Test baseline unchanged: 290 passed, 4 skipped.
@HumanBean17
Owner Author

Follow-up: argument-shape failures (exclude_roles stringified, overloaded method needles)

Real-world feedback after the first attempt on a real Java project — two specific failure modes that the guide didn't call out clearly:

1. Stringified arrays

Agent passes "[\"DTO\",\"ENTITY\"]" instead of ["DTO","ENTITY"] for exclude_roles. FastMCP/Pydantic rejects it. Weak models over-quote defensively.

2. Overloaded method needles

Agent calls find_callees({"fqn_or_signature":"Foo#bar()"}) for an overloaded method where the real call site is Foo#bar(String). Gets the no-arg variant only (or empty). Interprets the empty result as "method not used" and moves on. Not a bug — the resolver is doing its job — but the guide didn't explain how to recover.

Fix in commit 1be95e1

New section "Argument shapes — what the parser actually wants" between the forced reasoning preamble and the decision tree, with three subsections:

§A. JSON, not stringified JSON

Right-vs-wrong table for every problem param:

| Param | ✅ Right | ❌ Wrong |
|---|---|---|
| exclude_roles | ["DTO","ENTITY","CONFIG","OTHER"] | "[\"DTO\",\"ENTITY\",\"CONFIG\",\"OTHER\"]" |
| edge_types | ["EXTENDS","IMPLEMENTS"] | "EXTENDS,IMPLEMENTS" |
| confirm / limit / min_confidence | true / 20 / 0.9 | "true" / "20" / "0.9" |
| string enums | "CONTROLLER" | ["CONTROLLER"] (single value, not list) |

Plus the one-line rule: if the schema says list[str], send a JSON array; if it says str, send a JSON string.
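
The same rule can be applied defensively on the client side before a call is sent. The sketch below is not part of the MCP server: the `SCHEMA` mapping and `unstringify_args` are hypothetical, and the `role` parameter name is an assumed example of a string enum. It decodes values that were JSON-encoded one time too many and leaves everything else alone.

```python
import json
from typing import Any

# Hypothetical schema: expected Python type per parameter (an assumption).
SCHEMA = {
    "exclude_roles": list,
    "edge_types": list,
    "confirm": bool,
    "limit": int,
    "min_confidence": float,
    "role": str,  # assumed name for a string-enum parameter
}


def _matches(value: Any, expected: type) -> bool:
    """Type check that does not let bool pass as int or float."""
    if isinstance(value, bool):
        return expected is bool
    if expected is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def unstringify_args(args: dict) -> dict:
    """Decode string values whose schema type is not str, when they parse cleanly."""
    fixed = {}
    for name, value in args.items():
        expected = SCHEMA.get(name)
        if isinstance(value, str) and expected not in (None, str):
            try:
                decoded = json.loads(value)
            except json.JSONDecodeError:
                decoded = value  # not JSON (e.g. "EXTENDS,IMPLEMENTS"); leave as-is
            if _matches(decoded, expected):
                value = decoded
        fixed[name] = value
    return fixed


print(unstringify_args({"exclude_roles": "[\"DTO\",\"ENTITY\"]", "limit": "20"}))
```

Note that a comma-joined string such as "EXTENDS,IMPLEMENTS" is deliberately not repaired: it is not valid JSON, so it passes through unchanged and the server's own validation still reports it.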

§B. Method needles — FQN + signature, with simple type names

Exact format spec verified against the live bank-chat-system index:

<package>.<Type>[.<NestedType>]#<methodName>(<SimpleType1>,<SimpleType2>,…)

Key rules surfaced explicitly:

  • Simple type names only (String, not java.lang.String)
  • Generics erased (List<String> → List)
  • No spaces after commas
  • <init> for constructors
  • Nested types dot-separated under the outer type

Verbatim calibration examples lifted from the fixture's actual stored FQNs.
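
The formatting rules above can be applied mechanically to a needle copied from an IDE or a stack trace. This is a minimal sketch, with `normalize_needle` as a hypothetical helper: it covers simple names, generic erasure, and comma spacing, but it does not handle varargs, and how the index stores nested types used as parameters is not assumed here.

```python
import re

_GENERIC = re.compile(r"<[^<>]*>")


def erase_generics(text: str) -> str:
    """Strip type arguments, innermost first: Map<String,List<Long>> -> Map."""
    previous = None
    while previous != text:
        previous, text = text, _GENERIC.sub("", text)
    return text


def normalize_needle(needle: str) -> str:
    """Rewrite Type#method(params) so params are simple, erased, and unspaced."""
    owner, sep, member = needle.partition("#")
    if not sep or "(" not in member:
        return needle.strip()  # type FQN or paren-less needle: nothing to normalise
    name, _, rest = member.partition("(")
    # Generics are erased on the parameter list only, so "<init>" stays intact.
    params = erase_generics(rest.rsplit(")", 1)[0])
    simple = [p.strip().rsplit(".", 1)[-1] for p in params.split(",") if p.strip()]
    return f"{owner.strip()}#{name.strip()}({','.join(simple)})"


print(normalize_needle("com.foo.Bar#baz(java.lang.String, java.util.List<java.lang.String>)"))
print(normalize_needle("com.foo.Bar#<init>()"))
```

Erasing generics before splitting on commas matters: otherwise a parameter such as Map<String, Long> would be split into two bogus parameters.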

Overload-recovery recipe — three options when you don't know the exact signature:

  1. Drop the parens entirely → simple-name lookup matches all overloads → pick the right one and re-query with full FQN+sig
  2. Pass the type FQN → fans out to every method via DECLARES
  3. codebase_search with auto_hybrid=true → recover the exact stored FQN

Plus a "How to find the FQN you need" subsection (how to extract FQNs from codebase_search/list_by_role/find_implementors results) and a warning to never pass phantom-row needles like ?HashMap<>#<init>(0).
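
The first two recovery options can be illustrated with a toy resolver over an in-memory list. Everything here is hypothetical: the stored FQNs are invented, and `resolve` only mimics the three needle shapes. In particular, the toy treats a paren-less needle as type plus method name, which may differ from how the real resolver performs its simple-name lookup.

```python
# Invented FQNs for illustration; real ones come from the index.
STORED = [
    "com.example.Foo#bar()",
    "com.example.Foo#bar(String)",
    "com.example.Foo#baz(int)",
    "com.example.Other#bar()",
]


def resolve(needle: str) -> list:
    """Toy resolver for the three needle shapes, most precise first."""
    if "#" in needle and "(" in needle:
        # FQN + signature: exact match, at most one overload.
        return [m for m in STORED if m == needle]
    if "#" in needle:
        # FQN without parens: every overload of that method.
        return [m for m in STORED if m.split("(")[0] == needle]
    # Type FQN: every method the type declares.
    return [m for m in STORED if m.split("#")[0] == needle]


def list_overloads(needle: str) -> list:
    """Recovery step: drop the parameter list and list all overloads."""
    return resolve(needle.split("(")[0])


print(resolve("com.example.Foo#bar()"))         # only the no-arg overload
print(list_overloads("com.example.Foo#bar()"))  # both overloads of bar
print(resolve("com.example.Foo"))               # every method declared by Foo
```

The point of the recovery step is that a short or empty result for `Foo#bar()` says nothing about `Foo#bar(String)`; listing the overloads first shows which signature to re-query.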

§C. Path templates — the normalised servlet form

Right-vs-wrong table for get_route_by_path / find_route_callers path_template arg, including the concatenation rule for class-level @RequestMapping + method-level @GetMapping.
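
The concatenation rule can be sketched as a small helper. This is an assumption-laden illustration: `join_route` is hypothetical, the example paths are invented, and the index's normalised form may apply further rules that are not modelled here. It only shows the class-level prefix and the method-level path being joined with exactly one slash between them.

```python
def join_route(class_prefix: str, method_path: str) -> str:
    """Join a class-level @RequestMapping value with a method-level mapping value."""
    parts = [p.strip("/") for p in (class_prefix, method_path)]
    return "/" + "/".join(p for p in parts if p)


# Invented annotation values, for illustration only.
print(join_route("/api/chats/", "{id}/messages"))
print(join_route("", "/health"))
print(join_route("/api/chats", ""))
```

Passing only the method-level value (for example "{id}/messages") as path_template is the mistake this guards against; when in doubt, the fallback of listing routes by prefix and copying the stored path avoids guessing.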

Other touch-ups

  • Forced reasoning preamble now nudges "sanity-check arguments before issuing the call" — most weak-model failures here are wrong-argument-shape, not wrong-tool-choice.
  • find_callers / find_callees tool reference: fixed an incorrect example (was com.foo.Bar#baz(java.lang.String) — wrong, parens take simple types only) and cross-linked to §B.
  • Recovery playbook: 3 new rows mapping these failure symptoms straight to fixes.
  • Slash aliases: /who-calls and /calls-from now explicitly say fqn-with-sig and document the simple-name fallback inline.

Diff

docs/AGENT-GUIDE.md | 134 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 127 insertions(+), 7 deletions(-)

Test baseline unchanged: 290 passed, 4 skipped.

Calibration source for the FQN examples: tests/bank-chat-system index built on master @ d62b48c.

@HumanBean17 HumanBean17 merged commit ea353c6 into master May 6, 2026
@HumanBean17 HumanBean17 deleted the docs/agent-guide-and-verification branch May 10, 2026 21:18