philtrem/codeagent-indexing-engine

CodeAgent Indexing Engine + MCP Server

In short: a structured code index for LLM agents — symbol graph, text search, semantic search, all in one SQLite file, served over MCP.

A local code indexing and retrieval engine for C#, TypeScript/React, and Rust codebases, written in Rust. Parses source into a symbol graph (nodes, edges, spans), embeds symbols for semantic search, stores everything in a single SQLite file, and exposes it over MCP so LLM agents and tools can navigate large projects without loading them into context.

The engine combines tree-sitter parsing with compiler-grade analysis (Roslyn for C#, TypeScript Language Service, rust-analyzer for Rust) to build a full symbol graph — classes, methods, call chains, inheritance hierarchies, interface implementations — then makes it searchable by keyword, qualified name, or semantic similarity. It tracks file changes incrementally, detects renames across edits, and keeps the index current without full rebuilds. Everything runs locally in a single SQLite file; nothing leaves your machine.


What it does

  • Parses C#, TypeScript/TSX, and Rust files using tree-sitter into a typed symbol graph (classes, interfaces, methods, properties, components, modules, traits, etc.)
  • Enriches symbols with compiler-grade analysis via Roslyn (C#), TypeScript Language Service, and rust-analyzer (Rust), adding resolved call graphs, inheritance, and implementations
  • Detects renames across edits using git history + token-level fingerprinting (Jaccard similarity)
  • Embeds symbols using LateOn-Code-edge (ColBERT multi-vector, 48-dim per token, ONNX, in-process) for vector similarity search
  • Watches the file system and incrementally re-indexes changed files
  • Detects dead code — finds unused symbols (methods, classes, properties) with no incoming calls, references, or implementations
  • Integrates with Claude Code via 4 lifecycle hooks: context-aware compaction, automatic re-indexing on file edits, subagent orientation, and post-task quality reports
  • Serves 18 tools over MCP (stdio transport) for symbol lookup, graph traversal, full-text search, semantic similarity search, dead code detection, file browsing, and more
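
To make the embedding bullet concrete, here is a minimal sketch of MaxSim, the late-interaction score that ColBERT-style multi-vector models use. It is an illustration only: the function names are hypothetical, the vectors are 2-dimensional toys rather than the 48-dimensional token vectors mentioned above, and nothing here is taken from the engine's source.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim late-interaction score: each query token vector is matched with its
/// most similar document token vector, and the per-token maxima are summed.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    if doc.is_empty() {
        return 0.0; // nothing to match against
    }
    query
        .iter()
        .map(|q| doc.iter().map(|d| dot(q, d)).fold(f32::NEG_INFINITY, f32::max))
        .sum()
}

fn main() {
    // Toy 2-dimensional token vectors (an assumption made for readability).
    let query: Vec<Vec<f32>> = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let doc_a: Vec<Vec<f32>> = vec![vec![1.0, 0.0], vec![0.5, 0.5]];
    let doc_b: Vec<Vec<f32>> = vec![vec![0.0, 0.5]];
    println!("doc_a score: {}", max_sim(&query, &doc_a));
    println!("doc_b score: {}", max_sim(&query, &doc_b));
}
```

Because every symbol keeps one vector per token, a query can match different parts of a symbol independently, which is what distinguishes this from single-vector cosine similarity.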

Architecture

codeagent-engine/
  crates/
    codeagent-core/       Core library — parsing, graph, storage, retrieval
    codeagent-cli/        Debug CLI (codeagent binary)
    codeagent-mcp/        MCP server (codeagent-mcp binary)
  extractors/
    csharp/               .NET 8 / .NET 10 Roslyn extractor (JSON-RPC over stdio)
    typescript/           Node.js TS Language Service extractor (JSON-RPC over stdio)
    rust/                 Rust extractor — LSP adapter wrapping rust-analyzer (JSON-RPC over stdio)

Storage

Each project is stored in a single SQLite file in WAL mode. Writes go through a single writer (a dedicated OS thread fed by an mpsc channel); reads use an r2d2 connection pool. The schema includes:

| Table | Purpose |
| --- | --- |
| nodes | symbols with identity keys, metadata, and content hashes |
| edges | typed relationships (calls, inherits, implements, contains, imports, ...) |
| node_spans | source locations with line ranges and span hashes |
| fts_nodes | FTS5 full-text index over symbol names and signatures |
| vec_nodes | embedding vectors for similarity search |
| deletion_log | journal for hard deletes and rename detection |

UUIDs stored as BLOB(16), content hashes as BLOB(32).
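
The single-writer arrangement described above can be sketched with the standard library alone. This is a hedged illustration of the pattern, not the engine's code: `WriteCmd` and `spawn_writer` are hypothetical names, and a `Vec<String>` stands in for the SQLite write connection.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical write command; the engine's real message type is not shown in this README.
enum WriteCmd {
    Insert(String),
    Delete(String),
}

/// Spawn the one writer thread. It owns the storage (a Vec standing in for the
/// SQLite write connection) and applies commands strictly in arrival order.
fn spawn_writer() -> (mpsc::Sender<WriteCmd>, thread::JoinHandle<Vec<String>>) {
    let (tx, rx) = mpsc::channel::<WriteCmd>();
    let handle = thread::spawn(move || {
        let mut rows: Vec<String> = Vec::new();
        for cmd in rx {
            match cmd {
                WriteCmd::Insert(name) => rows.push(name),
                WriteCmd::Delete(name) => rows.retain(|r| r != &name),
            }
        }
        rows // the loop ends once every sender has been dropped
    });
    (tx, handle)
}

fn main() {
    let (tx, writer) = spawn_writer();
    // Several producer threads share the channel; only the writer touches storage.
    let producers: Vec<_> = (0..4)
        .map(|i| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(WriteCmd::Insert(format!("symbol_{i}"))).unwrap())
        })
        .collect();
    for p in producers {
        p.join().unwrap();
    }
    tx.send(WriteCmd::Delete("symbol_0".to_string())).unwrap();
    drop(tx); // closing the last sender lets the writer loop end
    let rows = writer.join().unwrap();
    println!("writer stored {} rows", rows.len());
}
```

Funnelling all writes through one thread serialises them without any locking around the write connection, while readers proceed independently through the pool.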

Ingest pipeline

File changes flow through the pipeline:

  1. Project detection (find .csproj / package.json / tsconfig.json / Cargo.toml)
  2. Solution prebuild (C# only: generate synthetic .sln, dotnet restore, load Roslyn workspace)
  3. Rename detection (git + fingerprint + symbol-level matching)
  4. Semantic pre-analysis (all IPC before tree-sitter: Roslyn + TS Language Service provide final symbol keys so extraction avoids identity reconciliation)
  5. Syntactic parsing (tree-sitter adapters for C#, TypeScript, and Rust, parallelised via Rayon, using semantic keys from step 4)
  6. Apply semantic edges + attributes (DB writes only, no IPC)
  7. Deletions (hard-delete removed files, journaled)
  8. Semantic context changes (recompute edges when .csproj / tsconfig.json / Cargo.toml changes)
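
As an illustration of the fingerprint part of step 3, the sketch below computes Jaccard similarity over token sets and compares it with a threshold such as the `rename_similarity_threshold` of 0.80 from the example config. The tokenisation rule (split on anything that is not alphanumeric or `_`) and the function names are assumptions made for this example; the engine's actual fingerprinting is not described in this README.

```rust
use std::collections::HashSet;

/// Split source text into a set of identifier-like tokens.
/// Assumption: a token is a run of alphanumeric characters or underscores.
fn tokens(src: &str) -> HashSet<&str> {
    src.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
        .collect()
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
fn jaccard(a: &str, b: &str) -> f64 {
    let (ta, tb) = (tokens(a), tokens(b));
    let union = ta.union(&tb).count();
    if union == 0 {
        return 0.0; // two empty inputs give no evidence of a rename
    }
    ta.intersection(&tb).count() as f64 / union as f64
}

/// Treat a deleted/added pair as a rename candidate when similarity reaches the threshold.
fn is_probable_rename(old: &str, new: &str, threshold: f64) -> bool {
    jaccard(old, new) >= threshold
}

fn main() {
    let old = "fn load_user(id: u32, db: &Db) -> User { let row = db.find(id); User::from(row) }";
    let new = "fn fetch_user(id: u32, db: &Db) -> User { let row = db.find(id); User::from(row) }";
    println!("similarity = {:.3}", jaccard(old, new));
    println!("rename candidate at 0.80: {}", is_probable_rename(old, new, 0.80));
}
```

Only the function name differs between the two bodies, so most tokens are shared and the pair scores above the threshold.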

After the initial index, the IPC processes are shut down to reclaim memory. When a later, large incremental batch needs semantic analysis, the engine generates and loads a minimal solution containing only the touched projects, which avoids the cost of reloading the full workspace.

Retrieval

Hybrid search combines three channels — semantic similarity (ColBERT two-stage: centroid pre-filter + MaxSim re-rank), keyword matching (BM25 via FTS5), and qualified-name lookup — then merges results via Reciprocal Rank Fusion with configurable boosts for public API surface and reference counts.
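
A minimal sketch of Reciprocal Rank Fusion follows, using the `rrf_k` value of 60 from the example config. Each item scores the sum of 1 / (k + rank) over the lists it appears in, with ranks starting at 1. The function name and the sample symbol names are hypothetical, and the boosts for public API surface and reference counts are left out.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists (best first) with Reciprocal Rank Fusion.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in rankings {
        for (i, id) in list.iter().enumerate() {
            // Ranks are 1-based, so the top item contributes 1 / (k + 1).
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest score first; ties are broken by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}

fn main() {
    // One ranked list per channel; the contents are made up for the example.
    let semantic = vec!["AuthService.Login", "TokenStore.Get", "UserRepo.Find"];
    let keyword = vec!["TokenStore.Get", "AuthService.Login", "Session.Open"];
    let qualified = vec!["TokenStore.Get"];
    for (id, score) in rrf_fuse(&[semantic, keyword, qualified], 60.0) {
        println!("{score:.5}  {id}");
    }
}
```

RRF uses only rank positions, so channels with incomparable raw scores (BM25 versus MaxSim, for instance) can be merged without normalising them first.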


MCP tools

The server exposes 18 tools over stdio:

| Category | Tools |
| --- | --- |
| File system | list_directory, read_file, get_directory_tree |
| Search | search_symbols (keyword), lookup_symbol (qualified name), find_similar (semantic) |
| Navigation | get_symbol, get_source_spans, get_file_outline, get_callers, get_callees, get_implementations, get_references, get_dependencies, get_dependents, find_dead_code |
| Management | index_files, get_status |

All file access is sandboxed to the repository root.
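
One way such a sandbox check could look is sketched below: a requested path is resolved lexically against the repository root and rejected if it would escape it. The function name is hypothetical and the check is purely lexical; a real implementation would also need to account for symlinks (compare the `follow_symlinks` setting in the example config).

```rust
use std::path::{Component, Path, PathBuf};

/// Resolve `requested` against `root` and return None if the result would
/// leave the root. Lexical only: symlinks are not resolved here.
fn resolve_in_root(root: &Path, requested: &str) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    let mut depth = 0usize; // number of components currently below root
    for comp in Path::new(requested).components() {
        match comp {
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                if depth == 0 {
                    return None; // ".." would climb above the root
                }
                resolved.pop();
                depth -= 1;
            }
            // Absolute paths and Windows drive prefixes are rejected outright.
            Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(resolved)
}

fn main() {
    let root = Path::new("/repo");
    for req in ["src/main.rs", "src/../Cargo.toml", "../secrets.txt", "/etc/passwd"] {
        println!("{req:>20} -> {:?}", resolve_in_root(root, req));
    }
}
```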


Building

Requires Rust 1.70+ and Cargo.

cd codeagent-engine
cargo build --release

The two binaries end up in target/release/:

  • codeagent — debug CLI
  • codeagent-mcp — MCP server

Optional: language extractors

For semantic enrichment beyond tree-sitter (resolved types, call graphs):

C# (Roslyn) — requires .NET 8 SDK or .NET 10 SDK (or both):

cd extractors/csharp
dotnet build -c Release

The project multi-targets net8.0 and net10.0. Building produces output under both bin/Release/net8.0/ and bin/Release/net10.0/. The engine auto-detects which .NET runtimes are installed and selects the best matching binary at launch.

TypeScript — requires Node.js 18+:

cd extractors/typescript
npm install && npm run build

Rust — requires Rust toolchain and rust-analyzer:

rustup component add rust-analyzer
cd extractors/rust
cargo build --release

Configure extractor paths in .codeagent/config.json:

{
  "indexing": {
    "csharp_extractor_path": "path/to/bin/Release/net8.0/CodeAgentExtractor.dll",
    "typescript_extractor_path": "path/to/dist/index.js",
    "rust_extractor_path": "path/to/extractors/rust/target/release/codeagent-rust-extractor"
  }
}

For the C# extractor, point csharp_extractor_path at any TFM-specific DLL (e.g., the net8.0/ copy). The engine will automatically check sibling TFM directories and pick the one matching your installed .NET runtime — so the same config works whether you have .NET 8, .NET 10, or both installed.
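
The sibling-TFM lookup can be pictured with the sketch below. It is an assumption-laden illustration, not the engine's logic: `pick_extractor` is a hypothetical name, the list of installed TFMs is passed in rather than detected, and the preference order (configured TFM first, otherwise the first installed one) is a guess. A real implementation would presumably also verify that the chosen DLL exists on disk.

```rust
use std::path::{Path, PathBuf};

/// Given the configured DLL path (ending in `<tfm>/<file>`) and the TFMs whose
/// runtime is installed, return the path to launch.
fn pick_extractor(configured: &Path, installed_tfms: &[&str]) -> Option<PathBuf> {
    let file = configured.file_name()?;
    let tfm_dir = configured.parent()?;
    let configured_tfm = tfm_dir.file_name()?.to_str()?;
    // Assumption: keep the configured TFM when its runtime is available.
    if installed_tfms.contains(&configured_tfm) {
        return Some(configured.to_path_buf());
    }
    // Otherwise swap in the first installed TFM as a sibling directory.
    let release_dir = tfm_dir.parent()?;
    installed_tfms
        .first()
        .map(|tfm| release_dir.join(tfm).join(file))
}

fn main() {
    let configured = Path::new("path/to/bin/Release/net8.0/CodeAgentExtractor.dll");
    println!("{:?}", pick_extractor(configured, &["net8.0", "net10.0"]));
    println!("{:?}", pick_extractor(configured, &["net10.0"]));
    println!("{:?}", pick_extractor(configured, &[]));
}
```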

Without extractors, indexing falls back to syntactic-only mode (tree-sitter). You still get symbols, containment, and imports — just not resolved call graphs or interface implementations.


Getting started

cd your-project
codeagent init

This creates .codeagent/ (config, database), adds the DB to .gitignore, and registers 4 Claude Code lifecycle hooks in .claude/settings.json. Then start the MCP server:

codeagent-mcp

Configuration

codeagent init creates .codeagent/config.json with sensible defaults. All fields are optional.

Example config
{
  "indexing": {
    "safe_mode": true,
    "write_debounce_ms": 2000,
    "rename_similarity_threshold": 0.80,
    "follow_symlinks": false
  },
  "embedding": {
    "model_name": "lightonai/LateOn-Code-edge",
    "dimensionality": 48,
    "batch_size": 64,
    "prefilter_k": 100
  },
  "retrieval": {
    "max_output_tokens": 16384,
    "rrf_k": 60
  },
  "mcp": {
    "max_results": 50,
    "max_file_size": 524288
  }
}

Environment variable overrides follow the pattern CODEAGENT_<SECTION>_<KEY> (e.g., CODEAGENT_INDEXING_SAFE_MODE=false).
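
The naming pattern can be expressed as a small helper. The sketch below builds the variable name from a section and key, then applies a boolean override on top of the value from the config file. The helper names are hypothetical, and the accepted value syntax (only `true` and `false`) is an assumption based on the example above.

```rust
/// Build the override variable name for a config section and key.
fn env_var_name(section: &str, key: &str) -> String {
    format!("CODEAGENT_{}_{}", section.to_uppercase(), key.to_uppercase())
}

/// Return the override if `lookup` yields a parsable boolean, else the file value.
/// `lookup` is injected so the logic can be exercised without touching the
/// process environment.
fn bool_setting(
    lookup: impl Fn(&str) -> Option<String>,
    section: &str,
    key: &str,
    file_value: bool,
) -> bool {
    lookup(&env_var_name(section, key))
        .and_then(|raw| raw.trim().parse::<bool>().ok())
        .unwrap_or(file_value)
}

fn main() {
    let from_env = |name: &str| std::env::var(name).ok();
    println!("{}", env_var_name("indexing", "safe_mode"));
    println!("safe_mode = {}", bool_setting(from_env, "indexing", "safe_mode", true));
}
```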


Claude Code hooks

codeagent init registers four hooks that run automatically during Claude Code sessions:

| Hook | Trigger | What it does |
| --- | --- | --- |
| PreCompact | Before context compaction | Injects a PageRank-ranked table of the 30 most central symbols so they survive compaction |
| PostToolUse | After Edit / Write / NotebookEdit | Silently re-indexes the changed file so the graph stays current |
| SubagentStart | When a subagent spawns | Provides a project overview (stats, top 15 symbols, available MCP tools) |
| TaskCompleted | When a task finishes | Reports potentially unused symbols (dead code) and unresolved references |

Hooks communicate over stdin/stdout JSON. Errors are logged to stderr and are non-blocking; hooks never block Claude Code execution.
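
To show what "PageRank-ranked" means for the PreCompact and SubagentStart hooks, here is a minimal power-iteration sketch over a directed graph. It rests on several assumptions not confirmed by this README: edges point from the referencing symbol to the referenced one, the damping factor is the conventional 0.85, and the function names are hypothetical. Edge endpoints must be smaller than `n`.

```rust
/// Power-iteration PageRank over `n` nodes and directed `edges` (from, to).
fn pagerank(n: usize, edges: &[(usize, usize)], damping: f64, iterations: usize) -> Vec<f64> {
    if n == 0 {
        return Vec::new();
    }
    let mut out_degree = vec![0usize; n];
    for &(from, _) in edges {
        out_degree[from] += 1;
    }
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        // Rank held by nodes without outgoing edges is spread evenly over all nodes.
        let dangling: f64 = (0..n).filter(|&i| out_degree[i] == 0).map(|i| rank[i]).sum();
        let base = (1.0 - damping) / n as f64 + damping * dangling / n as f64;
        let mut next = vec![base; n];
        for &(from, to) in edges {
            next[to] += damping * rank[from] / out_degree[from] as f64;
        }
        rank = next;
    }
    rank
}

/// Indices of the `k` highest-ranked nodes, best first.
fn top_k(rank: &[f64], k: usize) -> Vec<usize> {
    let mut ids: Vec<usize> = (0..rank.len()).collect();
    ids.sort_by(|&a, &b| rank[b].total_cmp(&rank[a]));
    ids.truncate(k);
    ids
}

fn main() {
    // 0 and 1 both call 2; 2 calls 0.
    let edges = [(0, 2), (1, 2), (2, 0)];
    let rank = pagerank(3, &edges, 0.85, 50);
    println!("ranks: {rank:?}");
    println!("top 2: {:?}", top_k(&rank, 2));
}
```

Symbols that many others depend on accumulate rank, which is why a top-N table of this kind is a reasonable summary to preserve across compaction.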


CLI

The codeagent binary provides project setup, hook handling, and database inspection:

# Set up a project
codeagent init [--repo-root <path>]

# Database inspection
codeagent --db .codeagent/index.db get-node <uuid>
codeagent --db .codeagent/index.db get-outline <file-id>
codeagent --db .codeagent/index.db filter --query "authenticate" --node-type method
codeagent --db .codeagent/index.db lookup "MyApp.Auth.AuthService"
codeagent --db .codeagent/index.db health

# Hook handlers (called automatically by Claude Code, not manually)
codeagent hook pre-compact
codeagent hook post-tool-use
codeagent hook subagent-start
codeagent hook task-completed

Tests

# Unit and integration tests (417 core + 41 fixture + 31 MCP + 5 CLI)
cargo test

# Rust extractor tests (separate binary, outside workspace)
cd extractors/rust && cargo test

# OSS integration tests — indexes real repos (tRPC, Hot Chocolate GraphQL, rust-analyzer)
# First run clones repos; subsequent runs use cached clones
cargo test -p codeagent-core --features oss-tests --test oss_tests -- --nocapture

# Pipeline benchmarks only (feature-gated)
cargo test -p codeagent-core --features oss-tests test_oss_hc_pipeline_benchmark -- --exact --nocapture
cargo test -p codeagent-core --features oss-tests test_oss_trpc_pipeline_benchmark -- --exact --nocapture
cargo test -p codeagent-core --features oss-tests test_oss_rust_analyzer_pipeline_benchmark -- --exact --nocapture

The suite comprises 494 workspace tests, 27 Rust extractor tests, and 31 OSS integration tests (feature-gated behind --features oss-tests). Together they cover parsing, graph operations, invalidation, rename detection, retrieval, dead code detection, PageRank, MCP tool behavior, and CLI init/hooks. They also include end-to-end pipeline benchmarks (initial index, idempotent reindex, touched-file reindex with minimal solution reload) and query benchmarks against real-world codebases (tRPC for TypeScript, Hot Chocolate for C#, rust-analyzer for Rust).


Graph model

Node types: File, Module, Project, Class, Interface, Method, Property, Constructor, Type, Component
Edge types: Calls, Inherits, Implements, Imports, Overrides, References, Contains, Accepts, Extends
Languages: C# (csharp), TypeScript (typescript), Rust (rust)
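
With this model, the dead code check described earlier (no incoming calls, references, or implementations) reduces to a scan over edges. The sketch below is a simplified illustration with hypothetical types: it uses only four of the edge types, identifies nodes by integer index, and ignores self-edges on the assumption that a symbol that only uses itself is still unused.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy)]
enum EdgeType {
    Calls,
    References,
    Implements,
    Contains,
}

struct Edge {
    from: usize,
    to: usize,
    kind: EdgeType,
}

/// Return ids of nodes that no Calls / References / Implements edge points at.
/// Contains edges are skipped: being declared inside a class is not a use.
fn find_dead(node_count: usize, edges: &[Edge]) -> Vec<usize> {
    let used: HashSet<usize> = edges
        .iter()
        .filter(|e| e.from != e.to) // a self-call does not keep a symbol alive
        .filter(|e| {
            matches!(e.kind, EdgeType::Calls | EdgeType::References | EdgeType::Implements)
        })
        .map(|e| e.to)
        .collect();
    (0..node_count).filter(|id| !used.contains(id)).collect()
}

fn main() {
    // 0 = class Service, 1 = interface IService, 2 = Run, 3 = Helper, 4 = Unused
    let edges = vec![
        Edge { from: 0, to: 1, kind: EdgeType::Implements },
        Edge { from: 0, to: 2, kind: EdgeType::Contains },
        Edge { from: 0, to: 3, kind: EdgeType::Contains },
        Edge { from: 0, to: 4, kind: EdgeType::Contains },
        Edge { from: 2, to: 3, kind: EdgeType::Calls },
        Edge { from: 2, to: 0, kind: EdgeType::References },
        Edge { from: 4, to: 4, kind: EdgeType::Calls },
    ];
    println!("potentially unused: {:?}", find_dead(5, &edges));
}
```

Note that `Run` (node 2) is reported even though it is probably an entry point, which is one reason results of this kind are best read as "potentially unused".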

Symbol identity is stable across edits. The identity key is (language, project_id, symbol_key, symbol_disambiguator) — overload-safe for C# (includes parameter types) and file-scoped for TypeScript (includes a deterministic file ID derived from the path).
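
The effect of the identity key can be illustrated with a struct used as a map key. The field names follow the tuple above; the field types, the helper `csharp_method`, and the disambiguator format are assumptions made for this sketch.

```rust
use std::collections::HashMap;

/// Identity tuple from the paragraph above; field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct SymbolIdentity {
    language: String,
    project_id: u32,
    symbol_key: String,
    symbol_disambiguator: String,
}

/// Hypothetical helper: a C# method whose disambiguator is its parameter type list.
fn csharp_method(name: &str, param_types: &str) -> SymbolIdentity {
    SymbolIdentity {
        language: "csharp".to_string(),
        project_id: 1,
        symbol_key: format!("MyApp.Parser.{name}"),
        symbol_disambiguator: param_types.to_string(),
    }
}

fn main() {
    let mut index: HashMap<SymbolIdentity, &str> = HashMap::new();
    // Two overloads share a symbol_key but differ in the disambiguator.
    index.insert(csharp_method("Parse", "(string)"), "v1 body");
    index.insert(csharp_method("Parse", "(string,int)"), "v1 body");
    // Re-indexing after an edit yields the same key, so the entry is updated in place.
    index.insert(csharp_method("Parse", "(string)"), "v2 body");
    println!("{} distinct symbols", index.len());
    println!("{:?}", index.get(&csharp_method("Parse", "(string)")));
}
```

Because the key does not depend on the method body or its position in the file, edits that leave the signature unchanged map back to the same node.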
