
feat: replace MCP env-var config with bind-mounted YAML config files#280

Open
rshoemaker wants to merge 3 commits into main from
feat/PLAT-444/mcp_service_config

Conversation

rshoemaker (Contributor) commented Mar 2, 2026

Replace environment-variable-based MCP service configuration with CP-generated YAML config files (config.yaml, tokens.yaml, users.yaml) bind-mounted into the service container at /app/data/.

config.yaml is CP-owned and regenerated on every update, carrying database connection, LLM provider, tool toggles, and pool settings. tokens.yaml and users.yaml are bootstrap-once: written on initial create if init_token/init_users are provided, then left for the running MCP server to manage. This separates CP provisioning concerns from application-level auth management.
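For orientation only, a CP-generated config.yaml might look roughly like this; every key name below is illustrative, since the actual schema is defined by the CP generator (GenerateMCPConfig) rather than documented in this PR:

```yaml
# Hypothetical sketch of the CP-owned config.yaml; key names are illustrative.
database:
  host: db1-postgres        # resolved from the co-located Postgres instance
  port: 5432                # internal port, per plan_update
llm:
  provider: openai          # validated against the supported provider list
  api_key: "<redacted>"     # sensitive keys are redacted in API responses
pool:
  max_connections: 10
tools:
  enable_query: true        # per-tool toggles
```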

Key changes:

  • Add MCPServiceConfig typed parser with validation for all supported config keys (LLM, embedding, pool, tool toggles, bootstrap auth)
  • Add MCPConfigResource lifecycle (Create writes all three files, Update regenerates only config.yaml, Refresh verifies file exists)
  • Add token/user file generators with SHA-256 and bcrypt hashing
  • Wire MCPConfigResource into the service instance resource chain between DirResource and ServiceInstanceSpec
  • Resolve database host/port from co-located Postgres instance in plan_update for the config.yaml connection block
  • Redact sensitive config keys (API keys, init_users) from API responses
  • Align service user grants with MCP server security guide: public schema only, pg_read_all_settings, no spock access
  • Add TestUpdateMCPServiceConfig E2E test for config update path

PLAT-444 PLAT-462


coderabbitai bot commented Mar 2, 2026

📝 Walkthrough

This PR adds typed MCP service config parsing and validation, generates the MCP YAML and auth files, introduces an MCPConfigResource with a file-based config lifecycle, updates the orchestrator to bind-mount config files instead of injecting environment variables, distinguishes bootstrap from update validation, and adds tests, including an e2e in-place update test.

Changes

  • E2E Tests (e2e/service_provisioning_test.go): Adds TestUpdateMCPServiceConfig to exercise in-place MCP config updates; the test function appears duplicated in the diff.
  • API Validation & Sensitivity (server/internal/api/apiv1/convert.go, server/internal/api/apiv1/validate.go, server/internal/api/apiv1/validate_test.go): Adds an isUpdate flag to validation paths, delegates MCP config validation to ParseMCPServiceConfig, treats init_users as sensitive, and updates validation messages and test call sites.
  • Database MCP Config Parsing (server/internal/database/mcp_service_config.go, server/internal/database/mcp_service_config_test.go): Adds MCPServiceConfig and ParseMCPServiceConfig with extensive validation (required fields, provider-specific rules, init_users bootstrap-only, embedding checks, unknown-key detection) and comprehensive unit tests.
  • Orchestrator — MCP Config Generation (server/internal/orchestrator/swarm/mcp_config.go, server/internal/orchestrator/swarm/mcp_config_test.go): Adds GenerateMCPConfig and tests to produce MCP server YAML from parsed config, handling defaults, provider credentials, embedding, and tool toggles.
  • Orchestrator — MCP Auth Files (server/internal/orchestrator/swarm/mcp_auth_files.go, server/internal/orchestrator/swarm/mcp_auth_files_test.go): Adds generators for tokens.yaml and users.yaml (SHA-256 for tokens, bcrypt for passwords), plus empty-file helpers and unit tests validating structure and hashing.
  • Orchestrator — MCP Config Resource (server/internal/orchestrator/swarm/mcp_config_resource.go): Introduces MCPConfigResource, which manages config.yaml (always written on create/update) and creates token/user files only if missing; integrates with the resource lifecycle and credential population.
  • Orchestrator — Service Integration (server/internal/orchestrator/swarm/orchestrator.go, server/internal/orchestrator/swarm/resources.go, server/internal/orchestrator/swarm/service_instance_spec.go, server/internal/orchestrator/swarm/service_spec.go, server/internal/orchestrator/swarm/service_spec_test.go, server/internal/orchestrator/swarm/service_user_role.go): Parses MCP config early, adds a data directory resource and DataDirID wiring, mounts the host data path into the container, replaces env-based config with a file-based bind mount and entrypoint override, updates service user role creation/grants, and registers MCPConfigResource.
  • Workflows & Plan (server/internal/workflows/activities/generate_service_instance_resources.go, server/internal/workflows/plan_update.go): Runs the GenerateServiceInstanceResources activity on a host-specific queue and simplifies Postgres port resolution to use the internal port 5432 consistently.

Poem

🐰 I munched a config, hashed a token bright,
Wrote users with care into YAML night,
Mounts snug and steady, no envs to roam,
Bootstrap stays special, updates keep home,
Hopping on updates — in place, all right! 🥕

🚥 Pre-merge checks (2 passed, 1 failed)

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 41.94%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2)
  • Title check ✅ Passed: The title clearly and concisely summarizes the main change: replacing environment-variable-based MCP configuration with bind-mounted YAML config files.
  • Description check ✅ Passed: The PR description provides a comprehensive summary aligned with the required template structure.



coderabbitai bot left a comment
Actionable comments posted: 7

🧹 Nitpick comments (3)
server/internal/orchestrator/swarm/service_spec_test.go (1)

148-195: Consider adding checkContainer validation for resource limits test.

This test case sets DataPath (line 176) but doesn't include a checkContainer validation function. While the focus is on resource limits, the absence of container spec validation means regressions in the bind mount or command configuration won't be caught when resource limits are applied.

Consider adding at least a minimal checkContainer to verify the bind mount is still correctly configured when resource limits are present.

Suggested addition
 			DataPath:          "/var/lib/pgedge/services/db1-mcp-server-host1",
 		},
 		wantErr: false,
+		checkContainer: func(t *testing.T, spec *swarm.ContainerSpec) {
+			// Verify bind mount is present even with resource limits
+			if len(spec.Mounts) != 1 {
+				t.Errorf("got %d mounts, want 1", len(spec.Mounts))
+			}
+		},
 		checkResources: func(t *testing.T, resources *swarm.ResourceRequirements) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/orchestrator/swarm/service_spec_test.go` around lines 148 -
195, Add a minimal checkContainer to the "service with resource limits" test
case to validate the bind mount is still configured when resource limits are
applied: inside the test case (the one using ServiceContainerSpecOptions with
DataPath set), add a checkContainer func that asserts the created container spec
is non-nil, has mounts, and includes at least one mount whose Source or Target
matches the DataPath value from the test case; keep the existing checkResources
intact so both resource limits (checkResources) and container bind-mount
validation (checkContainer) run for this case.
server/internal/orchestrator/swarm/service_instance_spec.go (1)

62-68: Make the data-dir dependency explicit (and validate DataDirID).

Refresh reads DirResource directly, but Dependencies() relies on an implicit transitive path. Declaring the dir dependency directly makes orchestration order explicit and safer.

♻️ Proposed refactor
 func (s *ServiceInstanceSpecResource) Dependencies() []resource.Identifier {
 	// Service instances depend on the database network, service user role, and MCP config
 	return []resource.Identifier{
 		NetworkResourceIdentifier(s.DatabaseNetworkID),
 		ServiceUserRoleIdentifier(s.ServiceInstanceID),
+		filesystem.DirResourceIdentifier(s.DataDirID),
 		MCPConfigResourceIdentifier(s.ServiceInstanceID),
 	}
 }
@@
 	// Resolve the data directory path from the DirResource
+	if s.DataDirID == "" {
+		return fmt.Errorf("data_dir_id is required")
+	}
 	dataPath, err := filesystem.DirResourceFullPath(rc, s.DataDirID)
As per coding guidelines `server/internal/orchestrator/swarm/**/*.go`: Docker Swarm integration should use bind mounts for configuration and data directories.

Also applies to: 95-99

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/orchestrator/swarm/service_instance_spec.go` around lines 62
- 68, Update ServiceInstanceSpecResource to explicitly declare and validate the
data-dir dependency: in Dependencies() add the directory resource identifier
(use DirResourceIdentifier(s.DataDirID) alongside NetworkResourceIdentifier,
ServiceUserRoleIdentifier, MCPConfigResourceIdentifier) so orchestration
ordering is explicit; in Refresh() validate that s.DataDirID is present and that
the DirResource read (DirResource) is available/valid, returning an error if
DataDirID is empty or the DirResource is missing/invalid. Ensure references use
the existing symbols ServiceInstanceSpecResource, Dependencies(), Refresh(),
DataDirID, DirResourceIdentifier and DirResource.
server/internal/database/mcp_service_config.go (1)

84-447: Prefer typed package errors over free-form validation strings.

The parser emits many raw string errors, which makes consistent downstream classification/mapping harder. Consider package-level typed/domain errors with field and reason metadata.

As per coding guidelines "Domain-specific errors should be defined in each package; API errors should be mapped to HTTP status codes via Goa".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/database/mcp_service_config.go` around lines 84 - 447,
Replace free-form fmt.Errorf validation strings with package-level typed/domain
errors and a small structured error type so callers can inspect field and
reason; introduce errors like ErrUnknownKey, ErrMissingField, ErrInvalidType,
ErrInvalidValue, ErrDuplicateUsername (or a single ConfigError struct with Field
and Reason) and return those from ParseMCPServiceConfig helper functions
(validateUnknownKeys, requireString, requireStringForProvider, optionalString,
optionalBool, optionalFloat64, optionalInt, parseInitUsers) instead of raw
strings; update the returned []error values to wrap/contain the field name and a
canonical error kind so downstream code can match on error identity or inspect
Field/Reason metadata. Ensure existing call sites (e.g., ParseMCPServiceConfig)
still accumulate []error but now receive the typed errors to allow mappings to
HTTP codes elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e/service_provisioning_test.go`:
- Around line 871-874: The test currently asserts db.ServiceInstances[0].State
== "running" immediately which can flake due to StoreServiceInstance upserting a
transient "creating" state; change the assertion to poll until the state becomes
"running" (keep the require.Len check) by using a retry helper such as
require.Eventually or a small loop with timeout/tick that reads
db.ServiceInstances and checks ServiceInstances[0].State == "running" to allow
the monitor to converge; reference db.ServiceInstances (and the upsert behavior
from StoreServiceInstance) when locating the assertion to replace.

In `@server/internal/database/mcp_service_config.go`:
- Around line 198-214: The validation currently only checks embedding_model and
embedding_api_key when embeddingProvider is present; add a guard to reject
orphan embedding fields when embeddingProvider is nil: if embeddingProvider ==
nil and (embeddingModel != nil || embeddingAPIKey != nil) append appropriate
errors to errs (e.g., "embedding_model must not be set without
embedding_provider" and "embedding_api_key must not be set without
embedding_provider"). Place this check alongside the existing embedding-provider
cross-validation (referencing embeddingProvider, embeddingModel,
embeddingAPIKey, errs and validEmbeddingProviders) so configs with orphan
embedding fields fail validation early.

In `@server/internal/orchestrator/swarm/mcp_config_resource.go`:
- Around line 99-159: The Create and Update methods in MCPConfigResource assume
r.Config is non-nil and later methods (writeConfigFile, writeTokenFileIfNeeded,
writeUserFileIfNeeded) dereference it; add a guard at the start of both Create
and Update after populateCredentials that returns a clear error if r.Config ==
nil (or skip token/user generation when Config is nil), ensuring writeConfigFile
and the token/user writers never receive a nil config; reference
MCPConfigResource, Create, Update, populateCredentials, writeConfigFile,
writeTokenFileIfNeeded, and writeUserFileIfNeeded when locating where to perform
this nil check.
- Around line 90-94: The refresh currently only checks config.yaml via
readResourceFile(dirPath, "config.yaml") so add validation for all managed files
(e.g., tokens.yaml and users.yaml) before returning success; update the logic
around readResourceFile to iterate a list of required filenames (at minimum
"config.yaml", "tokens.yaml", "users.yaml"), call readResourceFile for each
using the same dirPath, and return a wrapped error that includes the missing
filename when any read fails so the Refresh function fails fast on missing
managed files.

In `@server/internal/orchestrator/swarm/mcp_config_test.go`:
- Around line 11-13: Remove the three unused helper functions float64Ptr,
intPtrMCP, and boolPtrMCP from
server/internal/orchestrator/swarm/mcp_config_test.go to satisfy lint; locate
and delete the function declarations for float64Ptr(f float64) *float64,
intPtrMCP(i int) *int, and boolPtrMCP(b bool) *bool, and run tests/linter to
confirm nothing else references them before committing.

In `@server/internal/orchestrator/swarm/mcp_config.go`:
- Around line 95-97: GenerateMCPConfig currently dereferences params.Config
without checks; add input guards at the start of GenerateMCPConfig to verify
params != nil and params.Config != nil and return a clear typed error (not
panic) when invalid. Update callers/tests if needed to handle the returned
error; reference the function name GenerateMCPConfig and the field access
params.Config when locating the change.

In `@server/internal/orchestrator/swarm/service_spec.go`:
- Around line 87-90: Validate opts.DataPath before constructing mounts: if
opts.DataPath is empty, return a clear error immediately instead of calling
docker.BuildMount. Update the code around where mounts is built (the mounts
variable and call to docker.BuildMount) to perform a check like if opts.DataPath
== "" { return fmt.Errorf("data path required for service mount") } so the
function fails fast with a helpful message rather than allowing Docker to fail
later.


ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b81d259 and 4b66b6a.

📒 Files selected for processing (19)
  • e2e/service_provisioning_test.go
  • server/internal/api/apiv1/convert.go
  • server/internal/api/apiv1/validate.go
  • server/internal/api/apiv1/validate_test.go
  • server/internal/database/mcp_service_config.go
  • server/internal/database/mcp_service_config_test.go
  • server/internal/orchestrator/swarm/mcp_auth_files.go
  • server/internal/orchestrator/swarm/mcp_auth_files_test.go
  • server/internal/orchestrator/swarm/mcp_config.go
  • server/internal/orchestrator/swarm/mcp_config_resource.go
  • server/internal/orchestrator/swarm/mcp_config_test.go
  • server/internal/orchestrator/swarm/orchestrator.go
  • server/internal/orchestrator/swarm/resources.go
  • server/internal/orchestrator/swarm/service_instance_spec.go
  • server/internal/orchestrator/swarm/service_spec.go
  • server/internal/orchestrator/swarm/service_spec_test.go
  • server/internal/orchestrator/swarm/service_user_role.go
  • server/internal/workflows/activities/generate_service_instance_resources.go
  • server/internal/workflows/plan_update.go

postgres.Statement{SQL: fmt.Sprintf("GRANT SELECT ON ALL TABLES IN SCHEMA public TO %s;", sanitizeIdentifier(r.Username))},
postgres.Statement{SQL: fmt.Sprintf("ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO %s;", sanitizeIdentifier(r.Username))},
// Allow viewing PostgreSQL configuration via diagnostic tools
postgres.Statement{SQL: fmt.Sprintf("GRANT pg_read_all_settings TO %s;", sanitizeIdentifier(r.Username))},
Member commented:
Is this the only missing permission on the existing read_only role? We don't document those canned roles, and I don't know that anyone is actually using it, so I'm wondering if we should just add this permission to that role.

rshoemaker (Contributor, Author) replied:

I'm not sure, but it feels like we're going to need service-specific roles, since each service will have its own baseline of required db permissions.
