feat: replace MCP env-var config with bind-mounted YAML config files #280
rshoemaker wants to merge 3 commits into `main`
Conversation
Replace environment-variable-based MCP service configuration with
CP-generated YAML config files (config.yaml, tokens.yaml, users.yaml)
bind-mounted into the service container at /app/data/.
config.yaml is CP-owned and regenerated on every update, carrying
database connection, LLM provider, tool toggles, and pool settings.
tokens.yaml and users.yaml are bootstrap-once: written on initial
create if init_token/init_users are provided, then left for the
running MCP server to manage. This separates CP provisioning concerns
from application-level auth management.
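For orientation, a sketch of what the CP-generated config.yaml could contain; every key name below is illustrative only, the real schema is whatever MCPServiceConfig defines:

```yaml
# Illustrative sketch -- actual keys are defined by MCPServiceConfig
database:
  host: db1-postgres-n1   # resolved from the co-located Postgres instance
  port: 5432
llm:
  provider: openai
  api_key: "<redacted>"   # redacted in API responses
tools:
  enable_read_queries: true
pool:
  max_connections: 10
```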
Key changes:
- Add MCPServiceConfig typed parser with validation for all supported
config keys (LLM, embedding, pool, tool toggles, bootstrap auth)
- Add MCPConfigResource lifecycle (Create writes all three files,
Update regenerates only config.yaml, Refresh verifies file exists)
- Add token/user file generators with SHA-256 and bcrypt hashing
- Wire MCPConfigResource into the service instance resource chain
between DirResource and ServiceInstanceSpec
- Resolve database host/port from co-located Postgres instance in
plan_update for the config.yaml connection block
- Redact sensitive config keys (API keys, init_users) from API
responses
- Align service user grants with MCP server security guide: public
schema only, pg_read_all_settings, no spock access
- Add TestUpdateMCPServiceConfig E2E test for config update path
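As a rough illustration of the SHA-256 hashing mentioned for the token file generator (`hashToken` is a hypothetical helper, not the PR's actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex-encoded SHA-256 digest of a raw bootstrap
// token, so a tokens.yaml-style file never stores the plaintext token.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashToken("example-init-token"))
}
```

The same pattern applies to users.yaml, except bcrypt (a salted, adaptive hash) is the better fit for passwords since equal inputs must not produce equal stored hashes.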
PLAT-444
📝 Walkthrough
This PR adds typed MCP service config parsing and validation, generates MCP YAML and auth files, introduces an MCPConfigResource with a file-based config lifecycle, updates the orchestrator to bind-mount configs (vs env vars), distinguishes bootstrap vs update validation, and adds tests including an e2e in-place update test.
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 7
🧹 Nitpick comments (3)
server/internal/orchestrator/swarm/service_spec_test.go (1)
148-195: Consider adding `checkContainer` validation for the resource limits test.

This test case sets `DataPath` (line 176) but doesn't include a `checkContainer` validation function. While the focus is on resource limits, the absence of container spec validation means regressions in the bind mount or command configuration won't be caught when resource limits are applied. Consider adding at least a minimal `checkContainer` to verify the bind mount is still correctly configured when resource limits are present.

Suggested addition
```diff
 DataPath: "/var/lib/pgedge/services/db1-mcp-server-host1",
 },
 wantErr: false,
+checkContainer: func(t *testing.T, spec *swarm.ContainerSpec) {
+	// Verify bind mount is present even with resource limits
+	if len(spec.Mounts) != 1 {
+		t.Errorf("got %d mounts, want 1", len(spec.Mounts))
+	}
+},
 checkResources: func(t *testing.T, resources *swarm.ResourceRequirements) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/swarm/service_spec_test.go` around lines 148 - 195, add a minimal checkContainer to the "service with resource limits" test case to validate the bind mount is still configured when resource limits are applied: inside the test case (the one using ServiceContainerSpecOptions with DataPath set), add a checkContainer func that asserts the created container spec is non-nil, has mounts, and includes at least one mount whose Source or Target matches the DataPath value from the test case; keep the existing checkResources intact so both resource limits (checkResources) and container bind-mount validation (checkContainer) run for this case.

server/internal/orchestrator/swarm/service_instance_spec.go (1)
62-68: Make the data-dir dependency explicit (and validate `DataDirID`).

`Refresh` reads `DirResource` directly, but `Dependencies()` relies on an implicit transitive path. Declaring the dir dependency directly makes orchestration order explicit and safer. As per coding guidelines `server/internal/orchestrator/swarm/**/*.go`: Docker Swarm integration should use bind mounts for configuration and data directories.

♻️ Proposed refactor
```diff
 func (s *ServiceInstanceSpecResource) Dependencies() []resource.Identifier {
 	// Service instances depend on the database network, service user role, and MCP config
 	return []resource.Identifier{
 		NetworkResourceIdentifier(s.DatabaseNetworkID),
 		ServiceUserRoleIdentifier(s.ServiceInstanceID),
+		filesystem.DirResourceIdentifier(s.DataDirID),
 		MCPConfigResourceIdentifier(s.ServiceInstanceID),
 	}
 }
@@
 	// Resolve the data directory path from the DirResource
+	if s.DataDirID == "" {
+		return fmt.Errorf("data_dir_id is required")
+	}
 	dataPath, err := filesystem.DirResourceFullPath(rc, s.DataDirID)
```

Also applies to: 95-99
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/swarm/service_instance_spec.go` around lines 62 - 68, update ServiceInstanceSpecResource to explicitly declare and validate the data-dir dependency: in Dependencies() add the directory resource identifier (use DirResourceIdentifier(s.DataDirID) alongside NetworkResourceIdentifier, ServiceUserRoleIdentifier, MCPConfigResourceIdentifier) so orchestration ordering is explicit; in Refresh() validate that s.DataDirID is present and that the DirResource read (DirResource) is available/valid, returning an error if DataDirID is empty or the DirResource is missing/invalid. Ensure references use the existing symbols ServiceInstanceSpecResource, Dependencies(), Refresh(), DataDirID, DirResourceIdentifier and DirResource.

server/internal/database/mcp_service_config.go (1)
84-447: Prefer typed package errors over free-form validation strings.

The parser emits many raw string errors, which makes consistent downstream classification/mapping harder. Consider package-level typed/domain errors with field and reason metadata.
As per coding guidelines "Domain-specific errors should be defined in each package; API errors should be mapped to HTTP status codes via Goa".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/database/mcp_service_config.go` around lines 84 - 447, Replace free-form fmt.Errorf validation strings with package-level typed/domain errors and a small structured error type so callers can inspect field and reason; introduce errors like ErrUnknownKey, ErrMissingField, ErrInvalidType, ErrInvalidValue, ErrDuplicateUsername (or a single ConfigError struct with Field and Reason) and return those from ParseMCPServiceConfig helper functions (validateUnknownKeys, requireString, requireStringForProvider, optionalString, optionalBool, optionalFloat64, optionalInt, parseInitUsers) instead of raw strings; update the returned []error values to wrap/contain the field name and a canonical error kind so downstream code can match on error identity or inspect Field/Reason metadata. Ensure existing call sites (e.g., ParseMCPServiceConfig) still accumulate []error but now receive the typed errors to allow mappings to HTTP codes elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@e2e/service_provisioning_test.go`:
- Around line 871-874: The test currently asserts db.ServiceInstances[0].State
== "running" immediately which can flake due to StoreServiceInstance upserting a
transient "creating" state; change the assertion to poll until the state becomes
"running" (keep the require.Len check) by using a retry helper such as
require.Eventually or a small loop with timeout/tick that reads
db.ServiceInstances and checks ServiceInstances[0].State == "running" to allow
the monitor to converge; reference db.ServiceInstances (and the upsert behavior
from StoreServiceInstance) when locating the assertion to replace.
In `@server/internal/database/mcp_service_config.go`:
- Around line 198-214: The validation currently only checks embedding_model and
embedding_api_key when embeddingProvider is present; add a guard to reject
orphan embedding fields when embeddingProvider is nil: if embeddingProvider ==
nil and (embeddingModel != nil || embeddingAPIKey != nil) append appropriate
errors to errs (e.g., "embedding_model must not be set without
embedding_provider" and "embedding_api_key must not be set without
embedding_provider"). Place this check alongside the existing embedding-provider
cross-validation (referencing embeddingProvider, embeddingModel,
embeddingAPIKey, errs and validEmbeddingProviders) so configs with orphan
embedding fields fail validation early.
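The orphan-field guard that prompt describes, as a standalone sketch (`validateEmbedding` is a hypothetical helper, not the file's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// validateEmbedding rejects "orphan" embedding fields: a model or API
// key set while the provider itself is absent.
func validateEmbedding(provider, model, apiKey *string) []error {
	var errs []error
	if provider == nil {
		if model != nil {
			errs = append(errs, errors.New("embedding_model must not be set without embedding_provider"))
		}
		if apiKey != nil {
			errs = append(errs, errors.New("embedding_api_key must not be set without embedding_provider"))
		}
	}
	return errs
}

func main() {
	model := "text-embedding-3-small"
	// Provider is nil, so the model field is an orphan and fails validation.
	fmt.Println(validateEmbedding(nil, &model, nil))
}
```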
In `@server/internal/orchestrator/swarm/mcp_config_resource.go`:
- Around line 99-159: The Create and Update methods in MCPConfigResource assume
r.Config is non-nil and later methods (writeConfigFile, writeTokenFileIfNeeded,
writeUserFileIfNeeded) dereference it; add a guard at the start of both Create
and Update after populateCredentials that returns a clear error if r.Config ==
nil (or skip token/user generation when Config is nil), ensuring writeConfigFile
and the token/user writers never receive a nil config; reference
MCPConfigResource, Create, Update, populateCredentials, writeConfigFile,
writeTokenFileIfNeeded, and writeUserFileIfNeeded when locating where to perform
this nil check.
- Around line 90-94: The refresh currently only checks config.yaml via
readResourceFile(dirPath, "config.yaml") so add validation for all managed files
(e.g., tokens.yaml and users.yaml) before returning success; update the logic
around readResourceFile to iterate a list of required filenames (at minimum
"config.yaml", "tokens.yaml", "users.yaml"), call readResourceFile for each
using the same dirPath, and return a wrapped error that includes the missing
filename when any read fails so the Refresh function fails fast on missing
managed files.
In `@server/internal/orchestrator/swarm/mcp_config_test.go`:
- Around line 11-13: Remove the three unused helper functions float64Ptr,
intPtrMCP, and boolPtrMCP from
server/internal/orchestrator/swarm/mcp_config_test.go to satisfy lint; locate
and delete the function declarations for float64Ptr(f float64) *float64,
intPtrMCP(i int) *int, and boolPtrMCP(b bool) *bool, and run tests/linter to
confirm nothing else references them before committing.
In `@server/internal/orchestrator/swarm/mcp_config.go`:
- Around line 95-97: GenerateMCPConfig currently dereferences params.Config
without checks; add input guards at the start of GenerateMCPConfig to verify
params != nil and params.Config != nil and return a clear typed error (not
panic) when invalid. Update callers/tests if needed to handle the returned
error; reference the function name GenerateMCPConfig and the field access
params.Config when locating the change.
In `@server/internal/orchestrator/swarm/service_spec.go`:
- Around line 87-90: Validate opts.DataPath before constructing mounts: if
opts.DataPath is empty, return a clear error immediately instead of calling
docker.BuildMount. Update the code around where mounts is built (the mounts
variable and call to docker.BuildMount) to perform a check like if opts.DataPath
== "" { return fmt.Errorf("data path required for service mount") } so the
function fails fast with a helpful message rather than allowing Docker to fail
later.
---
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
e2e/service_provisioning_test.go
server/internal/api/apiv1/convert.go
server/internal/api/apiv1/validate.go
server/internal/api/apiv1/validate_test.go
server/internal/database/mcp_service_config.go
server/internal/database/mcp_service_config_test.go
server/internal/orchestrator/swarm/mcp_auth_files.go
server/internal/orchestrator/swarm/mcp_auth_files_test.go
server/internal/orchestrator/swarm/mcp_config.go
server/internal/orchestrator/swarm/mcp_config_resource.go
server/internal/orchestrator/swarm/mcp_config_test.go
server/internal/orchestrator/swarm/orchestrator.go
server/internal/orchestrator/swarm/resources.go
server/internal/orchestrator/swarm/service_instance_spec.go
server/internal/orchestrator/swarm/service_spec.go
server/internal/orchestrator/swarm/service_spec_test.go
server/internal/orchestrator/swarm/service_user_role.go
server/internal/workflows/activities/generate_service_instance_resources.go
server/internal/workflows/plan_update.go
```go
postgres.Statement{SQL: fmt.Sprintf("GRANT SELECT ON ALL TABLES IN SCHEMA public TO %s;", sanitizeIdentifier(r.Username))},
postgres.Statement{SQL: fmt.Sprintf("ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO %s;", sanitizeIdentifier(r.Username))},
// Allow viewing PostgreSQL configuration via diagnostic tools
postgres.Statement{SQL: fmt.Sprintf("GRANT pg_read_all_settings TO %s;", sanitizeIdentifier(r.Username))},
```
Is this the only missing permission on the existing read_only role? We don't document those canned roles, and I don't know that anyone is actually using it, so I'm wondering if we should just add this permission to that role.
I'm not sure, but it feels like we're going to need service-specific roles, since each service will have its own baseline of required db permissions.
PLAT-444 PLAT-462