Skip to content

Add foundational webhook package (Phase 1 of dynamic webhook middleware)#3840

Open
Sanskarzz wants to merge 8 commits intostacklok:mainfrom
Sanskarzz:dynamicwebhook1
Open

Add foundational webhook package (Phase 1 of dynamic webhook middleware)#3840
Sanskarzz wants to merge 8 commits intostacklok:mainfrom
Sanskarzz:dynamicwebhook1

Conversation

@Sanskarzz
Copy link
Contributor

@Sanskarzz Sanskarzz commented Feb 16, 2026

Fix: #3396

Summary

Implements Phase 1 of RFC THV-0017, creating the core pkg/webhook/ package with types, HTTP client, HMAC signing, and error handling.

ToolHive currently has no mechanism for external policy enforcement or request transformation. This PR lays the foundation for the dynamic webhook middleware system that will allow operators to configure validating and mutating webhooks for MCP request interception.

Changes

File Description
types.go Core types: Type, FailurePolicy, Config (with validation), Request/Response/MutatingResponse, TLSConfig, Principal, RequestContext
errors.go Error hierarchy: WebhookError base type, TimeoutError, NetworkError, InvalidResponseError — all with Unwrap() for errors.Is/errors.As
signing.go HMAC-SHA256 payload signing (SignPayload) and constant-time verification (VerifySignature)
client.go HTTP client with TLS/mTLS support, configurable timeouts, HMAC header injection, response size limiting (1MB), and error classification
types_test.go Config validation tests (12 cases)
errors_test.go Error type, message, and unwrap chain tests
signing_test.go Signing round-trip, tamper detection, malformed signature, and determinism tests
client_test.go HTTP client tests using httptest: validating/mutating responses, HMAC headers, timeouts, response size limits, content type

Design decisions

  • No stutter naming: Types are webhook.Config, webhook.Type, etc. per project lint rules (not webhook.WebhookConfig)
  • Follows existing patterns: Error types follow pkg/config/errors.go (struct with Unwrap()), HTTP client follows pkg/networking/http_client.go (transport setup, TLS config, timeouts)
  • Uses errors.As for error type inspection (not type assertions), matching project convention
  • Signature format: sha256=hex(HMAC-SHA256(secret, "timestamp.payload")) with X-ToolHive-Signature and X-ToolHive-Timestamp headers per RFC spec

Testing

  • 22 tests pass with -race flag
  • 0 lint issues from golangci-lint

Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
@Sanskarzz Sanskarzz requested a review from JAORMX as a code owner February 16, 2026 19:35
@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Feb 16, 2026
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@codecov
Copy link

codecov bot commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 92.55319% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.14%. Comparing base (0108b11) to head (c219ecb).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/webhook/client.go 91.22% 6 Missing and 4 partials ⚠️
pkg/webhook/errors.go 93.54% 1 Missing and 1 partial ⚠️
pkg/webhook/types.go 92.30% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3840      +/-   ##
==========================================
+ Coverage   68.77%   69.14%   +0.36%     
==========================================
  Files         473      477       +4     
  Lines       47919    48074     +155     
==========================================
+ Hits        32955    33239     +284     
+ Misses      12299    12253      -46     
+ Partials     2665     2582      -83     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Feb 17, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Feb 17, 2026
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Feb 17, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Feb 18, 2026
Copy link
Collaborator

@JAORMX JAORMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Sanskarzz, thanks for putting this together! The foundation looks solid -- clean package structure, good error hierarchy with proper Unwrap() chains, and the HMAC signing implementation is well done. Let's dig into a few things though.

The main blocker is the ValidatingTransport not being applied when TLS config is nil. That leaves the SSRF protections only active in the TLS branch, which doesn't match the #nosec comment's claim. The fix is straightforward (see inline comment), and we should also think about how to handle in-cluster HTTP webhooks (e.g., http://service.namespace.svc.cluster.local) where TLS isn't practical -- probably a configurable opt-in.

Other things to address:

  • t.Context() instead of context.Background() in tests (project convention)
  • classifyError missing context.DeadlineExceeded handling
  • webhookType stored but never read

Nice-to-haves / things to think about for follow-ups:

  • time.Duration JSON serialization (will matter when config integration lands)
  • Principal.Claims as map[string]any for real-world JWT compatibility
  • HTTP 422 distinction per RFC failure mode table
  • Replay protection docs on VerifySignature

The scope here is a solid subset of Phase 1 -- types, client, signing, errors. Looking forward to seeing the middleware integration and config wiring in follow-up PRs!

Comment on lines +152 to +160

// buildTransport creates an http.RoundTripper with the specified TLS configuration,
// wrapped in a ValidatingTransport for security.
func buildTransport(tlsCfg *TLSConfig) (http.RoundTripper, error) {
transport := &http.Transport{
TLSHandshakeTimeout: 10 * time.Second,
ResponseHeaderTimeout: 10 * time.Second,
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When tlsCfg == nil, this returns a bare http.Transport without the ValidatingTransport wrapper. That means the SSRF protections (private IP blocking, etc.) only kick in when TLS is explicitly configured.

The #nosec G704 on line 114 claims coverage via ValidatingTransport, but that's only true in the TLS branch.

The fix is to always wrap in ValidatingTransport:

func buildTransport(tlsCfg *TLSConfig) (http.RoundTripper, error) {
	transport := &http.Transport{
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 10 * time.Second,
		MaxIdleConns:          100,
		MaxIdleConnsPerHost:   10,
		IdleConnTimeout:       90 * time.Second,
	}

	if tlsCfg != nil {
		// ... apply TLS config to transport ...
	}

	allowHTTP := tlsCfg != nil && tlsCfg.InsecureSkipVerify
	return &networking.ValidatingTransport{
		Transport:         transport,
		InsecureAllowHTTP: allowHTTP,
	}, nil
}

Note that the RFC says "TLS/HTTPS required," but in practice folks will want to hit http://my-service.my-namespace.svc.cluster.local for in-cluster webhooks where TLS isn't configured. So we probably want InsecureSkipVerify (or a new AllowHTTP field) to gate HTTP access explicitly, rather than a hard HTTPS requirement. That way the default is secure, but operators running in-cluster can opt in intentionally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +85 to +87
return fmt.Errorf("webhook URL is invalid: %w", err)
}
if c.FailurePolicy != FailurePolicyFail && c.FailurePolicy != FailurePolicyIgnore {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the transport comment: url.ParseRequestURI happily accepts http:// URLs. It'd be nice to validate the scheme here so operators get a clear error at config time rather than a confusing runtime rejection from the transport.

That said, we need to account for in-cluster use cases where HTTP to a cluster-local service is reasonable. So the check could be gated on whether InsecureSkipVerify (or a future AllowHTTP flag) is set:

parsed, err := url.ParseRequestURI(c.URL)
if err != nil {
    return fmt.Errorf("webhook URL is invalid: %w", err)
}
if parsed.Scheme != "https" && (c.TLSConfig == nil || !c.TLSConfig.InsecureSkipVerify) {
    return fmt.Errorf("webhook URL must use HTTPS (set insecure_skip_verify for development/in-cluster HTTP)")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

TLSConfig *TLSConfig `json:"tls_config,omitempty"`
// HMACSecretRef is an optional reference to an HMAC secret for payload signing.
HMACSecretRef string `json:"hmac_secret_ref,omitempty"`
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that time.Duration serializes to JSON as nanoseconds (5000000000 for 5s). The RFC shows timeout: 5s in YAML config, so a human-written duration string won't unmarshal into this field directly.

Not a blocker for this PR, but worth deciding on before the config integration lands. Options: custom type with UnmarshalJSON/UnmarshalYAML, a string field parsed with time.ParseDuration, or documenting that the config layer handles the conversion.

config Config
hmacSecret []byte
webhookType Type
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

webhookType is stored but never read. There's no dispatch logic based on type -- callers just pick Call vs CallMutating directly. Either remove the field or, if the plan is to add a Send method that dispatches based on type, leave a TODO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added todo.

Comment on lines +199 to +206
transport.TLSClientConfig = tlsConfig
return &networking.ValidatingTransport{
Transport: transport,
InsecureAllowHTTP: false,
}, nil
}

// classifyError examines an HTTP client error and returns an appropriately
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a parent context is cancelled or its deadline exceeds, the error may be context.DeadlineExceeded or context.Canceled without wrapping a net.Error. Those would fall through to NetworkError when they're really timeouts/cancellations.

if errors.Is(err, context.DeadlineExceeded) {
    return NewTimeoutError(webhookName, err)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Comment on lines +136 to +139
}

// 5xx errors indicate webhook operational failures.
if resp.StatusCode >= http.StatusInternalServerError {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RFC's failure mode table treats HTTP 422 specially -- it's "Deny (422)" regardless of failure policy. With the current code, 422 gets lumped with all other non-200/non-5xx as InvalidResponseError, so the middleware layer won't be able to distinguish "422 from webhook = always deny" from "400 bad request = follow failure policy."

Consider adding the HTTP status code to the error (e.g., a StatusCode field on InvalidResponseError), or a separate error path for status codes that should bypass failure policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Sub string `json:"sub"`
// Email is the user's email address.
Email string `json:"email,omitempty"`
// Name is the user's display name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JWT claims can carry non-string values (arrays, numbers, booleans). map[string]string works for the RFC examples but might be too restrictive for real-world JWTs. Consider map[string]any to avoid lossy conversion when wiring this up to actual JWT claims downstream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Comment on lines +38 to +40
// and payload. The signature should be in the format "sha256=<hex-encoded-signature>".
// Comparison is done in constant time to prevent timing attacks.
func VerifySignature(secret []byte, timestamp int64, payload []byte, signature string) bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is exported, webhook server authors will likely reach for it. Worth adding a doc note that callers are responsible for checking timestamp freshness (e.g., reject if older than 5 minutes) to prevent replay attacks.

// Callers should independently verify that the timestamp is recent
// (e.g., within 5 minutes) to prevent replay attacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

name: "non-200 non-5xx response",
serverHandler: func(w http.ResponseWriter, _ *http.Request) {
w.WriteHeader(http.StatusBadRequest)
w.Write([]byte("bad request"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Project convention: tests should use t.Context() instead of context.Background(). This applies to all the client.Call(context.Background(), req) calls throughout this file (~10 instances). t.Context() gets automatically cancelled when the test ends, which helps catch goroutine leaks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 18, 2026
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 22, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Webhook Middleware Phase 1: Core webhook package (types, HTTP client, HMAC signing)

2 participants