Add foundational webhook package (Phase 1 of dynamic webhook middleware)#3840
Add foundational webhook package (Phase 1 of dynamic webhook middleware)#3840Sanskarzz wants to merge 8 commits intostacklok:mainfrom
Conversation
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3840 +/- ##
==========================================
+ Coverage 68.77% 69.14% +0.36%
==========================================
Files 473 477 +4
Lines 47919 48074 +155
==========================================
+ Hits 32955 33239 +284
+ Misses 12299 12253 -46
+ Partials 2665 2582 -83 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
JAORMX
left a comment
There was a problem hiding this comment.
Hey @Sanskarzz, thanks for putting this together! The foundation looks solid -- clean package structure, good error hierarchy with proper Unwrap() chains, and the HMAC signing implementation is well done. Let's dig into a few things though.
The main blocker is the ValidatingTransport not being applied when TLS config is nil. That leaves the SSRF protections only active in the TLS branch, which doesn't match the #nosec comment's claim. The fix is straightforward (see inline comment), and we should also think about how to handle in-cluster HTTP webhooks (e.g., http://service.namespace.svc.cluster.local) where TLS isn't practical -- probably a configurable opt-in.
Other things to address:
t.Context()instead ofcontext.Background()in tests (project convention)classifyErrormissingcontext.DeadlineExceededhandlingwebhookTypestored but never read
Nice-to-haves / things to think about for follow-ups:
time.DurationJSON serialization (will matter when config integration lands)Principal.Claimsasmap[string]anyfor real-world JWT compatibility- HTTP 422 distinction per RFC failure mode table
- Replay protection docs on
VerifySignature
The scope here is a solid subset of Phase 1 -- types, client, signing, errors. Looking forward to seeing the middleware integration and config wiring in follow-up PRs!
|
|
||
| // buildTransport creates an http.RoundTripper with the specified TLS configuration, | ||
| // wrapped in a ValidatingTransport for security. | ||
| func buildTransport(tlsCfg *TLSConfig) (http.RoundTripper, error) { | ||
| transport := &http.Transport{ | ||
| TLSHandshakeTimeout: 10 * time.Second, | ||
| ResponseHeaderTimeout: 10 * time.Second, | ||
| MaxIdleConns: 100, | ||
| MaxIdleConnsPerHost: 10, |
There was a problem hiding this comment.
When tlsCfg == nil, this returns a bare http.Transport without the ValidatingTransport wrapper. That means the SSRF protections (private IP blocking, etc.) only kick in when TLS is explicitly configured.
The #nosec G704 on line 114 claims coverage via ValidatingTransport, but that's only true in the TLS branch.
The fix is to always wrap in ValidatingTransport:
func buildTransport(tlsCfg *TLSConfig) (http.RoundTripper, error) {
transport := &http.Transport{
TLSHandshakeTimeout: 10 * time.Second,
ResponseHeaderTimeout: 10 * time.Second,
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
IdleConnTimeout: 90 * time.Second,
}
if tlsCfg != nil {
// ... apply TLS config to transport ...
}
allowHTTP := tlsCfg != nil && tlsCfg.InsecureSkipVerify
return &networking.ValidatingTransport{
Transport: transport,
InsecureAllowHTTP: allowHTTP,
}, nil
}Note that the RFC says "TLS/HTTPS required," but in practice folks will want to hit http://my-service.my-namespace.svc.cluster.local for in-cluster webhooks where TLS isn't configured. So we probably want InsecureSkipVerify (or a new AllowHTTP field) to gate HTTP access explicitly, rather than a hard HTTPS requirement. That way the default is secure, but operators running in-cluster can opt in intentionally.
| return fmt.Errorf("webhook URL is invalid: %w", err) | ||
| } | ||
| if c.FailurePolicy != FailurePolicyFail && c.FailurePolicy != FailurePolicyIgnore { |
There was a problem hiding this comment.
Related to the transport comment: url.ParseRequestURI happily accepts http:// URLs. It'd be nice to validate the scheme here so operators get a clear error at config time rather than a confusing runtime rejection from the transport.
That said, we need to account for in-cluster use cases where HTTP to a cluster-local service is reasonable. So the check could be gated on whether InsecureSkipVerify (or a future AllowHTTP flag) is set:
parsed, err := url.ParseRequestURI(c.URL)
if err != nil {
return fmt.Errorf("webhook URL is invalid: %w", err)
}
if parsed.Scheme != "https" && (c.TLSConfig == nil || !c.TLSConfig.InsecureSkipVerify) {
return fmt.Errorf("webhook URL must use HTTPS (set insecure_skip_verify for development/in-cluster HTTP)")
}| TLSConfig *TLSConfig `json:"tls_config,omitempty"` | ||
| // HMACSecretRef is an optional reference to an HMAC secret for payload signing. | ||
| HMACSecretRef string `json:"hmac_secret_ref,omitempty"` | ||
| } |
There was a problem hiding this comment.
Note that time.Duration serializes to JSON as nanoseconds (5000000000 for 5s). The RFC shows timeout: 5s in YAML config, so a human-written duration string won't unmarshal into this field directly.
Not a blocker for this PR, but worth deciding on before the config integration lands. Options: custom type with UnmarshalJSON/UnmarshalYAML, a string field parsed with time.ParseDuration, or documenting that the config layer handles the conversion.
| config Config | ||
| hmacSecret []byte | ||
| webhookType Type | ||
| } |
There was a problem hiding this comment.
webhookType is stored but never read. There's no dispatch logic based on type -- callers just pick Call vs CallMutating directly. Either remove the field or, if the plan is to add a Send method that dispatches based on type, leave a TODO.
pkg/webhook/client.go
Outdated
| transport.TLSClientConfig = tlsConfig | ||
| return &networking.ValidatingTransport{ | ||
| Transport: transport, | ||
| InsecureAllowHTTP: false, | ||
| }, nil | ||
| } | ||
|
|
||
| // classifyError examines an HTTP client error and returns an appropriately |
There was a problem hiding this comment.
If a parent context is cancelled or its deadline exceeds, the error may be context.DeadlineExceeded or context.Canceled without wrapping a net.Error. Those would fall through to NetworkError when they're really timeouts/cancellations.
if errors.Is(err, context.DeadlineExceeded) {
return NewTimeoutError(webhookName, err)
}| } | ||
|
|
||
| // 5xx errors indicate webhook operational failures. | ||
| if resp.StatusCode >= http.StatusInternalServerError { |
There was a problem hiding this comment.
The RFC's failure mode table treats HTTP 422 specially -- it's "Deny (422)" regardless of failure policy. With the current code, 422 gets lumped with all other non-200/non-5xx as InvalidResponseError, so the middleware layer won't be able to distinguish "422 from webhook = always deny" from "400 bad request = follow failure policy."
Consider adding the HTTP status code to the error (e.g., a StatusCode field on InvalidResponseError), or a separate error path for status codes that should bypass failure policy.
| Sub string `json:"sub"` | ||
| // Email is the user's email address. | ||
| Email string `json:"email,omitempty"` | ||
| // Name is the user's display name. |
There was a problem hiding this comment.
JWT claims can carry non-string values (arrays, numbers, booleans). map[string]string works for the RFC examples but might be too restrictive for real-world JWTs. Consider map[string]any to avoid lossy conversion when wiring this up to actual JWT claims downstream.
| // and payload. The signature should be in the format "sha256=<hex-encoded-signature>". | ||
| // Comparison is done in constant time to prevent timing attacks. | ||
| func VerifySignature(secret []byte, timestamp int64, payload []byte, signature string) bool { |
There was a problem hiding this comment.
Since this is exported, webhook server authors will likely reach for it. Worth adding a doc note that callers are responsible for checking timestamp freshness (e.g., reject if older than 5 minutes) to prevent replay attacks.
// Callers should independently verify that the timestamp is recent
// (e.g., within 5 minutes) to prevent replay attacks.| name: "non-200 non-5xx response", | ||
| serverHandler: func(w http.ResponseWriter, _ *http.Request) { | ||
| w.WriteHeader(http.StatusBadRequest) | ||
| w.Write([]byte("bad request")) |
There was a problem hiding this comment.
Project convention: tests should use t.Context() instead of context.Background(). This applies to all the client.Call(context.Background(), req) calls throughout this file (~10 instances). t.Context() gets automatically cancelled when the test ends, which helps catch goroutine leaks.
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
Fix: #3396
Summary
Implements Phase 1 of RFC THV-0017, creating the core pkg/webhook/ package with types, HTTP client, HMAC signing, and error handling.
ToolHive currently has no mechanism for external policy enforcement or request transformation. This PR lays the foundation for the dynamic webhook middleware system that will allow operators to configure validating and mutating webhooks for MCP request interception.
Changes
Design decisions
Testing