
Add distributed tracing documentation#37

Open
bramwelt wants to merge 2 commits into main from feature/LFXV2-613-tracing

Conversation

@bramwelt
Contributor

This pull request adds documentation for distributed tracing to the lfx_one/tracing.md file. The update introduces a new guide for setting up OpenTelemetry tracing with Datadog for Go services.

Document OpenTelemetry tracing setup for Go services on the LFX
Kubernetes platform with Datadog Agent integration, including
architecture diagrams, environment variable configuration, sampling
ratios per environment, trace/log correlation using slog-otel handler
and manual injection fallback.

Additionally document HTTP client/server instrumentation via otelhttp
and database tracing via otelsql, span/attribute naming conventions,
Jaeger local access, Datadog APM access, common query patterns for both
UIs, cluster agent architecture, and a troubleshooting section for
missing spans, sampling, and log correlation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Issue: LFXV2-613
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
Copilot AI review requested due to automatic review settings March 11, 2026 16:58
@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

Walkthrough

Adds a "Distributed Tracing" entry to the README and introduces a new, comprehensive lfx_one/tracing.md documentation file describing OpenTelemetry-based distributed tracing setup, examples, conventions, and troubleshooting for Go services.

Changes

| Cohort / File(s) | Summary |
|---|---|
| README Update — `lfx_one/README.md` | Inserted a new "Distributed Tracing" documentation link under the Documentation section. |
| New Tracing Documentation — `lfx_one/tracing.md` | Added a full guide on distributed tracing with OpenTelemetry: architecture, environment variables, Go SDK setup, context propagation (traceparent, baggage, Jaeger), sampling, trace/log correlation, instrumentation examples for HTTP and DB, access/query patterns for Jaeger/Datadog, troubleshooting, and diagrams. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: adding distributed tracing documentation. It directly matches the core objective of introducing a new tracing.md documentation file. |
| Description check | ✅ Passed | The pull request description is directly related to the changeset, explaining that it adds distributed tracing documentation for OpenTelemetry setup with Datadog for Go services, which aligns with the actual changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new documentation guide describing how LFX Go services should set up distributed tracing using OpenTelemetry with Datadog, and links it from the LFX One docs index.

Changes:

  • Introduces lfx_one/tracing.md with architecture, env var configuration, Go examples, naming conventions, and troubleshooting steps for OTEL→Datadog tracing.
  • Updates lfx_one/README.md to include the new Distributed Tracing doc.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `lfx_one/tracing.md` | New end-to-end guide for OTEL tracing with Datadog, including examples and operational guidance. |
| `lfx_one/README.md` | Adds a navigation link to the new tracing documentation. |


Comment on lines +80 to +88
| Variable | Description | Example |
|---|---|---|
| `OTEL_SERVICE_NAME` | Identifies the service in Datadog | `query-service` |
| `OTEL_SERVICE_VERSION` | Service version shown in traces | `1.2.0` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Datadog Agent OTLP endpoint | `http://$(HOST_IP):4317` |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | Transport protocol | `grpc` |
| `OTEL_PROPAGATORS` | Context propagation formats | `tracecontext,baggage,jaeger` |
| `OTEL_TRACES_SAMPLER` | Sampling strategy | `parentbased_traceidratio` |
| `OTEL_TRACES_SAMPLER_ARG` | Sampling ratio (0.0 - 1.0) | `0.5` |

Copilot AI Mar 11, 2026


Markdown tables in this doc are written with double leading pipes (e.g. || Variable | ... |). GitHub-flavored Markdown treats this as normal text, so the tables won’t render. Use single pipes (| Variable | ... |) and update the other tables in this file that use the same pattern.
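If you want to catch this pattern mechanically before committing, a small Go helper can normalize a doubled leading pipe. This is a hypothetical illustration, not part of the PR under review:

```go
package main

import (
	"fmt"
	"strings"
)

// fixTablePipes normalizes a markdown table row that mistakenly starts
// with a doubled leading pipe ("|| Variable | ... |"), which GFM treats
// as plain text, to a single leading pipe.
func fixTablePipes(line string) string {
	trimmed := strings.TrimLeft(line, " \t")
	if strings.HasPrefix(trimmed, "||") {
		return strings.Replace(trimmed, "||", "|", 1)
	}
	return line
}

func main() {
	fmt.Println(fixTablePipes("|| Variable | Description | Example |"))
	// prints "| Variable | Description | Example |"
}
```

Running a check like this over every line starting with `|` would flag all the affected tables at once.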

Comment on lines +171 to +176
```go
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
    propagation.TraceContext{},
    propagation.Baggage{},
    jaeger.Jaeger{},
))
```

Copilot AI Mar 11, 2026


The sample explicitly sets the global propagator in code via otel.SetTextMapPropagator(...), which overrides any OTEL_PROPAGATORS environment configuration discussed earlier. Either remove the code-level propagator setup (and rely on OTEL_PROPAGATORS), or call out clearly that this is an alternative to using the env var (not in addition to it).
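For readers unfamiliar with what the `tracecontext` propagator actually moves between services, here is a minimal stdlib-only sketch of parsing the W3C `traceparent` header it injects and extracts. `parseTraceparent` is an illustrative helper, not an OpenTelemetry API:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header of the form
// "00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>" into its parts.
// A real propagator also validates the hex characters; this sketch
// checks only the shape.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], parts[2], parts[3] == "01", nil
}

func main() {
	tid, sid, sampled, err := parseTraceparent(
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tid, sid, sampled, err)
}
```

Whether the propagator is configured via `OTEL_PROPAGATORS` or via `otel.SetTextMapPropagator`, this header format is the same; the review's point is that only one of the two configuration paths takes effect.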

Comment on lines +343 to +346
```go
// Record connection pool metrics as span events
otelsql.RegisterDBStatsMetrics(db,
    otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
)
```

Copilot AI Mar 11, 2026


otelsql.RegisterDBStatsMetrics produces OpenTelemetry metrics instruments for DB stats; it does not “record connection pool metrics as span events”. This also conflicts with the earlier statement that only traces are collected via OTEL. Please adjust the wording (or remove this metrics call if OTEL metrics aren’t intended in this setup).

Suggested change (the suggestion deletes these lines):

```go
// Record connection pool metrics as span events
otelsql.RegisterDBStatsMetrics(db,
    otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
)
```

Comment on lines +518 to +521
```bash
kubectl run -it --rm debug --image=alpine --restart=Never -- \
wget -qO- http://<HOST_IP>:4317
```

Copilot AI Mar 11, 2026


The troubleshooting connectivity check uses wget http://<HOST_IP>:4317, but 4317 is an OTLP gRPC port and won’t respond to an HTTP GET, so this can fail even when the port is reachable. Use a TCP-level check (e.g., nc -vz <HOST_IP> 4317) or a gRPC-aware tool (e.g., grpcurl) instead.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lfx_one/tracing.md`:
- Around line 343-346: Update the misleading comment and handle the return
values from otelsql.RegisterDBStatsMetrics: change the comment to say it
"registers connection pool metrics" (not "records ... as span events"), capture
the returned metric.Registration and error (e.g. reg, err :=
otelsql.RegisterDBStatsMetrics(...)), check and handle the error, and store the
registration handle so you can call reg.Unregister() during DB cleanup/shutdown
to avoid leaking the registration.
- Around line 247-254: In logWithTrace, guard reading IDs by checking the span
context validity: call trace.SpanFromContext(ctx) to get the SpanContext (sc)
and use sc.IsValid() before constructing slog.String("trace_id", ...) and
slog.String("span_id", ...); if sc.IsValid() is false, call
slog.InfoContext(ctx, msg, args...) without adding trace/span attributes,
otherwise append the trace/span attrs to args and call slog.InfoContext(ctx,
msg, appendedArgs...). This uses trace.SpanFromContext, SpanContext.IsValid, and
slog.InfoContext to avoid emitting zero-value trace_id/span_id for untraced
contexts.
- Around line 515-524: Replace the unreliable `wget` check against the gRPC OTLP
endpoint with a plain TCP connectivity test: update the debug pod command that
currently runs `wget -qO- http://<HOST_IP>:4317` to instead run `nc -vz
<HOST_IP> 4317` (or `nc -zv` depending on the image) so you only verify TCP
reachability to port 4317; keep references to `OTEL_EXPORTER_OTLP_ENDPOINT`,
`HOST_IP`, and port `4317` so the docs explain that this verifies the Datadog
Agent OTLP gRPC port is reachable (ECONNREFUSED still indicates OTLP ingestion
is disabled).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: edf56dc0-e420-4c12-9b36-97890111a931

📥 Commits

Reviewing files that changed from the base of the PR and between 1e54346 and 87f0620.

📒 Files selected for processing (2)
  • lfx_one/README.md
  • lfx_one/tracing.md

Comment on lines +247 to +254
```go
func logWithTrace(ctx context.Context, msg string, args ...any) {
    sc := trace.SpanFromContext(ctx).SpanContext()
    attrs := []any{
        slog.String("trace_id", sc.TraceID().String()),
        slog.String("span_id", sc.SpanID().String()),
    }
    slog.InfoContext(ctx, msg, append(attrs, args...)...)
}
```
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate the file and verify the code snippet
find . -name "tracing.md" -type f

Repository: linuxfoundation/lfx-engineering

Length of output: 95


🏁 Script executed:

# Check OpenTelemetry Go SDK version and SpanContext.IsValid() availability
# Search for any go.mod or dependency info
fd -e go.mod | head -5

Repository: linuxfoundation/lfx-engineering

Length of output: 57


🌐 Web query:

OpenTelemetry Go SDK SpanContext.IsValid() method documentation

💡 Result:

In the OpenTelemetry Go SDK/API, SpanContext.IsValid() is documented on go.opentelemetry.io/otel/trace as: (pkg.go.dev)

func (sc SpanContext) IsValid() bool

IsValid reports whether the SpanContext is valid; a valid SpanContext has a valid TraceID and SpanID. (pkg.go.dev)



🌐 Web query:

go.opentelemetry.io/otel/trace SpanFromContext no-op span behavior

💡 Result:

In the OpenTelemetry Go API, trace.SpanFromContext(ctx):

  • Returns the current Span stored in ctx.
  • If ctx has no span, it returns a non-recording “no-op” Span (i.e., a Span implementation that “performs no operations”). Its methods like SetAttributes, AddEvent, RecordError, SetStatus, End, etc. are safe to call but have no effect. [1][2]

What that “no-op” span looks like:

  • It is non-recording (IsRecording() == false), so any data you try to add is discarded (effectively a no-op). [2]
  • If there is no span in the context, the API behavior is to return an empty non-recording span (with an “all-zero / invalid” SpanContext: zero trace/span IDs, empty tracestate, unsampled flags). [2]

Related nuance (propagation / “SDK not installed” case):

  • Even when the overall API is acting as no-op (e.g., no SDK installed), OpenTelemetry requires that a non-recording span still carries and propagates any existing SpanContext from the parent context; but it will not create new trace IDs in that mode. [2]

Sources:
[1] go.opentelemetry.io/otel/trace package docs for SpanFromContext (pkg.go.dev) (pkg.go.dev)
[2] OpenTelemetry Trace API spec: non-recording span / empty span when none in context; non-recording discards data (opentelemetry.io)



🏁 Script executed:

# Read the specific lines of the file to verify the code snippet
sed -n '247,254p' ./lfx_one/tracing.md

Repository: linuxfoundation/lfx-engineering

Length of output: 387


🏁 Script executed:

# Get more context around those lines to understand the section
sed -n '240,260p' ./lfx_one/tracing.md

Repository: linuxfoundation/lfx-engineering

Length of output: 651


Guard the manual fallback when no span is active.

trace.SpanFromContext(ctx) returns a no-op span when the context has no current span, and SpanContext.IsValid() is the supported check before reading IDs. As written, this helper can emit zero-value trace_id/span_id fields for untraced log calls, which makes correlation noisy.

Add a validity check to skip instrumentation when the span context is invalid:

Suggested fix

```diff
 func logWithTrace(ctx context.Context, msg string, args ...any) {
     sc := trace.SpanFromContext(ctx).SpanContext()
+    if !sc.IsValid() {
+        slog.InfoContext(ctx, msg, args...)
+        return
+    }
     attrs := []any{
         slog.String("trace_id", sc.TraceID().String()),
         slog.String("span_id", sc.SpanID().String()),
     }
     slog.InfoContext(ctx, msg, append(attrs, args...)...)
 }
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```go
func logWithTrace(ctx context.Context, msg string, args ...any) {
    sc := trace.SpanFromContext(ctx).SpanContext()
    if !sc.IsValid() {
        slog.InfoContext(ctx, msg, args...)
        return
    }
    attrs := []any{
        slog.String("trace_id", sc.TraceID().String()),
        slog.String("span_id", sc.SpanID().String()),
    }
    slog.InfoContext(ctx, msg, append(attrs, args...)...)
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 247 - 254, In logWithTrace, guard reading
IDs by checking the span context validity: call trace.SpanFromContext(ctx) to
get the SpanContext (sc) and use sc.IsValid() before constructing
slog.String("trace_id", ...) and slog.String("span_id", ...); if sc.IsValid() is
false, call slog.InfoContext(ctx, msg, args...) without adding trace/span
attributes, otherwise append the trace/span attrs to args and call
slog.InfoContext(ctx, msg, appendedArgs...). This uses trace.SpanFromContext,
SpanContext.IsValid, and slog.InfoContext to avoid emitting zero-value
trace_id/span_id for untraced contexts.

Comment on lines +343 to +346
```go
// Record connection pool metrics as span events
otelsql.RegisterDBStatsMetrics(db,
    otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
)
```
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /root && find . -name "tracing.md" -type f | head -20

Repository: linuxfoundation/lfx-engineering

Length of output: 130


🏁 Script executed:

cat -n lfx_one/tracing.md | sed -n '340,350p'

Repository: linuxfoundation/lfx-engineering

Length of output: 387


🌐 Web query:

otelsql RegisterDBStatsMetrics return values documentation

💡 Result:

In github.com/XSAM/otelsql, RegisterDBStatsMetrics has this signature:

func RegisterDBStatsMetrics(db *sql.DB, opts ...Option) (metric.Registration, error)

Return values:

  • metric.Registration: a handle for the metrics callback registration; call reg.Unregister() when the *sql.DB is no longer used. The README shows this exact usage pattern. [1][2]
  • error: non-nil if otelsql fails to create the DBStats metric instruments or fails to register the callback with the Meter (in those cases, the function returns nil, err). [2]

Sources: pkg.go.dev API doc [1] and the implementation in sql.go [2].


Fix the comment and handle RegisterDBStatsMetrics return values.

Line 343's comment claims the code "records connection pool metrics as span events" which is misleading—RegisterDBStatsMetrics registers metrics, not span events. Additionally, the call drops the returned metric.Registration and error; the registration handle must be retained to call Unregister() for cleanup when the database connection is no longer needed, and the error should be checked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 343 - 346, Update the misleading comment and
handle the return values from otelsql.RegisterDBStatsMetrics: change the comment
to say it "registers connection pool metrics" (not "records ... as span
events"), capture the returned metric.Registration and error (e.g. reg, err :=
otelsql.RegisterDBStatsMetrics(...)), check and handle the error, and store the
registration handle so you can call reg.Unregister() during DB cleanup/shutdown
to avoid leaking the registration.
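The register-then-unregister pattern the comment asks for has roughly this shape. The `registration` type below is a stand-in for the `metric.Registration` handle that `otelsql.RegisterDBStatsMetrics` returns (the real type lives in `go.opentelemetry.io/otel/metric`), so the types here are illustrative only:

```go
package main

import (
	"errors"
	"fmt"
)

// registration is a stand-in for otelsql's returned metric.Registration
// handle. The point is the shape of the call: capture the handle and
// the error, and unregister the callback on shutdown.
type registration struct{ active *bool }

func (r registration) Unregister() error { *r.active = false; return nil }

// registerDBStats mimics RegisterDBStatsMetrics' two return values.
func registerDBStats(ok bool) (registration, error) {
	if !ok {
		return registration{}, errors.New("failed to create DB stats instruments")
	}
	active := true
	return registration{active: &active}, nil
}

func main() {
	reg, err := registerDBStats(true) // check the error, as the review asks
	if err != nil {
		panic(err)
	}
	defer reg.Unregister() // clean up the callback during DB shutdown
	fmt.Println("db stats metrics registered")
}
```

In real code the `defer reg.Unregister()` would typically live wherever the `*sql.DB` itself is closed, so the metrics callback does not outlive the pool it observes.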

Comment on lines +515 to +524
Verify `OTEL_EXPORTER_OTLP_ENDPOINT` is set correctly and that `HOST_IP` is
resolving. Run a debug pod on the same node and test connectivity:

```bash
kubectl run -it --rm debug --image=alpine --restart=Never -- \
wget -qO- http://<HOST_IP>:4317
```

The Datadog Agent OTLP gRPC port should be reachable. An ECONNREFUSED error
means OTLP ingestion is not enabled on the agent.
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "tracing.md" -type f

Repository: linuxfoundation/lfx-engineering

Length of output: 95


🏁 Script executed:

if [ -f "lfx_one/tracing.md" ]; then
  echo "=== File exists, checking lines 515-524 ==="
  sed -n '515,524p' "lfx_one/tracing.md"
  echo ""
  echo "=== Context around lines 515-524 ==="
  sed -n '510,530p' "lfx_one/tracing.md"
fi

Repository: linuxfoundation/lfx-engineering

Length of output: 1175


🏁 Script executed:

if [ -f "lfx_one/tracing.md" ]; then
  wc -l "lfx_one/tracing.md"
  echo ""
  echo "=== Searching for OTLP references ==="
  rg -n "OTLP|4317" "lfx_one/tracing.md" | head -20
fi

Repository: linuxfoundation/lfx-engineering

Length of output: 1201


🌐 Web query:

OTLP gRPC port 4317 health check test connectivity

💡 Result:

  • Know what you’re testing: 4317 is the default OTLP/gRPC receiver port (Collector default is localhost:4317, and newer “hardened” defaults bind to localhost unless you set 0.0.0.0). [1][2]

Quick “is the port reachable?” (TCP connectivity)

nc -vz <host> 4317
# or
telnet <host> 4317

This only proves something is listening and networking/firewall is OK.

gRPC-level connectivity (handshake)

If the endpoint exposes gRPC Health Checking Protocol

Use grpc_health_probe:

grpc_health_probe -addr=<host>:4317
# add TLS flags as needed (see tool docs)

grpc_health_probe works only if the server implements the standard gRPC health service. [3]

If the endpoint exposes server reflection

Use grpcurl:

grpcurl -plaintext <host>:4317 list

Or describe OTLP services (example pattern):

grpcurl -plaintext <host>:4317 describe opentelemetry.proto.collector.trace.v1

This requires reflection to be enabled on the server; otherwise grpcurl can’t “list/describe” without you supplying protos. [4]

Practical “health check” for an OpenTelemetry Collector

The Collector’s typical health endpoint is not on 4317; it’s provided by the health_check extension, which exposes an HTTP endpoint (commonly 0.0.0.0:13133). [5]

Example check:

curl -sf http://<host>:13133/ || echo "unhealthy"

Sources: [1] OTLP receiver defaults (localhost:4317) • [2] Collector hardening/default bind addresses • [3] grpc-health-probe tool (gRPC Health Checking Protocol) • [4] grpcurl usage example against OTLP • [5] Collector health_check extension endpoint (0.0.0.0:13133)


🌐 Web query:

gRPC health check wget vs nc tcp probe

💡 Result:

  • nc TCP probe: Only verifies that something is listening on the port (a TCP handshake succeeds). It cannot speak HTTP/2 or send a gRPC request, so it can’t validate gRPC service health—just basic connectivity. (kubernetes.io)
  • wget probe: wget is for HTTP(S) (typically HTTP/1.1). gRPC runs over HTTP/2 with gRPC framing, so wget generally cannot perform a real gRPC health check (unless your app also exposes a separate HTTP health endpoint). (grpc.io)
  • Actual gRPC health check: Implement and query the standard gRPC Health Checking Protocol (grpc.health.v1.Health/Check or Watch). Use purpose-built tools like grpc_health_probe (common in Kubernetes) or grpcurl to call the health service. (grpc.io)



🌐 Web query:

Datadog OTLP gRPC ingestion agent health check

💡 Result:

Datadog’s OTLP gRPC ingestion in the Datadog Agent is just a listener (typically port 4317) once OTLP ingestion is enabled. (docs.datadoghq.com)

There isn’t a documented “OTLP gRPC health check” RPC (like grpc.health.v1.Health/Check) for the Agent’s OTLP receiver. If you need a health signal for “can I ingest OTLP right now?”, the usual approach is:

  1. Check the Agent health endpoint (readiness/liveness)
  • The Datadog Helm chart configures the Agent container to expose HTTP probes on a dedicated health port (default 5555) with:

Example:

curl -fsS http://<agent-ip>:5555/ready
  1. Check that the OTLP gRPC port is reachable
  • Simple TCP connect check to 4317 (or whatever you configured):
nc -zv <agent-ip> 4317

If you specifically want a gRPC Health Checking Protocol-style probe, Datadog provides a check that can monitor endpoints that implement that protocol—but the service being checked must actually expose grpc.health.v1. (docs.datadoghq.com)



Replace wget with a TCP connectivity check for the gRPC port.

Using wget against the OTLP/gRPC endpoint on port 4317 is unreliable because gRPC runs over HTTP/2 with gRPC framing, not plain HTTP. The wget command will likely fail even when the agent is healthy and the port is reachable, creating false negatives that could mislead troubleshooting. Use nc -vz instead to verify basic TCP connectivity to the gRPC port.

Suggested fix

```diff
 kubectl run -it --rm debug --image=alpine --restart=Never -- \
-  wget -qO- http://<HOST_IP>:4317
+  nc -vz <HOST_IP> 4317
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 515 - 524, Replace the unreliable `wget`
check against the gRPC OTLP endpoint with a plain TCP connectivity test: update
the debug pod command that currently runs `wget -qO- http://<HOST_IP>:4317` to
instead run `nc -vz <HOST_IP> 4317` (or `nc -zv` depending on the image) so you
only verify TCP reachability to port 4317; keep references to
`OTEL_EXPORTER_OTLP_ENDPOINT`, `HOST_IP`, and port `4317` so the docs explain
that this verifies the Datadog Agent OTLP gRPC port is reachable (ECONNREFUSED
still indicates OTLP ingestion is disabled).

Align all table columns, clarify semconv usage for additional
span attributes with typed constructors and key constants, and
add a reference table of common semconv attributes by domain.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Issue: LFXV2-613
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
lfx_one/tracing.md (3)

249-255: ⚠️ Potential issue | 🟡 Minor

Guard manual log correlation when no span is active.

Without an sc.IsValid() check, this helper emits zero-value trace_id/span_id fields for untraced contexts.

Suggested fix

```diff
 func logWithTrace(ctx context.Context, msg string, args ...any) {
     sc := trace.SpanFromContext(ctx).SpanContext()
+    if !sc.IsValid() {
+        slog.InfoContext(ctx, msg, args...)
+        return
+    }
     attrs := []any{
         slog.String("trace_id", sc.TraceID().String()),
         slog.String("span_id", sc.SpanID().String()),
     }
     slog.InfoContext(ctx, msg, append(attrs, args...)...)
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 249 - 255, The helper logWithTrace currently
always reads trace_id/span_id from the span context and will emit zero-value IDs
for non-traced contexts; update logWithTrace to call
trace.SpanFromContext(ctx).SpanContext(), check sc.IsValid() and only append
slog.String("trace_id", ...) and slog.String("span_id", ...) to attrs when
sc.IsValid() is true so untraced contexts do not get zero-value trace/span
fields.

345-348: ⚠️ Potential issue | 🟠 Major

This snippet mixes a traces-only guide with OTel metrics registration.

Earlier the doc says LFX only collects traces via OTEL, but this example registers DB stats metrics and never shows a MeterProvider/exporter. That is internally inconsistent and can send readers down a dead end. Either remove this call from the tracing guide, or document the metrics pipeline explicitly; if you keep it, also handle the returned registration/error and unregister on shutdown.

Minimal fix for this guide

```diff
-    // Record connection pool metrics as span events
-    otelsql.RegisterDBStatsMetrics(db,
-        otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
-    )
     return db, nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 345 - 348, The snippet mixes traces-only
guidance with a metrics call: otelsql.RegisterDBStatsMetrics (with
semconv.DBSystemPostgreSQL) registers DB metrics and returns a registration
handle and error, so either remove this call from the tracing guide or
explicitly document the metrics pipeline (provide a MeterProvider/exporter
setup) and handle the return values; if you keep it, capture the returned
registration and error from otelsql.RegisterDBStatsMetrics, check and log/handle
the error, and call the registration’s Unregister (or equivalent) on shutdown to
clean up the metrics registration.

553-559: ⚠️ Potential issue | 🟡 Minor

Use a TCP probe here instead of wget.

4317 is the OTLP/gRPC port, so wget can fail even when the endpoint is reachable. That makes this troubleshooting step noisy and misleading.

Suggested fix

```diff
 kubectl run -it --rm debug --image=alpine --restart=Never -- \
-  wget -qO- http://<HOST_IP>:4317
+  nc -vz <HOST_IP> 4317
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 553 - 559, Replace the wget-based
connectivity check in the kubectl debug pod run command with a TCP/gRPC probe
(e.g., use nc/netcat or grpc_health_probe) because port 4317 is OTLP/gRPC and
HTTP probes can return false negatives; update the snippet that currently uses
"kubectl run ... -- wget -qO- http://<HOST_IP>:4317" to instead execute a TCP
connect or gRPC health check so the documentation demonstrates a correct
reachability test for the Datadog Agent OTLP gRPC port.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lfx_one/tracing.md`:
- Around line 24-30: Update the Mermaid diagram and accompanying prose so APM
traces flow directly from app to node agent to Datadog: modify the diagram lines
using the symbols app, agent, and dd to show app -- "OTLP/gRPC" --> agent and
agent -- "APM traces" --> dd (remove cluster_agent from the APM trace path and
any arrow from agent to cluster_agent or cluster_agent to dd for traces); then
edit the explanation text (lines 33-40) to explicitly state that traces are sent
from the application to the node agent and then directly to Datadog (the Datadog
Cluster Agent is not in the APM trace forwarding path).

---

Duplicate comments:
In `@lfx_one/tracing.md`:
- Around line 249-255: The helper logWithTrace currently always reads
trace_id/span_id from the span context and will emit zero-value IDs for
non-traced contexts; update logWithTrace to call
trace.SpanFromContext(ctx).SpanContext(), check sc.IsValid() and only append
slog.String("trace_id", ...) and slog.String("span_id", ...) to attrs when
sc.IsValid() is true so untraced contexts do not get zero-value trace/span
fields.
- Around line 345-348: The snippet mixes traces-only guidance with a metrics
call: otelsql.RegisterDBStatsMetrics (with semconv.DBSystemPostgreSQL) registers
DB metrics and returns a registration handle and error, so either remove this
call from the tracing guide or explicitly document the metrics pipeline (provide
a MeterProvider/exporter setup) and handle the return values; if you keep it,
capture the returned registration and error from otelsql.RegisterDBStatsMetrics,
check and log/handle the error, and call the registration’s Unregister (or
equivalent) on shutdown to clean up the metrics registration.
- Around line 553-559: Replace the wget-based connectivity check in the kubectl
debug pod run command with a TCP/gRPC probe (e.g., use nc/netcat or
grpc_health_probe) because port 4317 is OTLP/gRPC and HTTP probes can return
false negatives; update the snippet that currently uses "kubectl run ... -- wget
-qO- http://<HOST_IP>:4317" to instead execute a TCP connect or gRPC health
check so the documentation demonstrates a correct reachability test for the
Datadog Agent OTLP gRPC port.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 390fdaa5-9e5f-438d-af2e-2be2fb989744

📥 Commits

Reviewing files that changed from the base of the PR and between 87f0620 and 26a5ca7.

📒 Files selected for processing (1)
  • lfx_one/tracing.md

Comment on lines +24 to +30
```mermaid
        cluster_agent["Datadog Cluster Agent"]
    end
    dd["Datadog"]

    app -- "OTLP/gRPC" --> agent
    agent -- "APM traces" --> cluster_agent
    cluster_agent -- "Datadog API" --> dd
```
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does the Datadog Cluster Agent in Kubernetes receive/forward APM traces from Datadog Node Agents, or do Node Agents send traces directly to Datadog?

💡 Result:

In Kubernetes, APM traces do not flow through the Datadog Cluster Agent.

  • Application tracers send traces to the node-based Datadog Agent (DaemonSet) (via DD_AGENT_HOST / port 8126 or via the UDS socket), and that node agent sends the traces to Datadog. [1]
  • Datadog’s own Cluster Agent architecture explicitly describes the node agent as the component that “sends … APM traces … to the Datadog SaaS platform,” while the Cluster Agent focuses on cluster-level data/metadata. [2]
  • The Admission Controller is part of the Cluster Agent and can help auto-configure pods to talk to the node agent, but it’s not a trace forwarder. [1]

Answer: Node Agents send traces directly to Datadog; the Cluster Agent is not in the APM trace forwarding path. [1][2]

Sources
[1] Datadog Docs – Kubernetes APM: Trace Collection (docs.datadoghq.com)
[2] Datadog – Real-world applications of the Datadog Cluster Agent (Part 1) (datadoghq.com)



Correct the trace routing architecture in the diagram and explanation.

The Datadog Cluster Agent is not in the APM trace forwarding path. Node agents send traces directly to Datadog, not through the Cluster Agent. Update both the Mermaid diagram (lines 24-30) and the prose explanation (lines 33-40) to show: app → node agent → Datadog.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 24 - 30, Update the Mermaid diagram and
accompanying prose so APM traces flow directly from app to node agent to
Datadog: modify the diagram lines using the symbols app, agent, and dd to show
app -- "OTLP/gRPC" --> agent and agent -- "APM traces" --> dd (remove
cluster_agent from the APM trace path and any arrow from agent to cluster_agent
or cluster_agent to dd for traces); then edit the explanation text (lines 33-40)
to explicitly state that traces are sent from the application to the node agent
and then directly to Datadog (the Datadog Cluster Agent is not in the APM trace
forwarding path).
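Under that correction, the quoted diagram fragment might look like the following. Node names beyond those visible in the quoted fragment (`app`, `agent`, `dd`, the subgraph label) are assumptions:

```mermaid
flowchart LR
    subgraph node["Kubernetes node"]
        app["Go service"]
        agent["Datadog Node Agent"]
    end
    dd["Datadog"]

    app -- "OTLP/gRPC" --> agent
    agent -- "APM traces" --> dd
```

The Cluster Agent can still appear elsewhere in the architecture diagram for cluster-level metadata, just not on the APM trace path.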
