
Add distributed tracing documentation#37

Open
bramwelt wants to merge 2 commits into main from feature/LFXV2-613-tracing

Conversation

@bramwelt
Contributor

This pull request adds documentation for distributed tracing to the lfx_one/tracing.md file. The update introduces a new guide for setting up OpenTelemetry tracing with Datadog for Go services.

Document OpenTelemetry tracing setup for Go services on the LFX
Kubernetes platform with Datadog Agent integration, including
architecture diagrams, environment variable configuration, sampling
ratios per environment, trace/log correlation using slog-otel handler
and manual injection fallback.

Additionally document HTTP client/server instrumentation via otelhttp
and database tracing via otelsql, span/attribute naming conventions,
Jaeger local access, Datadog APM access, common query patterns for both
UIs, cluster agent architecture, and a troubleshooting section for
missing spans, sampling, and log correlation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Issue: LFXV2-613
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
Copilot AI review requested due to automatic review settings March 11, 2026 16:58
@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

Walkthrough

Adds a "Distributed Tracing" entry to the README and introduces a new, comprehensive lfx_one/tracing.md documentation file describing OpenTelemetry-based distributed tracing setup, examples, conventions, and troubleshooting for Go services.

Changes

| Cohort / File(s) | Summary |
|---|---|
| README Update — `lfx_one/README.md` | Inserted a new "Distributed Tracing" documentation link under the Documentation section. |
| New Tracing Documentation — `lfx_one/tracing.md` | Added a full guide on distributed tracing with OpenTelemetry: architecture, environment variables, Go SDK setup, context propagation (traceparent, baggage, Jaeger), sampling, trace/log correlation, instrumentation examples for HTTP and DB, access/query patterns for Jaeger/Datadog, troubleshooting, and diagrams. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: adding distributed tracing documentation. It directly matches the core objective of introducing a new tracing.md documentation file. |
| Description check | ✅ Passed | The pull request description is directly related to the changeset, explaining that it adds distributed tracing documentation for OpenTelemetry setup with Datadog for Go services, which aligns with the actual changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new documentation guide describing how LFX Go services should set up distributed tracing using OpenTelemetry with Datadog, and links it from the LFX One docs index.

Changes:

  • Introduces lfx_one/tracing.md with architecture, env var configuration, Go examples, naming conventions, and troubleshooting steps for OTEL→Datadog tracing.
  • Updates lfx_one/README.md to include the new Distributed Tracing doc.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `lfx_one/tracing.md` | New end-to-end guide for OTEL tracing with Datadog, including examples and operational guidance. |
| `lfx_one/README.md` | Adds a navigation link to the new tracing documentation. |


Comment on lines +80 to +88
| Variable | Description | Example |
|---|---|---|
| `OTEL_SERVICE_NAME` | Identifies the service in Datadog | `query-service` |
| `OTEL_SERVICE_VERSION` | Service version shown in traces | `1.2.0` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Datadog Agent OTLP endpoint | `http://$(HOST_IP):4317` |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | Transport protocol | `grpc` |
| `OTEL_PROPAGATORS` | Context propagation formats | `tracecontext,baggage,jaeger` |
| `OTEL_TRACES_SAMPLER` | Sampling strategy | `parentbased_traceidratio` |
| `OTEL_TRACES_SAMPLER_ARG` | Sampling ratio (0.0 - 1.0) | `0.5` |

Copilot AI Mar 11, 2026


Markdown tables in this doc are written with double leading pipes (e.g. || Variable | ... |). GitHub-flavored Markdown treats this as normal text, so the tables won’t render. Use single pipes (| Variable | ... |) and update the other tables in this file that use the same pattern.
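If you want to catch this pattern mechanically before committing, a small Go helper can normalize a doubled leading pipe. This is a hypothetical illustration, not part of the PR under review:

```go
package main

import (
	"fmt"
	"strings"
)

// fixTablePipes normalizes a markdown table row that mistakenly starts
// with a doubled leading pipe ("|| Variable | ... |"), which GFM treats
// as plain text, to a single leading pipe.
func fixTablePipes(line string) string {
	trimmed := strings.TrimLeft(line, " \t")
	if strings.HasPrefix(trimmed, "||") {
		return strings.Replace(trimmed, "||", "|", 1)
	}
	return line
}

func main() {
	fmt.Println(fixTablePipes("|| Variable | Description | Example |"))
	// prints "| Variable | Description | Example |"
}
```

Running a check like this over every line starting with `|` would flag all the affected tables at once.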

Comment on lines +171 to +176
```go
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
    propagation.TraceContext{},
    propagation.Baggage{},
    jaeger.Jaeger{},
))
```

Copilot AI Mar 11, 2026


The sample explicitly sets the global propagator in code via otel.SetTextMapPropagator(...), which overrides any OTEL_PROPAGATORS environment configuration discussed earlier. Either remove the code-level propagator setup (and rely on OTEL_PROPAGATORS), or call out clearly that this is an alternative to using the env var (not in addition to it).
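For readers unfamiliar with what the `tracecontext` propagator actually moves between services, here is a minimal stdlib-only sketch of parsing the W3C `traceparent` header it injects and extracts. `parseTraceparent` is an illustrative helper, not an OpenTelemetry API:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header of the form
// "00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>" into its parts.
// A real propagator also validates the hex characters; this sketch
// checks only the shape.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], parts[2], parts[3] == "01", nil
}

func main() {
	tid, sid, sampled, err := parseTraceparent(
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tid, sid, sampled, err)
}
```

Whether the propagator is configured via `OTEL_PROPAGATORS` or via `otel.SetTextMapPropagator`, this header format is the same; the review's point is that only one of the two configuration paths takes effect.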

Comment on lines +343 to +346
```go
// Record connection pool metrics as span events
otelsql.RegisterDBStatsMetrics(db,
    otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
)
```

Copilot AI Mar 11, 2026


otelsql.RegisterDBStatsMetrics produces OpenTelemetry metrics instruments for DB stats; it does not “record connection pool metrics as span events”. This also conflicts with the earlier statement that only traces are collected via OTEL. Please adjust the wording (or remove this metrics call if OTEL metrics aren’t intended in this setup).

Suggested change (the suggestion deletes these lines):

```go
// Record connection pool metrics as span events
otelsql.RegisterDBStatsMetrics(db,
    otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
)
```

Comment on lines +518 to +521
```bash
kubectl run -it --rm debug --image=alpine --restart=Never -- \
wget -qO- http://<HOST_IP>:4317
```

Copilot AI Mar 11, 2026


The troubleshooting connectivity check uses wget http://<HOST_IP>:4317, but 4317 is an OTLP gRPC port and won’t respond to an HTTP GET, so this can fail even when the port is reachable. Use a TCP-level check (e.g., nc -vz <HOST_IP> 4317) or a gRPC-aware tool (e.g., grpcurl) instead.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lfx_one/tracing.md`:
- Around line 343-346: Update the misleading comment and handle the return
values from otelsql.RegisterDBStatsMetrics: change the comment to say it
"registers connection pool metrics" (not "records ... as span events"), capture
the returned metric.Registration and error (e.g. reg, err :=
otelsql.RegisterDBStatsMetrics(...)), check and handle the error, and store the
registration handle so you can call reg.Unregister() during DB cleanup/shutdown
to avoid leaking the registration.
- Around line 247-254: In logWithTrace, guard reading IDs by checking the span
context validity: call trace.SpanFromContext(ctx) to get the SpanContext (sc)
and use sc.IsValid() before constructing slog.String("trace_id", ...) and
slog.String("span_id", ...); if sc.IsValid() is false, call
slog.InfoContext(ctx, msg, args...) without adding trace/span attributes,
otherwise append the trace/span attrs to args and call slog.InfoContext(ctx,
msg, appendedArgs...). This uses trace.SpanFromContext, SpanContext.IsValid, and
slog.InfoContext to avoid emitting zero-value trace_id/span_id for untraced
contexts.
- Around line 515-524: Replace the unreliable `wget` check against the gRPC OTLP
endpoint with a plain TCP connectivity test: update the debug pod command that
currently runs `wget -qO- http://<HOST_IP>:4317` to instead run `nc -vz
<HOST_IP> 4317` (or `nc -zv` depending on the image) so you only verify TCP
reachability to port 4317; keep references to `OTEL_EXPORTER_OTLP_ENDPOINT`,
`HOST_IP`, and port `4317` so the docs explain that this verifies the Datadog
Agent OTLP gRPC port is reachable (ECONNREFUSED still indicates OTLP ingestion
is disabled).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: edf56dc0-e420-4c12-9b36-97890111a931

📥 Commits

Reviewing files that changed from the base of the PR and between 1e54346 and 87f0620.

📒 Files selected for processing (2)
  • lfx_one/README.md
  • lfx_one/tracing.md

Comment on lines +247 to +254
```go
func logWithTrace(ctx context.Context, msg string, args ...any) {
    sc := trace.SpanFromContext(ctx).SpanContext()
    attrs := []any{
        slog.String("trace_id", sc.TraceID().String()),
        slog.String("span_id", sc.SpanID().String()),
    }
    slog.InfoContext(ctx, msg, append(attrs, args...)...)
}
```
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate the file and verify the code snippet
find . -name "tracing.md" -type f

Repository: linuxfoundation/lfx-engineering

Length of output: 95


🏁 Script executed:

# Check OpenTelemetry Go SDK version and SpanContext.IsValid() availability
# Search for any go.mod or dependency info
fd -e go.mod | head -5

Repository: linuxfoundation/lfx-engineering

Length of output: 57


🌐 Web query:

OpenTelemetry Go SDK SpanContext.IsValid() method documentation

💡 Result:

In the OpenTelemetry Go SDK/API, SpanContext.IsValid() is documented on go.opentelemetry.io/otel/trace as: (pkg.go.dev)

func (sc SpanContext) IsValid() bool

IsValid reports whether the SpanContext is valid; a valid SpanContext has a valid TraceID and SpanID. (pkg.go.dev)



🌐 Web query:

go.opentelemetry.io/otel/trace SpanFromContext no-op span behavior

💡 Result:

In the OpenTelemetry Go API, trace.SpanFromContext(ctx):

  • Returns the current Span stored in ctx.
  • If ctx has no span, it returns a non-recording “no-op” Span (i.e., a Span implementation that “performs no operations”). Its methods like SetAttributes, AddEvent, RecordError, SetStatus, End, etc. are safe to call but have no effect. [1][2]

What that “no-op” span looks like:

  • It is non-recording (IsRecording() == false), so any data you try to add is discarded (effectively a no-op). [2]
  • If there is no span in the context, the API behavior is to return an empty non-recording span (with an “all-zero / invalid” SpanContext: zero trace/span IDs, empty tracestate, unsampled flags). [2]

Related nuance (propagation / “SDK not installed” case):

  • Even when the overall API is acting as no-op (e.g., no SDK installed), OpenTelemetry requires that a non-recording span still carries and propagates any existing SpanContext from the parent context; but it will not create new trace IDs in that mode. [2]

Sources:
[1] go.opentelemetry.io/otel/trace package docs for SpanFromContext (pkg.go.dev) (pkg.go.dev)
[2] OpenTelemetry Trace API spec: non-recording span / empty span when none in context; non-recording discards data (opentelemetry.io)



🏁 Script executed:

# Read the specific lines of the file to verify the code snippet
sed -n '247,254p' ./lfx_one/tracing.md

Repository: linuxfoundation/lfx-engineering

Length of output: 387


🏁 Script executed:

# Get more context around those lines to understand the section
sed -n '240,260p' ./lfx_one/tracing.md

Repository: linuxfoundation/lfx-engineering

Length of output: 651


Guard the manual fallback when no span is active.

trace.SpanFromContext(ctx) returns a no-op span when the context has no current span, and SpanContext.IsValid() is the supported check before reading IDs. As written, this helper can emit zero-value trace_id/span_id fields for untraced log calls, which makes correlation noisy.

Add a validity check to skip instrumentation when the span context is invalid:

Suggested fix

```diff
 func logWithTrace(ctx context.Context, msg string, args ...any) {
     sc := trace.SpanFromContext(ctx).SpanContext()
+    if !sc.IsValid() {
+        slog.InfoContext(ctx, msg, args...)
+        return
+    }
     attrs := []any{
         slog.String("trace_id", sc.TraceID().String()),
         slog.String("span_id", sc.SpanID().String()),
     }
     slog.InfoContext(ctx, msg, append(attrs, args...)...)
 }
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```go
func logWithTrace(ctx context.Context, msg string, args ...any) {
    sc := trace.SpanFromContext(ctx).SpanContext()
    if !sc.IsValid() {
        slog.InfoContext(ctx, msg, args...)
        return
    }
    attrs := []any{
        slog.String("trace_id", sc.TraceID().String()),
        slog.String("span_id", sc.SpanID().String()),
    }
    slog.InfoContext(ctx, msg, append(attrs, args...)...)
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 247 - 254, In logWithTrace, guard reading
IDs by checking the span context validity: call trace.SpanFromContext(ctx) to
get the SpanContext (sc) and use sc.IsValid() before constructing
slog.String("trace_id", ...) and slog.String("span_id", ...); if sc.IsValid() is
false, call slog.InfoContext(ctx, msg, args...) without adding trace/span
attributes, otherwise append the trace/span attrs to args and call
slog.InfoContext(ctx, msg, appendedArgs...). This uses trace.SpanFromContext,
SpanContext.IsValid, and slog.InfoContext to avoid emitting zero-value
trace_id/span_id for untraced contexts.

Comment on lines +343 to +346
```go
// Record connection pool metrics as span events
otelsql.RegisterDBStatsMetrics(db,
    otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
)
```
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /root && find . -name "tracing.md" -type f | head -20

Repository: linuxfoundation/lfx-engineering

Length of output: 130


🏁 Script executed:

cat -n lfx_one/tracing.md | sed -n '340,350p'

Repository: linuxfoundation/lfx-engineering

Length of output: 387


🌐 Web query:

otelsql RegisterDBStatsMetrics return values documentation

💡 Result:

In github.com/XSAM/otelsql, RegisterDBStatsMetrics has this signature:

func RegisterDBStatsMetrics(db *sql.DB, opts ...Option) (metric.Registration, error)

Return values:

  • metric.Registration: a handle for the metrics callback registration; call reg.Unregister() when the *sql.DB is no longer used. The README shows this exact usage pattern. [1][2]
  • error: non-nil if otelsql fails to create the DBStats metric instruments or fails to register the callback with the Meter (in those cases, the function returns nil, err). [2]

Sources: pkg.go.dev API doc [1] and the implementation in sql.go [2].


Fix the comment and handle RegisterDBStatsMetrics return values.

Line 343's comment claims the code "records connection pool metrics as span events" which is misleading—RegisterDBStatsMetrics registers metrics, not span events. Additionally, the call drops the returned metric.Registration and error; the registration handle must be retained to call Unregister() for cleanup when the database connection is no longer needed, and the error should be checked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 343 - 346, Update the misleading comment and
handle the return values from otelsql.RegisterDBStatsMetrics: change the comment
to say it "registers connection pool metrics" (not "records ... as span
events"), capture the returned metric.Registration and error (e.g. reg, err :=
otelsql.RegisterDBStatsMetrics(...)), check and handle the error, and store the
registration handle so you can call reg.Unregister() during DB cleanup/shutdown
to avoid leaking the registration.
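The register-then-unregister pattern the comment asks for has roughly this shape. The `registration` type below is a stand-in for the `metric.Registration` handle that `otelsql.RegisterDBStatsMetrics` returns (the real type lives in `go.opentelemetry.io/otel/metric`), so the types here are illustrative only:

```go
package main

import (
	"errors"
	"fmt"
)

// registration is a stand-in for otelsql's returned metric.Registration
// handle. The point is the shape of the call: capture the handle and
// the error, and unregister the callback on shutdown.
type registration struct{ active *bool }

func (r registration) Unregister() error { *r.active = false; return nil }

// registerDBStats mimics RegisterDBStatsMetrics' two return values.
func registerDBStats(ok bool) (registration, error) {
	if !ok {
		return registration{}, errors.New("failed to create DB stats instruments")
	}
	active := true
	return registration{active: &active}, nil
}

func main() {
	reg, err := registerDBStats(true) // check the error, as the review asks
	if err != nil {
		panic(err)
	}
	defer reg.Unregister() // clean up the callback during DB shutdown
	fmt.Println("db stats metrics registered")
}
```

In real code the `defer reg.Unregister()` would typically live wherever the `*sql.DB` itself is closed, so the metrics callback does not outlive the pool it observes.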

Comment on lines +515 to +524
Verify `OTEL_EXPORTER_OTLP_ENDPOINT` is set correctly and that `HOST_IP` is
resolving. Run a debug pod on the same node and test connectivity:

```bash
kubectl run -it --rm debug --image=alpine --restart=Never -- \
wget -qO- http://<HOST_IP>:4317
```

The Datadog Agent OTLP gRPC port should be reachable. An ECONNREFUSED error
means OTLP ingestion is not enabled on the agent.
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "tracing.md" -type f

Repository: linuxfoundation/lfx-engineering

Length of output: 95


🏁 Script executed:

if [ -f "lfx_one/tracing.md" ]; then
  echo "=== File exists, checking lines 515-524 ==="
  sed -n '515,524p' "lfx_one/tracing.md"
  echo ""
  echo "=== Context around lines 515-524 ==="
  sed -n '510,530p' "lfx_one/tracing.md"
fi

Repository: linuxfoundation/lfx-engineering

Length of output: 1175


🏁 Script executed:

if [ -f "lfx_one/tracing.md" ]; then
  wc -l "lfx_one/tracing.md"
  echo ""
  echo "=== Searching for OTLP references ==="
  rg -n "OTLP|4317" "lfx_one/tracing.md" | head -20
fi

Repository: linuxfoundation/lfx-engineering

Length of output: 1201


🌐 Web query:

OTLP gRPC port 4317 health check test connectivity

💡 Result:

  • Know what you’re testing: 4317 is the default OTLP/gRPC receiver port (Collector default is localhost:4317, and newer “hardened” defaults bind to localhost unless you set 0.0.0.0). [1][2]

Quick “is the port reachable?” (TCP connectivity)

nc -vz <host> 4317
# or
telnet <host> 4317

This only proves something is listening and networking/firewall is OK.

gRPC-level connectivity (handshake)

If the endpoint exposes gRPC Health Checking Protocol

Use grpc_health_probe:

grpc_health_probe -addr=<host>:4317
# add TLS flags as needed (see tool docs)

grpc_health_probe works only if the server implements the standard gRPC health service. [3]

If the endpoint exposes server reflection

Use grpcurl:

grpcurl -plaintext <host>:4317 list

Or describe OTLP services (example pattern):

grpcurl -plaintext <host>:4317 describe opentelemetry.proto.collector.trace.v1

This requires reflection to be enabled on the server; otherwise grpcurl can’t “list/describe” without you supplying protos. [4]

Practical “health check” for an OpenTelemetry Collector

The Collector’s typical health endpoint is not on 4317; it’s provided by the health_check extension, which exposes an HTTP endpoint (commonly 0.0.0.0:13133). [5]

Example check:

curl -sf http://<host>:13133/ || echo "unhealthy"

Sources: [1] OTLP receiver defaults (localhost:4317) • [2] Collector hardening/default bind addresses • [3] grpc-health-probe tool (gRPC Health Checking Protocol) • [4] grpcurl usage example against OTLP • [5] Collector health_check extension endpoint (0.0.0.0:13133)


🌐 Web query:

gRPC health check wget vs nc tcp probe

💡 Result:

  • nc TCP probe: Only verifies that something is listening on the port (a TCP handshake succeeds). It cannot speak HTTP/2 or send a gRPC request, so it can’t validate gRPC service health—just basic connectivity. (kubernetes.io)
  • wget probe: wget is for HTTP(S) (typically HTTP/1.1). gRPC runs over HTTP/2 with gRPC framing, so wget generally cannot perform a real gRPC health check (unless your app also exposes a separate HTTP health endpoint). (grpc.io)
  • Actual gRPC health check: Implement and query the standard gRPC Health Checking Protocol (grpc.health.v1.Health/Check or Watch). Use purpose-built tools like grpc_health_probe (common in Kubernetes) or grpcurl to call the health service. (grpc.io)



🌐 Web query:

Datadog OTLP gRPC ingestion agent health check

💡 Result:

Datadog’s OTLP gRPC ingestion in the Datadog Agent is just a listener (typically port 4317) once OTLP ingestion is enabled. (docs.datadoghq.com)

There isn’t a documented “OTLP gRPC health check” RPC (like grpc.health.v1.Health/Check) for the Agent’s OTLP receiver. If you need a health signal for “can I ingest OTLP right now?”, the usual approach is:

  1. Check the Agent health endpoint (readiness/liveness)
  • The Datadog Helm chart configures the Agent container to expose HTTP probes on a dedicated health port (default 5555) with:

Example:

curl -fsS http://<agent-ip>:5555/ready
  1. Check that the OTLP gRPC port is reachable
  • Simple TCP connect check to 4317 (or whatever you configured):
nc -zv <agent-ip> 4317

If you specifically want a gRPC Health Checking Protocol-style probe, Datadog provides a check that can monitor endpoints that implement that protocol—but the service being checked must actually expose grpc.health.v1. (docs.datadoghq.com)



Replace wget with a TCP connectivity check for the gRPC port.

Using wget against the OTLP/gRPC endpoint on port 4317 is unreliable because gRPC runs over HTTP/2 with gRPC framing, not plain HTTP. The wget command will likely fail even when the agent is healthy and the port is reachable, creating false negatives that could mislead troubleshooting. Use nc -vz instead to verify basic TCP connectivity to the gRPC port.

Suggested fix

```diff
 kubectl run -it --rm debug --image=alpine --restart=Never -- \
-  wget -qO- http://<HOST_IP>:4317
+  nc -vz <HOST_IP> 4317
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 515 - 524, Replace the unreliable `wget`
check against the gRPC OTLP endpoint with a plain TCP connectivity test: update
the debug pod command that currently runs `wget -qO- http://<HOST_IP>:4317` to
instead run `nc -vz <HOST_IP> 4317` (or `nc -zv` depending on the image) so you
only verify TCP reachability to port 4317; keep references to
`OTEL_EXPORTER_OTLP_ENDPOINT`, `HOST_IP`, and port `4317` so the docs explain
that this verifies the Datadog Agent OTLP gRPC port is reachable (ECONNREFUSED
still indicates OTLP ingestion is disabled).

Align all table columns, clarify semconv usage for additional
span attributes with typed constructors and key constants, and
add a reference table of common semconv attributes by domain.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Issue: LFXV2-613
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
lfx_one/tracing.md (3)

249-255: ⚠️ Potential issue | 🟡 Minor

Guard manual log correlation when no span is active.

Without an sc.IsValid() check, this helper emits zero-value trace_id/span_id fields for untraced contexts.

Suggested fix

```diff
 func logWithTrace(ctx context.Context, msg string, args ...any) {
     sc := trace.SpanFromContext(ctx).SpanContext()
+    if !sc.IsValid() {
+        slog.InfoContext(ctx, msg, args...)
+        return
+    }
     attrs := []any{
         slog.String("trace_id", sc.TraceID().String()),
         slog.String("span_id", sc.SpanID().String()),
     }
     slog.InfoContext(ctx, msg, append(attrs, args...)...)
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 249 - 255, The helper logWithTrace currently
always reads trace_id/span_id from the span context and will emit zero-value IDs
for non-traced contexts; update logWithTrace to call
trace.SpanFromContext(ctx).SpanContext(), check sc.IsValid() and only append
slog.String("trace_id", ...) and slog.String("span_id", ...) to attrs when
sc.IsValid() is true so untraced contexts do not get zero-value trace/span
fields.

345-348: ⚠️ Potential issue | 🟠 Major

This snippet mixes a traces-only guide with OTel metrics registration.

Earlier the doc says LFX only collects traces via OTEL, but this example registers DB stats metrics and never shows a MeterProvider/exporter. That is internally inconsistent and can send readers down a dead end. Either remove this call from the tracing guide, or document the metrics pipeline explicitly; if you keep it, also handle the returned registration/error and unregister on shutdown.

Minimal fix for this guide

```diff
-    // Record connection pool metrics as span events
-    otelsql.RegisterDBStatsMetrics(db,
-        otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
-    )
     return db, nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 345 - 348, The snippet mixes traces-only
guidance with a metrics call: otelsql.RegisterDBStatsMetrics (with
semconv.DBSystemPostgreSQL) registers DB metrics and returns a registration
handle and error, so either remove this call from the tracing guide or
explicitly document the metrics pipeline (provide a MeterProvider/exporter
setup) and handle the return values; if you keep it, capture the returned
registration and error from otelsql.RegisterDBStatsMetrics, check and log/handle
the error, and call the registration’s Unregister (or equivalent) on shutdown to
clean up the metrics registration.

553-559: ⚠️ Potential issue | 🟡 Minor

Use a TCP probe here instead of wget.

4317 is the OTLP/gRPC port, so wget can fail even when the endpoint is reachable. That makes this troubleshooting step noisy and misleading.

Suggested fix

```diff
 kubectl run -it --rm debug --image=alpine --restart=Never -- \
-  wget -qO- http://<HOST_IP>:4317
+  nc -vz <HOST_IP> 4317
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 553 - 559, Replace the wget-based
connectivity check in the kubectl debug pod run command with a TCP/gRPC probe
(e.g., use nc/netcat or grpc_health_probe) because port 4317 is OTLP/gRPC and
HTTP probes can return false negatives; update the snippet that currently uses
"kubectl run ... -- wget -qO- http://<HOST_IP>:4317" to instead execute a TCP
connect or gRPC health check so the documentation demonstrates a correct
reachability test for the Datadog Agent OTLP gRPC port.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lfx_one/tracing.md`:
- Around line 24-30: Update the Mermaid diagram and accompanying prose so APM
traces flow directly from app to node agent to Datadog: modify the diagram lines
using the symbols app, agent, and dd to show app -- "OTLP/gRPC" --> agent and
agent -- "APM traces" --> dd (remove cluster_agent from the APM trace path and
any arrow from agent to cluster_agent or cluster_agent to dd for traces); then
edit the explanation text (lines 33-40) to explicitly state that traces are sent
from the application to the node agent and then directly to Datadog (the Datadog
Cluster Agent is not in the APM trace forwarding path).

---

Duplicate comments:
In `@lfx_one/tracing.md`:
- Around line 249-255: The helper logWithTrace currently always reads
trace_id/span_id from the span context and will emit zero-value IDs for
non-traced contexts; update logWithTrace to call
trace.SpanFromContext(ctx).SpanContext(), check sc.IsValid() and only append
slog.String("trace_id", ...) and slog.String("span_id", ...) to attrs when
sc.IsValid() is true so untraced contexts do not get zero-value trace/span
fields.
- Around line 345-348: The snippet mixes traces-only guidance with a metrics
call: otelsql.RegisterDBStatsMetrics (with semconv.DBSystemPostgreSQL) registers
DB metrics and returns a registration handle and error, so either remove this
call from the tracing guide or explicitly document the metrics pipeline (provide
a MeterProvider/exporter setup) and handle the return values; if you keep it,
capture the returned registration and error from otelsql.RegisterDBStatsMetrics,
check and log/handle the error, and call the registration’s Unregister (or
equivalent) on shutdown to clean up the metrics registration.
- Around line 553-559: Replace the wget-based connectivity check in the kubectl
debug pod run command with a TCP/gRPC probe (e.g., use nc/netcat or
grpc_health_probe) because port 4317 is OTLP/gRPC and HTTP probes can return
false negatives; update the snippet that currently uses "kubectl run ... -- wget
-qO- http://<HOST_IP>:4317" to instead execute a TCP connect or gRPC health
check so the documentation demonstrates a correct reachability test for the
Datadog Agent OTLP gRPC port.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 390fdaa5-9e5f-438d-af2e-2be2fb989744

📥 Commits

Reviewing files that changed from the base of the PR and between 87f0620 and 26a5ca7.

📒 Files selected for processing (1)
  • lfx_one/tracing.md

Comment on lines +24 to +30
```mermaid
        cluster_agent["Datadog Cluster Agent"]
    end
    dd["Datadog"]

    app -- "OTLP/gRPC" --> agent
    agent -- "APM traces" --> cluster_agent
    cluster_agent -- "Datadog API" --> dd
```
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does the Datadog Cluster Agent in Kubernetes receive/forward APM traces from Datadog Node Agents, or do Node Agents send traces directly to Datadog?

💡 Result:

In Kubernetes, APM traces do not flow through the Datadog Cluster Agent.

  • Application tracers send traces to the node-based Datadog Agent (DaemonSet) (via DD_AGENT_HOST / port 8126 or via the UDS socket), and that node agent sends the traces to Datadog. [1]
  • Datadog’s own Cluster Agent architecture explicitly describes the node agent as the component that “sends … APM traces … to the Datadog SaaS platform,” while the Cluster Agent focuses on cluster-level data/metadata. [2]
  • The Admission Controller is part of the Cluster Agent and can help auto-configure pods to talk to the node agent, but it’s not a trace forwarder. [1]

Answer: Node Agents send traces directly to Datadog; the Cluster Agent is not in the APM trace forwarding path. [1][2]

Sources
[1] Datadog Docs – Kubernetes APM: Trace Collection (docs.datadoghq.com)
[2] Datadog – Real-world applications of the Datadog Cluster Agent (Part 1) (datadoghq.com)



Correct the trace routing architecture in the diagram and explanation.

The Datadog Cluster Agent is not in the APM trace forwarding path. Node agents send traces directly to Datadog, not through the Cluster Agent. Update both the Mermaid diagram (lines 24-30) and the prose explanation (lines 33-40) to show: app → node agent → Datadog.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lfx_one/tracing.md` around lines 24 - 30, Update the Mermaid diagram and
accompanying prose so APM traces flow directly from app to node agent to
Datadog: modify the diagram lines using the symbols app, agent, and dd to show
app -- "OTLP/gRPC" --> agent and agent -- "APM traces" --> dd (remove
cluster_agent from the APM trace path and any arrow from agent to cluster_agent
or cluster_agent to dd for traces); then edit the explanation text (lines 33-40)
to explicitly state that traces are sent from the application to the node agent
and then directly to Datadog (the Datadog Cluster Agent is not in the APM trace
forwarding path).
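Under that correction, the quoted diagram fragment might look like the following. Node names beyond those visible in the quoted fragment (`app`, `agent`, `dd`, the subgraph label) are assumptions:

```mermaid
flowchart LR
    subgraph node["Kubernetes node"]
        app["Go service"]
        agent["Datadog Node Agent"]
    end
    dd["Datadog"]

    app -- "OTLP/gRPC" --> agent
    agent -- "APM traces" --> dd
```

The Cluster Agent can still appear elsewhere in the architecture diagram for cluster-level metadata, just not on the APM trace path.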
