
Core: Add OpenTelemetry MetricsReporter#16250

Open
moomindani wants to merge 2 commits into apache:main from moomindani:moomindani/otel-metrics-reporter

Conversation

@moomindani
Contributor

Closes #16169.

Adds OtelMetricsReporter, a vendor-neutral MetricsReporter that exports Iceberg ScanReport and CommitReport via OpenTelemetry to any OTLP-compatible backend (Prometheus, CloudWatch, Datadog, Grafana Cloud, Honeycomb, etc.).

Design

The reporter does not own the OpenTelemetry SDK. It obtains the OpenTelemetry instance from GlobalOpenTelemetry.get(), which the host application (Spark, Flink, Trino, ...) is expected to register via OpenTelemetrySdk.builder()...buildAndRegisterGlobal() or via the OpenTelemetry Java agent. If no SDK has been registered, OpenTelemetry returns a no-op implementation and metric calls are silently dropped.

This mirrors the SDK-ownership philosophy established in #14360 (OpenTelemetry support in HTTPClient). The two PRs are complementary: #14360 instruments REST-catalog HTTP calls, this PR instruments Iceberg-level scan/commit reports.

Configuration

A single catalog property registers the reporter:

metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter

Endpoint, exporter, headers, resource attributes, and exporter intervals are configured by the host application or via the standard OpenTelemetry environment variables (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_HEADERS, ...).

Metric mapping

  • iceberg.scan.planning.duration (histogram, ms)
  • iceberg.scan.result.{data_files,delete_files} (sum)
  • iceberg.scan.data_manifests.{scanned,skipped} (sum)
  • iceberg.scan.file_size.bytes (sum, By)
  • iceberg.commit.duration (histogram, ms)
  • iceberg.commit.{attempts,records.added} (sum)
  • iceberg.commit.data_files.{added,removed} (sum)
  • iceberg.commit.file_size.added_bytes (sum, By)

Attributes: iceberg.table.name, iceberg.snapshot.id, iceberg.schema.id, iceberg.operation.
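The mapping is mechanical: every instrument name is "iceberg." plus the report kind plus the field path, and the attributes above ride alongside each measurement rather than being encoded into the name. A small illustrative helper (not part of the PR; class and method names are hypothetical) captures the scheme:

```java
import java.util.Map;

// Hypothetical sketch, not code from the PR: shows how the metric names
// and shared attributes listed above are composed.
public class IcebergMetricNames {

  // Builds a metric name like "iceberg.scan.result.data_files".
  public static String qualify(String reportKind, String field) {
    return "iceberg." + reportKind + "." + field;
  }

  // The attribute keys shared by every instrument, per the PR description.
  public static Map<String, String> exampleAttributes(String table, long snapshotId) {
    return Map.of(
        "iceberg.table.name", table,
        "iceberg.snapshot.id", Long.toString(snapshotId));
  }

  public static void main(String[] args) {
    // prints "iceberg.scan.result.data_files"
    System.out.println(qualify("scan", "result.data_files"));
  }
}
```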

Dependencies

Only io.opentelemetry:opentelemetry-api is added to iceberg-core, declared as compileOnly. The OpenTelemetry SDK and OTLP exporters are not added to the runtime classpath — they come from the host application. Test scope adds opentelemetry-sdk and opentelemetry-sdk-testing for InMemoryMetricReader-based unit tests, plus opentelemetry-exporter-otlp for the gated end-to-end smoke test.

Validation

Validated end-to-end against two completely different OTLP backends, using the same reporter class without modification:

  1. Databricks Zerobus Ingest (OTLP/gRPC, Bearer auth) — metrics land directly in a Unity Catalog Delta table; verified with SQL aggregations matching injected values exactly.
  2. Amazon CloudWatch (OTLP/HTTP, SigV4 via OTel Collector) — same reporter, same metric names, same attributes; verified via PromQL sum by() and ratio queries.

In both cases the host process built and registered an OpenTelemetrySdk (with the appropriate exporter and headers) before initializing Iceberg's reporter.

Disclosure

Per the project's AI-assisted contribution guidelines, I used Claude Code to help draft and prototype this work. I reviewed every change by hand and ran the full test/lint loop locally before each iteration; the validation results above are from my own runs against real backends. The design discussion happened in #16169.

cc @ebyhr @singhpk234 @jbonofre — happy to address any feedback.

Adds a vendor-neutral MetricsReporter that exports Iceberg ScanReport
and CommitReport via OpenTelemetry. The reporter does not own the
OpenTelemetry SDK; it obtains the OpenTelemetry instance from
GlobalOpenTelemetry, which the host application (Spark, Flink, Trino,
...) is expected to register via OpenTelemetrySdk.builder()...
buildAndRegisterGlobal() or via the OpenTelemetry Java agent. If no
SDK has been registered, OpenTelemetry returns a no-op implementation
and metric calls are silently dropped.

There is no Iceberg-specific configuration surface for endpoint,
protocol, headers, exporter intervals, or resource attributes. All
of these are owned by the host application or by the standard
OpenTelemetry environment variables (e.g. OTEL_EXPORTER_OTLP_ENDPOINT,
OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_HEADERS).

Configure as the Iceberg metrics reporter via catalog properties:

    metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter

Each ScanReport and CommitReport field is mapped to a stable metric
name (iceberg.scan.* / iceberg.commit.*) with iceberg.table.name,
iceberg.snapshot.id, iceberg.schema.id, and iceberg.operation as
attributes, matching the existing reporter conventions.

Only opentelemetry-api is added as compileOnly to iceberg-core. The
OpenTelemetry SDK and OTLP exporters are not added to the runtime
classpath; they come from the host application. Test scope adds
opentelemetry-sdk and opentelemetry-sdk-testing for InMemoryMetricReader-
based unit tests, plus opentelemetry-exporter-otlp for the gated
end-to-end smoke test.
@moomindani force-pushed the moomindani/otel-metrics-reporter branch from 38b1eb6 to 3d217f4 on May 8, 2026 04:04
Member

There is no assertion in this test class. Is it intentional?

Contributor Author

Good catch — addressed by removing the smoke test entirely in efa838e7f. The reasoning is in the next thread.

*
* <ul>
* <li>{@code OTEL_SMOKE_ENDPOINT} — full OTLP endpoint URL
* <li>{@code OTEL_SMOKE_PROTOCOL} — {@code grpc} or {@code http/protobuf}
Member

The buildExporter method silently falls back to http/protobuf when the value isn't grpc. We could deny unsupported protocols instead:

  private static MetricExporter buildExporter(String protocol, String endpoint, String headers) {
    if (protocol.equals("grpc")) {
...
    } else if (protocol.equals("http/protobuf")) {
...
    } else {
      throw new IllegalArgumentException("Unsupported protocol: " + protocol);
    }
  }
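Concretely, the strict-validation idea can be sketched without any OTel dependency (class and method names below are hypothetical; the real helper would build and return a MetricExporter rather than echo the protocol):

```java
// Hypothetical sketch of the suggested behavior: reject anything other
// than the two OTLP transports instead of silently falling back to
// http/protobuf.
public class OtlpProtocolCheck {

  public static String validate(String protocol) {
    switch (protocol) {
      case "grpc":
      case "http/protobuf":
        return protocol;
      default:
        throw new IllegalArgumentException("Unsupported protocol: " + protocol);
    }
  }

  public static void main(String[] args) {
    System.out.println(validate("grpc")); // prints "grpc"
    try {
      validate("thrift");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints "Unsupported protocol: thrift"
    }
  }
}
```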

Contributor Author

Addressed by removing the file in efa838e7f.

Comment on lines +144 to +147
OtlpGrpcMetricExporterBuilder builder =
OtlpGrpcMetricExporter.builder().setEndpoint(endpoint);
addHeaders(headers, builder::addHeader);
return builder.build();
Member

We could remove a local variable once we change the return type of the helper method to Map<String, String>:

      return OtlpGrpcMetricExporter.builder()
          .setEndpoint(endpoint)
          .setHeaders(() -> toMap(headers))
          .build();
...
  private static Map<String, String> toMap(String headers) {
    if (Strings.isNullOrEmpty(headers)) {
      return Map.of();
    }

    return Splitter.on(',').withKeyValueSeparator('=').split(headers);
  }
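For comparison, a pure-JDK sketch of the same parsing contract (hypothetical class; the suggestion above uses Guava's Splitter, which likewise rejects entries without a separator):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical dependency-free equivalent of the suggested toMap helper:
// parses "k1=v1,k2=v2" into an ordered map, treating null/empty input
// as an empty map and rejecting malformed entries.
public class OtlpHeaderParser {

  public static Map<String, String> toMap(String headers) {
    Map<String, String> result = new LinkedHashMap<>();
    if (headers == null || headers.isEmpty()) {
      return result;
    }
    for (String pair : headers.split(",")) {
      int eq = pair.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Malformed header entry: " + pair);
      }
      result.put(pair.substring(0, eq), pair.substring(eq + 1));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(toMap("Authorization=Bearer abc,x-tenant=demo"));
  }
}
```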

Contributor Author

Nice cleanup; moot now that the file is gone (efa838e7f). Filing this away for next time.

Comment on lines +171 to +173
if (value == null || value.isEmpty()) {
throw new IllegalStateException("Missing env var: " + name);
}
Member

We could use a helper method in Preconditions:

Preconditions.checkState(!Strings.isNullOrEmpty(value), "Missing env var: %s", name);

Contributor Author

Addressed by removing the file in efa838e7f.


private static String envOrDefault(String name, String fallback) {
String value = System.getenv(name);
return (value == null || value.isEmpty()) ? fallback : value;
Member

We could use the Strings.isNullOrEmpty method here.

Contributor Author

Addressed by removing the file in efa838e7f.

* <li>{@code OTEL_SMOKE_FLUSH_WAIT_MS} — sleep before close, defaults to {@code 5000}
* </ul>
*/
@EnabledIfEnvironmentVariable(named = "OTEL_SMOKE_ENABLED", matches = "true")
Member

I'm not a fan of conditional tests unless they require cloud resources. These tests can easily become obsolete, and we can't identify the broken test until someone runs it.

We could run this test using Testcontainers by default with any OTEL-compatible Docker image, such as jaegertracing. We could also allow arbitrary OTEL options via environment variables.

This is just an idea, not a requested change.

Contributor Author

Thanks — this and the comment on line 67 convinced me that the smoke test as written was the wrong abstraction. Removed in efa838e7f. The real-backend validation against Databricks Zerobus and AWS CloudWatch is preserved in the PR description (and I just posted fresh runs against the current code in a follow-up comment); if we want recurring CI coverage later, a Testcontainers-based test (Jaeger or otel-collector image) would be the right vehicle, and I'd be happy to follow up in a separate PR.

public class TestOtelEndpointSmoke {

@Test
public void exportToOtlpEndpoint() throws Exception {
Member

This test appears to be flawed. It passes even when I provide a random endpoint, such as http://localhost:8080.

Contributor Author

You're right — the PeriodicMetricReader exports asynchronously and surfaces failures only via logs, so the test couldn't detect a broken endpoint. Removed in efa838e7f.
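The failure mode generalizes beyond OpenTelemetry. A toy, dependency-free sketch (all names hypothetical) of a fire-and-forget exporter whose errors surface only through a log hook, so the calling test always observes success:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of the async-export pitfall: the export runs
// on a background task, failures are only "logged" (recorded via a hook),
// and the caller is told everything succeeded.
public class AsyncExportPitfall {

  public static boolean exportAsyncAndForget(Runnable export, AtomicBoolean failureLogged) {
    CompletableFuture.runAsync(export)
        .exceptionally(t -> { failureLogged.set(true); return null; }) // "log" only
        .join(); // wait so the demo is deterministic
    return true; // caller always sees success, like the removed smoke test did
  }

  public static void main(String[] args) {
    AtomicBoolean logged = new AtomicBoolean(false);
    boolean ok = exportAsyncAndForget(
        () -> { throw new RuntimeException("connection refused"); }, logged);
    // prints "test passed? true, failure only in logs: true"
    System.out.println("test passed? " + ok + ", failure only in logs: " + logged.get());
  }
}
```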


private InMemoryMetricReader metricReader;
private SdkMeterProvider meterProvider;
private OpenTelemetrySdk openTelemetry;
Member

This field can be converted to a local variable.

Contributor Author

Done in efa838e7f — moved to a local in @BeforeEach.

Comment on lines +207 to +209
assertThat(metrics.stream().anyMatch(m -> m.getName().equals(name)))
.as("Expected metric '%s' to exist", name)
.isTrue();
Member

We should try to avoid filtering on the caller's side as much as possible because it makes the failure message less helpful.

assertThat(metrics).extracting(MetricData::getName).contains(name);

Contributor Author

Done in efa838e7f — switched to assertThat(metrics).extracting(MetricData::getName).contains(name).

.orElseThrow(() -> new AssertionError("Metric not found: " + name));

long actualSum =
metric.getLongSumData().getPoints().stream().mapToLong(p -> p.getValue()).sum();
Member

We could use a method reference in mapToLong:

LongPointData::getValue

Contributor Author

Done in efa838e7f.

- Remove TestOtelEndpointSmoke; smoke test silently swallowed export
  errors (e.g. passed even with a bogus endpoint) and would rot over
  time. Real-backend validation is recorded in the PR description.
- Drop opentelemetry-exporter-otlp dependency that was only used by
  the smoke test.
- TestOtelMetricsReporter: inline openTelemetry field, switch
  assertMetricExists to AssertJ extracting() for better failure
  messages, use LongPointData::getValue method reference.
@moomindani
Contributor Author

moomindani commented May 8, 2026

End-to-end validation results — CloudWatch and Zerobus

This is a follow-up to the PR description's Validation section. That section summarized end-to-end runs against Databricks Zerobus and AWS CloudWatch performed against the original property-driven design. To make sure nothing regressed in the redesign, I re-ran both validations against the current code on this PR (commit efa838e7f, post-review fixes) and am posting the full procedure, log excerpts, and backend queries here so reviewers can verify operational equivalence on their own.

Result summary

  • Databricks Zerobus (OTLP/gRPC + OAuth Bearer → UC Delta table): 57 rows in <catalog>.<schema>.<table>; 12 iceberg.* metrics × 4 export cycles + 9 OTel self-monitoring rows; values match exactly (data_files=7, records.added=12345, planning.duration=123ms, commit.duration=231ms); iceberg.table.name / iceberg.snapshot.id round-trip via Variant.
  • AWS CloudWatch (OTLP/gRPC → otelcol-contrib SigV4 → CloudWatch PromQL): all 12 iceberg.* series visible in us-west-2 under @resource.service.name=iceberg-aws-validation-optionb; PromQL last_over_time({...}[1h]) returns exact values (data_files=7, records.added=12345, file_size.bytes=4096000, attempts=1).

The reporter's wire output is byte-for-byte equivalent to the prior runs against the property-driven design — same metric names, same attribute keys, same units. The only on-the-wire change is whatever service.name resource attribute the host application chooses to set.

What changed under the current design (recap)

  • Catalog config is a single property: metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter. No otel.* keys.
  • Host application owns the SDK lifecycle — typically OpenTelemetrySdk.builder()...buildAndRegisterGlobal() or the OpenTelemetry Java agent.
  • Reporter calls GlobalOpenTelemetry.get().getMeter("org.apache.iceberg") in initialize(...) and reports through it. If no SDK is registered, OpenTelemetry returns the no-op implementation and metric calls are silently dropped — standard OTel contract.
  • The validator used in this re-run lives entirely outside the iceberg repo at /tmp/iceberg-otel-validation/ (standalone Gradle JavaExec project depending on the locally-built iceberg-{core,api,common,bundled-guava}-efa838e.dirty.jar). Nothing was committed to the iceberg repo.
Full Databricks Zerobus report (architecture, queries, gotchas)

Iceberg OtelMetricsReporter (Option B) × Databricks Zerobus OTLP — Re-validation

Date: 2026-05-08
Branch / commit: moomindani/otel-metrics-reporter @ efa838e7f (Apache Iceberg PR #16250)
Result: Success. 57 metric rows landed in the destination Delta table within an ~8 second window. All metric values, histogram statistics, and Iceberg attributes (iceberg.table.name, iceberg.snapshot.id, iceberg.schema.id, iceberg.operation) match the values injected by the validator.


1. Subject

Re-validate the Option B design of OtelMetricsReporter against the same Databricks Zerobus Ingest OTLP endpoint that was used for the previous Option A validation (2026-04-30). In Option B the host application owns the OpenTelemetry SDK lifecycle — it builds an OpenTelemetrySdk, registers it via GlobalOpenTelemetry.set(...), and the reporter just looks up the global Meter in initialize(...). The reporter exposes zero catalog properties.

This validation confirms that the post-review reporter still works end-to-end with a real, externally-hosted OTLP receiver — i.e. that the Option B refactor did not break the wire-level behavior.


2. Architecture (Option B)

+--------------------+        +------------------------+        +--------------+        +-------------+
| Iceberg core       |  uses  | OtelMetricsReporter    |  uses  | Global OTel  |  OTLP  | Zerobus     |
| (ScanReport,       +------->+ (no SDK ownership;     +------->+ SDK (host    +------->+ Direct      |
|  CommitReport)     |        |  Meter via Global)     |        |  registers)  |  gRPC  | Write API   |
+--------------------+        +------------------------+        +--------------+        +------+------+
                                                                                                |
                                                                                                v
                                                                             +-------------------------------+
                                                                             | Unity Catalog Delta table     |
                                                                             | <catalog>.<schema>.      |
                                                                             |   iceberg_otel_metrics        |
                                                                             +-------------------------------+

No collector, no proxy. The host's OTLP gRPC exporter speaks directly to the Zerobus public endpoint.


3. Setup

3.1 Catalog property (the only Iceberg-side knob)

metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter

That's it. No endpoint, no headers, no protocol — Option B has zero reporter-specific properties. The reporter does:

@Override
public void initialize(Map<String, String> properties) {
  this.meter = GlobalOpenTelemetry.get().getMeter("org.apache.iceberg");
  createInstruments();
}

If no host SDK has been registered, GlobalOpenTelemetry.get() returns the no-op implementation and metrics are silently dropped — exactly per the OpenTelemetry contract.

3.2 Host-side SDK registration (the meaningful part of the validator)

Resource resource =
    Resource.getDefault().toBuilder()
        .put(AttributeKey.stringKey("service.name"), "iceberg-zerobus-validation-optionb")
        .build();

OtlpGrpcMetricExporter exporter =
    OtlpGrpcMetricExporter.builder()
        .setEndpoint("https://<workspace-id>.zerobus.ap-northeast-1.cloud.databricks.com:443")
        .addHeader("Authorization", "Bearer " + token)                       // <redacted>
        .addHeader("x-databricks-zerobus-table-name",
                   "<catalog>.<schema>.<table>")
        .setTimeout(Duration.ofSeconds(15))
        .build();

SdkMeterProvider meterProvider =
    SdkMeterProvider.builder()
        .setResource(resource)
        .registerMetricReader(
            PeriodicMetricReader.builder(exporter)
                .setInterval(Duration.ofSeconds(2))
                .build())
        .build();

OpenTelemetrySdk sdk = OpenTelemetrySdk.builder().setMeterProvider(meterProvider).build();
GlobalOpenTelemetry.resetForTest();   // make registration idempotent across runs
GlobalOpenTelemetry.set(sdk);

// Now Iceberg can be wired up — exactly as if a catalog had loaded the reporter.
OtelMetricsReporter reporter = new OtelMetricsReporter();
reporter.initialize(Collections.emptyMap());

Full source: /tmp/iceberg-otel-validation/src/main/java/ZerobusValidator.java (uncommitted; lives outside the iceberg repo).

3.3 Synthetic reports

Both ScanReport and CommitReport are constructed with ImmutableScanReport.builder() / ImmutableCommitReport.builder() and the corresponding *MetricsResult builders, using TimerResult.of and CounterResult.of(MetricsContext.Unit.COUNT|BYTES, ...) — same patterns as core/.../TestOtelMetricsReporter.java.

  • ScanReport (snapshotId=3001): dataFiles=7, deleteFiles=1, manifestsScanned=3, manifestsSkipped=0, fileSize=4_096_000B, planningDuration=123 ms
  • CommitReport (snapshotId=3002): attempts=1, addedDataFiles=4, removedDataFiles=0, addedRecords=12345, addedFileSize=2_048_000B, totalDuration=231 ms

4. Token / SP recipe

The same Service Principal <sp-display-name> (applicationId=<sp-application-id>, SCIM id <sp-scim-id>) was reused. A new client secret was issued for this run:

/opt/homebrew/bin/databricks service-principal-secrets-proxy create <sp-scim-id> --profile DEFAULT
# -> client_id, client_secret  -- captured into env vars only, never written to disk

The OAuth bearer token used to authenticate the OTLP request requires both resource and authorization_details claims, scoped to the destination table's UC privileges:

AUTH_DETAILS='[
  {"type":"unity_catalog_privileges","privileges":["USE CATALOG"],"object_type":"CATALOG","object_full_path":"users"},
  {"type":"unity_catalog_privileges","privileges":["USE SCHEMA"],"object_type":"SCHEMA","object_full_path":"<catalog>.<schema>"},
  {"type":"unity_catalog_privileges","privileges":["SELECT","MODIFY"],"object_type":"TABLE","object_full_path":"<catalog>.<schema>.<table>"}
]'

ZEROBUS_TOKEN=$(curl -s -X POST -u "$DBX_CLIENT_ID:$DBX_CLIENT_SECRET" \
  -d "grant_type=client_credentials" \
  -d "scope=all-apis" \
  -d "resource=api://databricks/workspaces/<workspace-id>/zerobusDirectWriteApi" \
  --data-urlencode "authorization_details=$AUTH_DETAILS" \
  "https://<workspace-host>/oidc/v1/token" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["access_token"])')
# Token TTL: ~1 hour. <redacted> in this report.

5. Run

JAVA_HOME=$(/usr/libexec/java_home -v 17)
ZEROBUS_TOKEN=<redacted>
cd /tmp/iceberg-otel-validation && ./gradlew run -PmainClass=ZerobusValidator

Console output (excerpted):

[validator] Option B Zerobus validation starting (host owns SDK).
[validator] GlobalOpenTelemetry registered with service.name=iceberg-zerobus-validation-optionb
[validator] OTLP endpoint=https://<workspace-id>.zerobus.ap-northeast-1.cloud.databricks.com:443 \
            table=<catalog>.<schema>.<table>
[main] INFO  org.apache.iceberg.metrics.OtelMetricsReporter
       - OtelMetricsReporter initialized. SDK lifecycle is owned by the host application
         (via GlobalOpenTelemetry).
[validator] OtelMetricsReporter initialized via Global SDK lookup.
[validator] ScanReport reported (snapshotId=3001, dataFiles=7).
[validator] CommitReport reported (snapshotId=3002, records=12345).
[validator] Sleeping 8s to allow periodic flush...
[validator] meterProvider closed.
[validator] Done.
BUILD SUCCESSFUL in 11s

(One "Failed to export metrics … Canceled" line appears on the final shutdown — it's the in-flight export call that the SDK aborts when meterProvider.close() is invoked. The four prior periodic flushes had already succeeded.)


6. Delta table query results

6.1 Row count and time window

SELECT COUNT(*) AS rows, MIN(time), MAX(time)
FROM <catalog>.<schema>.<table>
WHERE service_name = 'iceberg-zerobus-validation-optionb';
rows min_time max_time
57 2026-05-08T12:51:01.340Z 2026-05-08T12:51:07.336Z

6.2 Per-metric breakdown

SELECT name, metric_type, COUNT(*) AS rows
FROM <catalog>.<schema>.<table>
WHERE service_name = 'iceberg-zerobus-validation-optionb'
GROUP BY name, metric_type
ORDER BY name;
name metric_type rows
iceberg.commit.attempts sum 4
iceberg.commit.data_files.added sum 4
iceberg.commit.data_files.removed sum 4
iceberg.commit.duration histogram 4
iceberg.commit.file_size.added_bytes sum 4
iceberg.commit.records.added sum 4
iceberg.scan.data_manifests.scanned sum 4
iceberg.scan.data_manifests.skipped sum 4
iceberg.scan.file_size.bytes sum 4
iceberg.scan.planning.duration histogram 4
iceberg.scan.result.data_files sum 4
iceberg.scan.result.delete_files sum 4
otel.sdk.metric_reader.collection.duration histogram 3
otlp.exporter.exported sum 3
otlp.exporter.seen sum 3

12 Iceberg metrics × 4 export cycles = 48 rows. The remaining 9 rows are OTLP SDK self-monitoring metrics (always present in OTel ≥ 1.61, three per cycle), so 48 + 9 = 57. Counts line up with PeriodicMetricReader(2 s) plus a final flush at meterProvider.close().

6.3 Per-row verification (Iceberg metrics only)

SELECT name,
       variant_get(sum.attributes, '$["iceberg.table.name"]', 'STRING') AS table_name,
       variant_get(sum.attributes, '$["iceberg.snapshot.id"]', 'BIGINT') AS snapshot_id,
       sum.value AS sum_value
FROM <catalog>.<schema>.<table>
WHERE variant_get(sum.attributes, '$["iceberg.table.name"]', 'STRING') = 'zerobus_validation.test_table'
  AND time = (SELECT MAX(time) FROM <catalog>.<schema>.<table>
              WHERE service_name = 'iceberg-zerobus-validation-optionb')
ORDER BY name;
name table_name snapshot_id sum_value
iceberg.commit.attempts zerobus_validation.test_table 3002 1.0
iceberg.commit.data_files.added zerobus_validation.test_table 3002 4.0
iceberg.commit.data_files.removed zerobus_validation.test_table 3002 0.0
iceberg.commit.file_size.added_bytes zerobus_validation.test_table 3002 2 048 000.0
iceberg.commit.records.added zerobus_validation.test_table 3002 12 345.0
iceberg.scan.data_manifests.scanned zerobus_validation.test_table 3001 3.0
iceberg.scan.data_manifests.skipped zerobus_validation.test_table 3001 0.0
iceberg.scan.file_size.bytes zerobus_validation.test_table 3001 4 096 000.0
iceberg.scan.result.data_files zerobus_validation.test_table 3001 7.0
iceberg.scan.result.delete_files zerobus_validation.test_table 3001 1.0

Every value matches what was injected: dataFiles=7, deleteFiles=1, manifestsScanned=3, manifestsSkipped=0, fileSize=4 096 000B, attempts=1, addedDataFiles=4, removedDataFiles=0, addedRecords=12 345, addedFileSize=2 048 000B. Snapshot ids 3001 (scan) and 3002 (commit) flow through to iceberg.snapshot.id correctly. Same for iceberg.table.name=zerobus_validation.test_table.

6.4 Histogram example query

SELECT
  variant_get(h.histogram.attributes, '$["iceberg.table.name"]', 'STRING') AS table_name,
  h.name                                                  AS metric,
  h.histogram.count                                       AS sample_count,
  h.histogram.sum                                         AS total_ms,
  h.histogram.min                                         AS min_ms,
  h.histogram.max                                         AS max_ms,
  ROUND(h.histogram.sum / NULLIF(h.histogram.count, 0), 2) AS mean_ms,
  h.time
FROM <catalog>.<schema>.<table> h
WHERE h.service_name = 'iceberg-zerobus-validation-optionb'
  AND h.metric_type  = 'histogram'
  AND h.name LIKE 'iceberg.%'
ORDER BY h.time DESC, h.name
LIMIT 8;
table_name metric sample_count total_ms min_ms max_ms mean_ms
zerobus_validation.test_table iceberg.commit.duration 1 231.0 231.0 231.0 231.0
zerobus_validation.test_table iceberg.scan.planning.duration 1 123.0 123.0 123.0 123.0
zerobus_validation.test_table iceberg.commit.duration 1 231.0 231.0 231.0 231.0
zerobus_validation.test_table iceberg.scan.planning.duration 1 123.0 123.0 123.0 123.0
... (4 cycles × 2 histograms)

Histogram count, sum, min, max round-trip correctly — total_ms matches the injected Duration.ofMillis(231) / Duration.ofMillis(123).


7. Key differences vs Option A validation (2026-04-30)

  • Who owns the OpenTelemetry SDK? Option A: the reporter built its own OpenTelemetrySdk from catalog properties. Option B: the host builds OpenTelemetrySdk and registers it via GlobalOpenTelemetry.set(...).
  • Catalog properties on the reporter. Option A: ~7 (otel.endpoint, otel.protocol, otel.headers, otel.service-name, otel.export-interval, …). Option B: zero; the host configures everything.
  • Iceberg-side configuration. Option A: metrics-reporter-impl=…OtelMetricsReporter plus several otel.* properties. Option B: metrics-reporter-impl=…OtelMetricsReporter only.
  • Smoke test in the iceberg test module. Option A: OtelEndpointSmokeTest gated on an env var (OTEL_SMOKE_ENABLED). Option B: none; the validator lives entirely outside the iceberg repo (/tmp/iceberg-otel-validation/) and nothing is committed to iceberg.
  • Reporter dependency surface. Option A: compile dependency on the OTel API plus a reflective load of OTLP exporter classes. Option B: compile dependency on the OTel API only; the OTLP exporter, headers, retry, etc. are entirely the host's concern.
  • Failure mode if no SDK is registered. Option A: the reporter failed to initialize. Option B: GlobalOpenTelemetry.get() returns the no-op implementation and metrics are silently dropped, matching the standard OTel contract.

Behaviorally on the wire, the two designs are indistinguishable: rows landed in Delta with the same metric names, the same Iceberg attribute keys, and the same numeric values.


8. Lessons learned / gotchas

  • GlobalOpenTelemetry.set is one-shot per JVM: calling set(...) twice throws IllegalStateException unless resetForTest() is invoked first. The validator calls resetForTest() before set(...) to make in-process re-runs idempotent. Production hosts should set the global exactly once at startup.
  • "Canceled" log on shutdown: when meterProvider.close() is called while an OTLP request is still in flight, the OkHttp client surfaces java.io.IOException: Canceled. This is benign; the four prior periodic exports had already succeeded (4 rows × 12 metrics = 48).
  • Validator dependency surface: the outside-the-repo validator pulls in the OTLP gRPC exporter plus the iceberg-core/api/common/bundled-guava jars. iceberg-core itself is unchanged; the test module only ships opentelemetry-api, opentelemetry-sdk, and opentelemetry-sdk-testing. That's deliberate (Option B doesn't want to make the OTLP exporter a transitive dep of iceberg-core).
  • OTel SDK self-monitoring metrics in 1.61: the Delta table receives 3 extra rows per export cycle (otel.sdk.metric_reader.collection.duration, otlp.exporter.seen, otlp.exporter.exported). Filter by service_name plus name LIKE 'iceberg.%' to keep dashboards focused.
  • iceberg.table.name attribute: stored as a Variant key with dots; variant_get(<col>, '$["iceberg.table.name"]', 'STRING') is needed instead of dotted accessor syntax. Same as Option A.
  • Token TTL: 1 hour. For long-running services use the OTel collector's oauth2clientauthextension instead of a static bearer in addHeader(...).

9. Resources used

  • Delta table <catalog>.<schema>.<table>: retained (rows from Option A and Option B coexist; filter by service_name)
  • Service Principal <sp-display-name> (applicationId=<sp-application-id>): retained
  • New OAuth secret issued for this run: retained on the SP; revoke when no longer needed
  • Validator project /tmp/iceberg-otel-validation/ (Gradle, JavaExec): local only; not committed to iceberg
  • Iceberg local jars iceberg-{api,core,common,bundled-guava}-efa838e.dirty.jar: built via ./gradlew :iceberg-core:jar :iceberg-api:jar :iceberg-common:jar :iceberg-bundled-guava:jar

TL;DR: Option B works end-to-end against Zerobus. The reporter's initialize(emptyMap()) plus a host-side GlobalOpenTelemetry.set(sdk) is sufficient to stream every Iceberg scan and commit metric — values, histograms, and table/snapshot attributes — straight into a Unity Catalog Delta table. 57 rows landed, 100% of the values match what was injected.

Full AWS CloudWatch report (architecture, PromQL, gotchas)

Iceberg OtelMetricsReporter (Option B) × AWS CloudWatch OTLP — Re-validation

Date: 2026-05-01
Branch / commit: moomindani/otel-metrics-reporter @ efa838e7f
PR: apache/iceberg #16250
Result: Success. All 12 iceberg.* metrics arrived in CloudWatch through a local OpenTelemetry Collector (SigV4-signed). Both metric values and attributes match the synthetic ScanReport / CommitReport injected by the runner.


1. Subject

Re-validate the OtelMetricsReporter against AWS CloudWatch OTLP (Public Preview) after refactoring the reporter to Option B — i.e. the host application owns the OpenTelemetrySdk lifecycle and registers it via GlobalOpenTelemetry.set(...), and the reporter exposes zero catalog properties and simply calls GlobalOpenTelemetry.get().getMeter("org.apache.iceberg").

Option A (reporter creates its own SDK from catalog properties) had already been validated end-to-end against CloudWatch on 2026-04-30 — see ~/Documents/workspace/iceberg-otel-aws-validation.md. This document confirms the current (Option B) reporter still produces identical wire output and is operationally equivalent on the AWS side; the only thing that changed is who owns the SDK.


2. Architecture

Identical to the Option A validation — the reporter only ever speaks plain OTLP/gRPC to a local Collector; the Collector adds SigV4 and forwards to CloudWatch.

+-----------------------+      OTLP/gRPC          +---------------------+   OTLP/HTTP+SigV4   +------------+
| Iceberg (Java)        | ----------------------> | otelcol-contrib     | ------------------> | CloudWatch |
| OtelMetricsReporter   |  localhost:4317         | (native binary,     |                     | (PromQL,   |
|  (Option B: host SDK) |                         |  sigv4authextension)|                     |  Query     |
+-----------------------+                         +---------------------+                     |  Studio)   |
                                                          |                                   +------------+
                                                          | (debug exporter for tracing)
                                                          v
                                                       stdout

The only thing that changed between Option A and Option B is the left-hand box: in A, the reporter built its own OpenTelemetrySdk from catalog properties; in B, the host application is responsible for the SDK and the reporter is a thin "adapter" that finds the meter via GlobalOpenTelemetry.


3. Setup

3.1 Catalog configuration (Option B)

# In Spark / Flink / Trino catalog config
metrics-reporter-impl = org.apache.iceberg.metrics.OtelMetricsReporter

That is the entire Iceberg-side configuration. No endpoint, no headers, no service name, no exporter type, no batch interval — zero properties. Everything is owned by the host process's OpenTelemetrySdk.
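For concreteness, a sketch of how that single property might appear in a Spark catalog definition. The catalog name my_catalog and the type=rest line are illustrative; the spark.sql.catalog.<name>.<key> prefix is how Spark passes catalog properties through to Iceberg:

```properties
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=rest
spark.sql.catalog.my_catalog.metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter
```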

3.2 Host-side SDK registration (excerpt of validator)

The host process is expected to do something like this once at startup. The validator does it in main() to mimic a host's bootstrap:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
import io.opentelemetry.sdk.resources.Resource;
import java.time.Duration;

Resource resource =
    Resource.getDefault().toBuilder()
        .put(AttributeKey.stringKey("service.name"), "iceberg-aws-validation-optionb")
        .build();

OtlpGrpcMetricExporter exporter =
    OtlpGrpcMetricExporter.builder()
        .setEndpoint("http://localhost:4317")
        .setTimeout(Duration.ofSeconds(10))
        .build();

SdkMeterProvider meterProvider =
    SdkMeterProvider.builder()
        .setResource(resource)
        .registerMetricReader(
            PeriodicMetricReader.builder(exporter)
                .setInterval(Duration.ofSeconds(2))
                .build())
        .build();

OpenTelemetrySdk sdk = OpenTelemetrySdk.builder().setMeterProvider(meterProvider).build();

GlobalOpenTelemetry.resetForTest();   // makes the registration idempotent
GlobalOpenTelemetry.set(sdk);

In production a host would normally register the SDK via the official OpenTelemetry Java agent, AutoConfiguredOpenTelemetrySdk, or buildAndRegisterGlobal(). The reporter does not care which path is used; it only reads GlobalOpenTelemetry.get().

3.3 Iceberg-side wiring

OtelMetricsReporter reporter = new OtelMetricsReporter();         // no-arg ctor
reporter.initialize(Collections.emptyMap());                       // zero properties
reporter.report(scanReport);
reporter.report(commitReport);

Synthetic reports were the same shape as TestOtelMetricsReporter but with the values described in the task spec:

  • ScanReport: snapshotId=2001, schemaId=1, projectedFieldIds=[1,2], projectedFieldNames=["id","data"], resultDataFiles=7, resultDeleteFiles=1, scannedDataManifests=3, skippedDataManifests=0, totalFileSizeInBytes=4_096_000, totalPlanningDuration=123ms, tableName="aws_validation.test_table"
  • CommitReport: snapshotId=2002, sequenceNumber=2, operation="append", attempts=1, addedDataFiles=4, removedDataFiles=0, addedRecords=12345, addedFilesSizeInBytes=2_048_000, totalDuration=231ms, tableName="aws_validation.test_table"
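The sum-type metric values implied by these synthetic reports can be written down as an explicit oracle, which is what the CloudWatch queries in section 6 are checked against. A minimal sketch (metric names come from the PR's metric mapping; the two duration histograms, 123 ms planning and 231 ms commit, are omitted here because they are verified as histograms, not point values):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExpectedMetricValues {
    public static void main(String[] args) {
        // Expected data-point values derived from the synthetic ScanReport / CommitReport above.
        Map<String, Long> expected = new LinkedHashMap<>();
        expected.put("iceberg.scan.result.data_files", 7L);
        expected.put("iceberg.scan.result.delete_files", 1L);
        expected.put("iceberg.scan.data_manifests.scanned", 3L);
        expected.put("iceberg.scan.data_manifests.skipped", 0L);
        expected.put("iceberg.scan.file_size.bytes", 4_096_000L);
        expected.put("iceberg.commit.attempts", 1L);
        expected.put("iceberg.commit.data_files.added", 4L);
        expected.put("iceberg.commit.data_files.removed", 0L);
        expected.put("iceberg.commit.records.added", 12345L);
        expected.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```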

3.4 Where the validator lives

  • Build runner: /tmp/iceberg-otel-validation/build.gradle
  • Validator: /tmp/iceberg-otel-validation/src/main/java/AwsCloudWatchValidator.java
  • PromQL probe: /tmp/iceberg-otel-validation/promql_query.py
  • Run: cd /tmp/iceberg-otel-validation && JAVA_HOME=$(/usr/libexec/java_home -v 17) /Users/<user>/Documents/workspace/iceberg/gradlew run

The validator depends on the locally built Iceberg jars (iceberg-core-efa838e.dirty.jar, iceberg-api-efa838e.dirty.jar, iceberg-bundled-guava-efa838e.dirty.jar) plus the public OTel SDK + OTLP exporter from Maven Central. Nothing was committed to the Iceberg repo.

3.5 Collector

Reused the binary and config from the Option A run:

  • Binary: ~/Documents/workspace/iceberg-otel-aws-validation/otelcol-contrib (v0.151.0, darwin_arm64, upstream otelcol-contrib).

  • Config: ~/Documents/workspace/iceberg-otel-aws-validation/otel-config.yaml (already SigV4-configured for monitoring / us-west-2). No edits required — Collector has no concept of "Option A vs Option B"; it just receives OTLP and signs the egress.

  • Launch:

    eval "$(aws configure export-credentials --profile <aws-profile> --format env)"
    export AWS_REGION=us-west-2
    cd ~/Documents/workspace/iceberg-otel-aws-validation
    ./otelcol-contrib --config otel-config.yaml > collector.log 2>&1 &

Verified 4317 and 4318 were listening before running the validator.
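The pre-flight listener check can be done with plain JDK sockets; a minimal sketch (localhost and the standard OTLP ports 4317/4318 are the same endpoints used by the Collector config above):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    // Returns true if something accepts TCP connections on localhost:port within 500 ms.
    static boolean listening(int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("localhost", port), 500);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (int port : new int[] {4317, 4318}) {
            System.out.println("port " + port + (listening(port) ? " listening" : " not listening"));
        }
    }
}
```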


4. Runner output

[validator] Option B validation starting (host owns SDK).
[validator] GlobalOpenTelemetry registered with service.name=iceberg-aws-validation-optionb
[validator] OtelMetricsReporter initialized via Global SDK lookup.
[main] INFO org.apache.iceberg.metrics.OtelMetricsReporter - OtelMetricsReporter initialized. SDK lifecycle is owned by the host application (via GlobalOpenTelemetry).
[validator] ScanReport reported (snapshotId=2001, dataFiles=7).
[validator] CommitReport reported (snapshotId=2002, records=12345).
[validator] Sleeping 8s to allow periodic flush...
[validator] meterProvider closed.
[validator] Done.

BUILD SUCCESSFUL

The reporter's initialization log explicitly states that it does not own the SDK, which is exactly the contract Option B promises.


5. Collector receipt (debug exporter)

Aggregate counts during the 8-second window of the run:

2026-05-08T21:49:33.387+0900  info  Metrics  ... resource metrics: 5, metrics: 72, data points: 72

Resource attributes attached to every export (verifies the host-side resource flowed through unchanged):

Resource attributes:
     -> service.name: Str(iceberg-aws-validation-optionb)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.61.0)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope org.apache.iceberg

InstrumentationScope = org.apache.iceberg confirms the reporter used the meter name baked into OtelMetricsReporter.INSTRUMENTATION_NAME.

Per-metric verification — iceberg.scan.result.data_files:

     -> Name: iceberg.scan.result.data_files
     -> Description: Number of data files included in scan result
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> iceberg.schema.id: Int(1)
     -> iceberg.snapshot.id: Int(2001)
     -> iceberg.table.name: Str(aws_validation.test_table)
StartTimestamp: 2026-05-08 12:49:23.892955 +0000 UTC
Timestamp:      2026-05-08 12:49:25.865293 +0000 UTC
Value: 7

iceberg.commit.records.added:

     -> Name: iceberg.commit.records.added
     -> Description: Number of records added by commit
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> iceberg.operation: Str(append)
     -> iceberg.snapshot.id: Int(2002)
     -> iceberg.table.name: Str(aws_validation.test_table)
StartTimestamp: 2026-05-08 12:49:23.894813 +0000 UTC
Timestamp:      2026-05-08 12:49:25.865293 +0000 UTC
Value: 12345

All 12 iceberg.* metrics show up in the log with the expected attribute keys (iceberg.snapshot.id, iceberg.table.name, iceberg.schema.id, iceberg.operation) and values matching the synthetic reports.


6. CloudWatch verification (PromQL)

Queries were issued against https://monitoring.us-west-2.amazonaws.com with SigV4 signing. The runner script is /tmp/iceberg-otel-validation/promql_query.py.

6.1 Confirm the 12 series exist

GET /api/v1/label/__name__/values

Response (filtered to iceberg.*, sorted):

iceberg.commit.attempts
iceberg.commit.data_files.added
iceberg.commit.data_files.removed
iceberg.commit.duration
iceberg.commit.file_size.added_bytes
iceberg.commit.records.added
iceberg.scan.data_manifests.scanned
iceberg.scan.data_manifests.skipped
iceberg.scan.file_size.bytes
iceberg.scan.planning.duration
iceberg.scan.result.data_files
iceberg.scan.result.delete_files

All 12 are present. (CloudWatch retains iceberg.* series across runs, so the Option A run from 2026-04-30 also contributes here — but the per-series data below filters by @resource.service.name=iceberg-aws-validation-optionb so we only see Option B samples.)

6.2 Per-metric value verification

PromQL (instant query, wrapped in last_over_time(...[1h]) because the run was a one-shot rather than continuous):

last_over_time({__name__="iceberg.scan.result.data_files"}[1h])

Response (filtered to the Option B series via @resource.service.name):

{
  "metric": {
    "__name__": "iceberg.scan.result.data_files",
    "__type__": "Sum",
    "__monotonicity__": "true",
    "__temporality__": "cumulative",
    "@resource.service.name": "iceberg-aws-validation-optionb",
    "@resource.telemetry.sdk.language": "java",
    "@resource.telemetry.sdk.name": "opentelemetry",
    "@resource.telemetry.sdk.version": "1.61.0",
    "@instrumentation.@name": "org.apache.iceberg",
    "@aws.account": "<aws-account>",
    "@aws.region": "us-west-2",
    "iceberg.schema.id": "1",
    "iceberg.snapshot.id": "2001",
    "iceberg.table.name": "aws_validation.test_table"
  },
  "value": [<timestamp>, "7"]
}

Result: value = 7 — exactly the synthetic input. All resource attributes (service.name=iceberg-aws-validation-optionb, telemetry.sdk.language=java, etc.) plus the data-point attributes (iceberg.snapshot.id=2001, iceberg.schema.id=1, iceberg.table.name=aws_validation.test_table) round-trip cleanly.

last_over_time({__name__="iceberg.commit.records.added"}[1h])

value = 12345, with attributes:

  • @resource.service.name = iceberg-aws-validation-optionb
  • iceberg.snapshot.id = 2002
  • iceberg.operation = append
  • iceberg.table.name = aws_validation.test_table

last_over_time({__name__="iceberg.scan.file_size.bytes"}[1h])

value = 4096000, __unit__ = By (from the setUnit("By") call on the OTel instrument builder).

last_over_time({__name__="iceberg.commit.attempts"}[1h])

value = 1.

Summary table:

Metric Expected Observed in CloudWatch Match
iceberg.scan.result.data_files 7 7 yes
iceberg.commit.records.added 12345 12345 yes
iceberg.scan.file_size.bytes 4096000 4096000 yes
iceberg.commit.attempts 1 1 yes

The promql script returns ALL_MATCH.


7. Differences vs the Option A validation

Concern Option A (2026-04-30) Option B (2026-05-01, this run)
Who builds OpenTelemetrySdk? The reporter, from catalog properties (otel.metrics-reporter.*). The host application, before the catalog is loaded.
Catalog properties read by reporter ~10 (endpoint, protocol, headers, service-name, interval, …). Zero.
Reporter calls OpenTelemetrySdk.builder()...buildAndRegisterGlobal() GlobalOpenTelemetry.get().getMeter("org.apache.iceberg")
close() semantics Reporter shut down the SDK it built. Reporter does not own anything → close() is a no-op.
Smoke test (env-gated OtelEndpointSmokeTest) Present. Removed. Host SDK construction is the host's responsibility.
Validator runner Lived inside the iceberg test sourceset (uncommitted). Lives entirely in /tmp/iceberg-otel-validation/ (standalone Gradle).
Wire output (Collector + CloudWatch) Identical to Option B (12 iceberg.* series, same attributes). Identical to Option A (12 iceberg.* series, same attributes).
AWS-side artifacts None created. None created.

Importantly, the on-the-wire output is identical: same metric names, same descriptions/units, same attribute keys, same values. Reviewers can convince themselves of this by comparing the data-point excerpts in section 5 of this report against section 6 of the Option A report.


8. Lessons learned / new gotchas

Topic Detail
GlobalOpenTelemetry.set is "set-once" It throws if called twice. Tests that run initialize() may already have set the global; the validator calls GlobalOpenTelemetry.resetForTest() first to be safe. Production hosts wouldn't normally hit this — they register exactly once at boot.
No catalog property knobs A direct consequence of Option B is that there is nothing to misconfigure on the Iceberg side. If metrics aren't flowing, the diagnosis lies entirely on the host SDK side (env vars, agent attached, GlobalOpenTelemetry.set actually called, etc.). The docs should make this explicit so users don't go hunting for non-existent otel.* catalog keys.
boto3 SigV4 query signing AWSRequest(url=..., params=...) is required for GET /api/v1/query?query=.... Building the URL string yourself and then signing only the path silently produces a wrong signature → HTTP 400 with body {"message":null}. Use req.prepare().url so the canonical query string matches what's actually sent.
PromQL UTF-8 label-name selectors CloudWatch's PromQL preview rejects both back-tick (`@resource.service.name`) and double-quoted UTF-8 names inside the matcher braces. Stick to plain {__name__="..."} selectors and post-filter on @resource.* labels client-side, or rely on the last_over_time(...[1h]) window for one-shot smoke tests.
otlphttp exporter is now otlp_http The Collector v0.151.0 prints a deprecation warning on the otlphttp: key. Functional today; rename for the next config edit.
CloudWatch retains all prior runs The 12 iceberg.* series from Option A are still visible alongside Option B. Always filter PromQL responses by @resource.service.name (or another distinguishing resource attribute) when validating a specific run.
@instrumentation.@name label is double-@ CloudWatch flattens OTel scope name as @instrumentation.@name=org.apache.iceberg (note the leading @ plus the field's own @name key). Cosmetic but surprising — it's helpful as a "reporter fingerprint" filter (@instrumentation.@name="org.apache.iceberg") when many SDKs share a workspace.
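The client-side post-filtering recommended above (plain {__name__="..."} selectors, then filter on @resource.* labels locally) can be sketched with plain JDK collections, assuming the /api/v1/query response has already been parsed into one label map per series. The label names mirror the CloudWatch output shown in section 6; the second series stands in for a leftover Option A run:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PromqlPostFilter {
    // Keep only the series whose @resource.service.name matches the run being validated.
    static List<Map<String, String>> filterByService(
            List<Map<String, String>> series, String serviceName) {
        return series.stream()
                .filter(s -> serviceName.equals(s.get("@resource.service.name")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, String>> series = List.of(
                Map.of("__name__", "iceberg.scan.result.data_files",
                       "@resource.service.name", "iceberg-aws-validation-optionb"),
                Map.of("__name__", "iceberg.scan.result.data_files",
                       "@resource.service.name", "iceberg-aws-validation-optiona"));
        System.out.println(
                filterByService(series, "iceberg-aws-validation-optionb").size());
    }
}
```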

The previously-documented gotchas (region availability, no namespace concept, ADOT vs upstream Collector, otlphttp endpoint format, label naming) all still apply unchanged.


9. Status of resources

Kind Location State
OTel Collector binary ~/Documents/workspace/iceberg-otel-aws-validation/otelcol-contrib retained
OTel Collector config ~/Documents/workspace/iceberg-otel-aws-validation/otel-config.yaml retained
Validator (Gradle project) /tmp/iceberg-otel-validation/ retained for reference
CloudWatch metrics iceberg.* series in account <aws-account>, region us-west-2, service.name=iceberg-aws-validation-optionb retained until CloudWatch's default retention expires
AWS resources none created
Iceberg repo no commits, no uncommitted source files added under core/src/ clean

10. Conclusion

The Option B refactor of OtelMetricsReporter is operationally indistinguishable from Option A on the AWS side. All 12 iceberg.* series flow into CloudWatch with the same names, units, descriptions, attribute keys, and exact values as before. The migration to "host owns SDK" cost the user-visible catalog properties (now zero) and the smoke test (now redundant); it gained a much smaller surface area inside the Iceberg core. Reviewers can land Option B with the same backend confidence Option A had.
