Skip to content

Entlein/adaptive write perf#38

Open
entlein wants to merge 94 commits into
mainfrom
entlein/adaptive-write-perf
Open

Entlein/adaptive write perf#38
entlein wants to merge 94 commits into
mainfrom
entlein/adaptive-write-perf

Conversation

@entlein
Copy link
Copy Markdown

@entlein entlein commented May 9, 2026

Summary: rewrite the adaptive write

Test Plan: see .local until the gh workflows work

Type of change: /kind refactor

ddelnano and others added 30 commits April 12, 2026 18:57
…house

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ntext

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…e context prometheus backends and test out the write clickhouse experiment

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ging

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
…e schema_creation option , probably needs complete rewrite, this is to make it work e2e for the lab now

Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
@entlein entlein temporarily deployed to pr-actions-approval May 17, 2026 12:37 — with GitHub Actions Inactive
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/e2e_test/perf_tool/pkg/suites/suites.go (1)

108-111: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove hardcoded credential fallbacks from ClickHouse DSNs.

Line 110 and Line 221 embed default credentials in source. Require runtime-provided secrets and fail fast when missing.

🔐 Proposed fix
 	clickhouseDSN := os.Getenv("CLICKHOUSE_DSN")
 	if clickhouseDSN == "" {
-		clickhouseDSN = "pixie:pixie_password@clickhouse.forensic.austrianopencloudcommunity.org:9000/default"
+		panic("CLICKHOUSE_DSN environment variable is required")
 	}
@@
 	clickhouseCreds := os.Getenv("SOC_CH_CREDS")
 	if clickhouseCreds == "" {
-		// Lab default — matches the public demo CH credentials. Override
-		// via SOC_CH_CREDS for any production / non-demo target.
-		clickhouseCreds = "pixie:pixie_password"
+		panic("SOC_CH_CREDS environment variable is required")
 	}

Also applies to: 217-222

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/perf_tool/pkg/suites/suites.go` around lines 108 - 111, Remove
the hardcoded ClickHouse DSN fallback and require the CLICKHOUSE_DSN environment
variable to be present; in the code that sets clickhouseDSN (variable
clickhouseDSN in suites.go) delete the default `"pixie:pixie_password@..."`
value and instead check os.Getenv("CLICKHOUSE_DSN") and fail fast (log.Fatalf or
return an error) when it's empty so runtime-provided secrets are mandatory;
apply the same change to the other occurrence around the second block (the
repeated clickhouseDSN assignment) so no hardcoded credentials remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/e2e_test/perf_tool/pkg/suites/suites.go`:
- Around line 108-111: Remove the hardcoded ClickHouse DSN fallback and require
the CLICKHOUSE_DSN environment variable to be present; in the code that sets
clickhouseDSN (variable clickhouseDSN in suites.go) delete the default
`"pixie:pixie_password@..."` value and instead check os.Getenv("CLICKHOUSE_DSN")
and fail fast (log.Fatalf or return an error) when it's empty so
runtime-provided secrets are mandatory; apply the same change to the other
occurrence around the second block (the repeated clickhouseDSN assignment) so no
hardcoded credentials remain.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: eea1556b-abb4-4dd6-92cb-35bb33a83fe9

📥 Commits

Reviewing files that changed from the base of the PR and between 023c086 and 357864f.

📒 Files selected for processing (2)
  • .github/workflows/perf_soc_attack.yaml
  • src/e2e_test/perf_tool/pkg/suites/suites.go

@entlein entlein temporarily deployed to pr-actions-approval May 17, 2026 14:26 — with GitHub Actions Inactive
@entlein entlein temporarily deployed to pr-actions-approval May 17, 2026 15:07 — with GitHub Actions Inactive
@entlein entlein temporarily deployed to pr-actions-approval May 17, 2026 18:16 — with GitHub Actions Inactive
@entlein entlein temporarily deployed to pr-actions-approval May 17, 2026 18:19 — with GitHub Actions Inactive
@entlein entlein temporarily deployed to pr-actions-approval May 17, 2026 21:08 — with GitHub Actions Inactive
@entlein
Copy link
Copy Markdown
Author

entlein commented May 17, 2026

@CodeRabbit review ALL of this, ignore .local
Pay special intention on overall consistency

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

🧩 Analysis chain

@entlein, I'll perform a thorough review. Let me first gather some information from the codebase to check for consistency issues.


🏁 Script executed:

#!/bin/bash
# Check callers of removed/changed functions
echo "=== Callers of ExecuteScript ==="
rg -n "ExecuteScript" --type=go -C2

echo ""
echo "=== Callers of ReadScriptDefinitions ==="
rg -n "ReadScriptDefinitions" --type=go -C2

echo ""
echo "=== Callers of DisableClickHousePlugin ==="
rg -n "DisableClickHousePlugin" --type=go -C2

echo ""
echo "=== Callers of UpdateDataRetentionScript ==="
rg -n "UpdateDataRetentionScript" --type=go -C2

echo ""
echo "=== Callers of NewWorkload (check skaffoldStderrFile param) ==="
rg -n "NewWorkload" --type=go -C3

echo ""
echo "=== Callers of NewMetricsRecorder ==="
rg -n "NewMetricsRecorder" --type=go -C3

echo ""
echo "=== BuiltinTables count in tables.go ==="
grep -n "TableSpec{" src/vizier/services/adaptive_export/internal/pxl/tables.go | wc -l
cat src/vizier/services/adaptive_export/internal/pxl/tables.go

echo ""
echo "=== tables_test BuiltinTables count assertion ==="
grep -n "12\|13\|14\|len(Builtin" src/vizier/services/adaptive_export/internal/pxl/tables_test.go

echo ""
echo "=== KnownTables in ddl.go ==="
grep -n "KnownTables\|http_events\|http2_messages\|dns_events\|redis_events\|mysql_events\|pgsql_events\|cql_events\|mongodb_events\|kafka_events\|amqp_events\|mux_events\|tls_events\|adaptive_attribution\|kubescape\|alerts" src/vizier/services/adaptive_export/internal/clickhouse/ddl.go | head -50

echo ""
echo "=== PixieTables in ddl.go ==="
grep -n "PixieTables\|pixieTables\|Pixie" src/vizier/services/adaptive_export/internal/clickhouse/ddl.go | head -20

echo ""
echo "=== OperatorOwnedTables in apply.go ==="
grep -n "OperatorOwnedTables\|forensic" src/vizier/services/adaptive_export/internal/clickhouse/apply.go | head -30

echo ""
echo "=== Experiment name format in suites.go ==="
grep -n "redis-attack\|experiment_name\|Nx\|multiplier" src/e2e_test/perf_tool/pkg/suites/suites.go | head -20

echo ""
echo "=== Experiment name in perf_soc_attack.yaml ==="
grep -n "redis-attack\|experiment_name" .github/workflows/perf_soc_attack.yaml

echo ""
echo "=== go.mod replace directive ==="
grep -n "replace\|protobuf" go.mod

echo ""
echo "=== Check if SOC_VIZIER_EXISTING is handled in sovereign_soc.go ==="
grep -n "SOC_VIZIER_EXISTING\|VIZIER_EXISTING" src/e2e_test/perf_tool/pkg/suites/sovereign_soc.go

echo ""
echo "=== Check skaffold_vizier opt profile ==="
grep -n "x86_64_sysroot\|compilation_mode" skaffold/skaffold_vizier.yaml

echo ""
echo "=== Check if NewRunner callers pass correct args ==="
rg -n "NewRunner" --type=go -C3

echo ""
echo "=== Check ClickHouseReadLoadWorkload usage (placeholder skaffold path) ==="
grep -n "ClickHouseReadLoadWorkload\|placeholder\|skaffold" src/e2e_test/perf_tool/pkg/suites/workloads.go | head -20

echo ""
echo "=== Check UINT128 format change in clickhouse_export_sink_node.cc ==="
grep -n "UINT128\|uuid\|sole\|absl::Substitute\|high64\|low64" src/carnot/exec/clickhouse_export_sink_node.cc | head -20

echo ""
echo "=== Check PEMHostnameUDF executor type ==="
grep -n "UDF_PEM\|UDF_KELVIN\|Executor" src/carnot/funcs/metadata/metadata_ops.h | head -20

echo ""
echo "=== Check PrometheusScrapeSpec new fields usage ==="
grep -rn "kubeconfig_path\|kube_context\|KubeconfigPath\|KubeContext" --include="*.go" | head -20

echo ""
echo "=== Check skip_namespace_delete usage ==="
grep -rn "skip_namespace_delete\|SkipNamespaceDelete" --include="*.go" --include="*.yaml" --include="*.proto" | head -20

echo ""
echo "=== Check exporter interface implementations ==="
rg -n "implements Exporter\|var _ Exporter\|var _ exporter" --type=go | head -10

echo ""
echo "=== ParquetLocalExporter compile-time assert ==="
grep -n "_ Exporter\|var _" src/e2e_test/perf_tool/pkg/exporter/parquet_local_exporter.go

echo ""
echo "=== Check for compile-time assert in parquet_exporter.go and bq_exporter.go ==="
grep -n "_ Exporter\|var _" src/e2e_test/perf_tool/pkg/exporter/parquet_exporter.go src/e2e_test/perf_tool/pkg/exporter/bq_exporter.go 2>/dev/null

echo ""
echo "=== Check SetKeepOnFailure usage ==="
grep -rn "SetKeepOnFailure\|keepOnFailure\|keep_on_failure" --include="*.go" | head -20

echo ""
echo "=== Check max_retries in workflow ==="
grep -n "max_retries\|keep_on_failure" .github/workflows/perf_soc_attack.yaml

echo ""
echo "=== Check adaptive_attribution columns in insert_test ==="
grep -n "adaptive_attribution\|expected.*col" src/vizier/services/adaptive_export/internal/clickhouse/insert_test.go | head -20

echo ""
echo "=== Check SovereignSOCRedisAttackExperiment params ==="
grep -n "SovereignSOCRedisAttackExperiment" src/e2e_test/perf_tool/pkg/suites/suites.go | head -5

echo ""
echo "=== Check experiment name format from sovereignSOCSuite ==="
grep -n "redis-attack\|fmt.Sprintf\|experiment.*name\|Name" src/e2e_test/perf_tool/pkg/suites/suites.go | head -20

echo ""
echo "=== Check delete.go changes for API aggregation ==="
grep -n "getAggregatedGroupVersions\|apiServiceGVR\|aggregated" src/utils/shared/k8s/delete.go | head -20

echo ""
echo "=== Check apply.go REST mapper scope change ==="
grep -n "RESTScopeNameRoot\|namespace\|cluster.*scope\|mapping.Scope" src/utils/shared/k8s/apply.go | head -20

Length of output: 256984


🏁 Script executed:

#!/bin/bash
echo "=== Check tables.go BuiltinTables full listing ==="
cat src/vizier/services/adaptive_export/internal/pxl/tables.go 2>/dev/null || echo "File not found"

echo ""
echo "=== Check tables_test.go content ==="
cat src/vizier/services/adaptive_export/internal/pxl/tables_test.go 2>/dev/null | head -60

echo ""
echo "=== Check ddl.go KnownTables listing ==="
cat src/vizier/services/adaptive_export/internal/clickhouse/ddl.go 2>/dev/null

echo ""
echo "=== Check insert_test.go adaptive_attribution columns ==="
cat src/vizier/services/adaptive_export/internal/clickhouse/insert_test.go 2>/dev/null | head -60

echo ""
echo "=== Check sink/clickhouse.go ==="
cat src/vizier/services/adaptive_export/internal/sink/clickhouse.go 2>/dev/null | head -100

echo ""
echo "=== Check cmd/main.go for removed function callers ==="
grep -n "ReadScriptDefinitions\|ExecuteScript\|DisableClickHousePlugin\|UpdateDataRetentionScript" src/vizier/services/adaptive_export/cmd/main.go 2>/dev/null

echo ""
echo "=== Check if parquet_exporter has compile-time interface assertion ==="
grep -n "_ Exporter\|_ exporter\|interface" src/e2e_test/perf_tool/pkg/exporter/parquet_exporter.go 2>/dev/null | head -10

echo ""
echo "=== Check bq_exporter compile-time assertion ==="
grep -n "_ Exporter\|_ exporter\|interface" src/e2e_test/perf_tool/pkg/exporter/bq_exporter.go 2>/dev/null | head -10

echo ""
echo "=== Check run.go for exporter usage ==="
grep -n "exporter\|Exporter\|ExportSpec\|ExportResults" src/e2e_test/perf_tool/pkg/run/run.go 2>/dev/null | head -30

echo ""
echo "=== Check cmd/run.go for createExporter ==="
grep -n "createExporter\|skaffoldStderr\|keepOnFailure\|KeepOnFailure\|metricsExporter" src/e2e_test/perf_tool/cmd/run.go 2>/dev/null | head -30

echo ""
echo "=== Check clickhouse_export_sink_node.cc for UINT128 ==="
grep -n "UINT128\|sole\|uuid\|Substitute\|high\|low" src/carnot/exec/clickhouse_export_sink_node.cc 2>/dev/null | head -20

echo ""
echo "=== Check sovereign_soc.go for SOC_VIZIER_EXISTING ==="
grep -n "SOC_VIZIER_EXISTING\|existing\|Existing" src/e2e_test/perf_tool/pkg/suites/sovereign_soc.go 2>/dev/null | head -20

echo ""
echo "=== Check experiment name consistency ==="
grep -n "redis-attack\|redis_attack\|redis4x\|4x\|multiplier\|Multiplier\|fmt.Sprintf" src/e2e_test/perf_tool/pkg/suites/suites.go 2>/dev/null | head -30

echo ""
echo "=== Check ClickHouseReadLoadWorkload for placeholder ==="
cat src/e2e_test/perf_tool/pkg/suites/workloads.go 2>/dev/null | grep -A 20 "ClickHouseReadLoadWorkload"

echo ""
echo "=== Check parquet_local vs parquet_gcs batch size defaults ==="
grep -n "batchSize\|BatchSize\|batch_size" src/e2e_test/perf_tool/pkg/exporter/parquet_local_exporter.go src/e2e_test/perf_tool/pkg/exporter/parquet_exporter.go 2>/dev/null | head -20

echo ""
echo "=== Check perf_clickhouse.yaml for experiment name ==="
grep -n "experiment\|suite\|redis\|clickhouse" .github/workflows/perf_clickhouse.yaml 2>/dev/null | head -20

echo ""
echo "=== Check if http2_messages.beta is in BuiltinTables ==="
grep -n "http2\|beta\|dotted\|http2_messages" src/vizier/services/adaptive_export/internal/pxl/tables.go src/vizier/services/adaptive_export/internal/clickhouse/ddl.go 2>/dev/null | head -20

echo ""
echo "=== Check kubescape vs kubescape_logs in apply.go OperatorOwnedTables ==="
cat src/vizier/services/adaptive_export/internal/clickhouse/apply.go 2>/dev/null | grep -A 30 "OperatorOwnedTables"

echo ""
echo "=== Check source_connector LOG_IF change ==="
grep -n "LOG_IF\|DFATAL\|PushData\|batch refused\|every_n" src/stirling/core/source_connector.cc 2>/dev/null | head -20

echo ""
echo "=== Check md_udtfs_impl for cluster_name consistency ==="
grep -n "cluster_name\|ClusterName\|ON CLUSTER\|ReplicatedMergeTree" src/vizier/funcs/md_udtfs/md_udtfs_impl.h 2>/dev/null | head -20

echo ""
echo "=== Check if px._pem_hostname is documented/tested ==="
grep -rn "_pem_hostname\|PEMHostname" --include="*.cc" --include="*.h" --include="*.pxl" | head -10

echo ""
echo "=== Check pxl/queryfor.go for IsBuiltin usage ==="
grep -n "IsBuiltin\|KnownTables\|BuiltinTables" src/vizier/services/adaptive_export/internal/pxl/queryfor.go 2>/dev/null | head -10

Length of output: 26639


@entlein

High-level summary

  • Scope: I reviewed all changes outside .local as requested. The PR lands a substantial adaptive-write/perf-tool refactor, ClickHouse schema + sink/operator surfaces, e2e suites/workflows, and a few C++/Stirling/K8s utility tweaks.
  • Overall: The wiring is coherent and most cross-file API changes are consistently updated. I found a handful of consistency nits and a couple of small but concrete fixes you can land quickly.

Potential blockers or must-fix consistency items

  1. “13 tables” vs 12 tables inconsistency
  • pxl/tables.go says “enumerates the 13 pixie socket_tracer tables,” but BuiltinTables has 12 entries (http_events, http2_messages.beta, dns_events, redis_events, mysql_events, pgsql_events, cql_events, mongodb_events, kafka_events.beta, amqp_events, mux_events, tls_events). Tests explicitly guard for 12.
  • clickhouse/ddl.go has a similar header comment claiming 13 tables; PixieTables() also returns 12.
    Fix: adjust comments to 12 to match the code and tests.
  1. New operator-owned table trigger_watermark
  • KnownTables and OperatorOwnedTables include trigger_watermark (in addition to adaptive_attribution). Ensure any tests or docs that assert “last table is adaptive_attribution” are updated to reflect the new tail item, and that schema.sql contains the DDL (it appears wired via KnownTables/DDL()).
  1. Unused include after UINT128 formatting change
  • src/carnot/exec/clickhouse_export_sink_node.cc still includes sole.hpp, but sole is no longer used after switching UINT128 -> "high:low" strings. Please remove the include to avoid compile warnings.
  1. Exporter interface assertions
  • parquet_local_exporter.go has var _ Exporter = (*ParquetLocalExporter)(nil). The BQ and Parquet GCS exporters don’t have compile-time assertions. Adding them makes refactors safer.

Important consistency checks that look good

  • Callers updated for signature changes:
    • deploy.NewWorkload(pxCtx, containerRegistryRepo, skaffoldStderrFile, …) used at run.go Line 375 and Line 386.
    • metrics.NewMetricsRecorder now returns (Recorder, error) and callers handle errors (run.go Line 260).
    • steps.NewSkaffoldDeploy receives stderrFile via NewWorkload.
  • prom_recorder_override name matches suites constants; workflow passes 'clickhouse-operator' as expected.
  • SOC_VIZIER_EXISTING=1 is supported in sovereign_soc.go and used by perf_soc_attack.yaml.
  • K8s apply: scope detection now uses REST mapper scope; delete: skips aggregated APIServices — both are sound improvements for cluster hygiene.
  • The ClickHouse sink validates table identifiers to prevent SQL injection and provides a robust HTTP path.
  • Stirling DFATAL→LOG_IF_EVERY_N(ERROR, 100) change addresses aborts under backpressure without hiding errors.

Nice-to-have improvements and small follow-ups

  • go.mod vs Bazel protobuf versions: require google.golang.org/protobuf v1.34.2 but replace to v1.29.1, and go_deps.bzl also sets a replace. If intentional (to keep generated code compatible), consider documenting why we pin to v1.29.1. Otherwise, align require/replace to the same version to avoid confusion.
  • ClickHouseReadLoadWorkload uses a placeholder Skaffold path (src/e2e_test/clickhouse_read_load/skaffold.yaml). The “clickhouse-export” workflow doesn’t exercise it, but the “clickhouse-read” experiment would fail to deploy. Either gate that experiment from suites until the path exists, or add a sanity check that errors early with a helpful message.
  • Parquet exporters: your constructors enforce batchSize > 0 (good). If createExporter allows omitting batch size via CLI, set a sane default (e.g., 1k–5k) to prevent foot-guns.
  • Docs/comments: Update the “13 tables” narrative in both pxl/tables.go and clickhouse/ddl.go headers; it’s the primary consistency nit I saw.
  • Tests: If you have a test that assumed OperatorOwnedTables ends with adaptive_attribution (earlier versions did), change it to assert that adaptive_attribution and trigger_watermark are present in OperatorOwnedTables, and that kubescape tables are absent (you already guard kubescape elsewhere).

Targeted diffs to apply

  1. Fix the “13 tables” comments
  • src/vizier/services/adaptive_export/internal/pxl/tables.go
-// BuiltinTables enumerates the 13 pixie socket_tracer tables the
+// BuiltinTables enumerates the 12 pixie socket_tracer tables the
  • src/vizier/services/adaptive_export/internal/clickhouse/ddl.go
-// the 13 socket_tracer tables Pixie's retention plugin writes (which
+// the 12 socket_tracer tables Pixie's retention plugin writes (which
  1. Remove unused include in clickhouse export sink
  • src/carnot/exec/clickhouse_export_sink_node.cc
-#include <sole.hpp>
  1. Add compile-time interface assertions for exporters
  • src/e2e_test/perf_tool/pkg/exporter/bq_exporter.go
 // Close implements exporter.Exporter. No-op for BigQuery.
 func (e *BQExporter) Close() error { return nil }
+
+// Compile-time interface assertion.
+var _ Exporter = (*BQExporter)(nil)
  • src/e2e_test/perf_tool/pkg/exporter/parquet_exporter.go
 type ParquetGCSExporter struct {
   // ...
 }
 
 // Close implements exporter.Exporter.
 func (e *ParquetGCSExporter) Close() error { return e.client.Close() }
+
+// Compile-time interface assertion.
+var _ Exporter = (*ParquetGCSExporter)(nil)

Optional: default batch size guard in createExporter (if not already there)

  • src/e2e_test/perf_tool/cmd/run.go (inside createExporter)
-  // use provided batchSize
+  // Default parquet batch size if not provided
+  if batchSize <= 0 {
+    batchSize = 2000
+  }

Compatibility notes to be aware of

  • UINT128 “high:low” string export: verify any downstream readers/parsers aren’t expecting UUID-encoded strings. If docs/examples mention UUIDs for these values, update them alongside this change.
  • CreateClickHouseSchemas UDTF gained a cluster_name init arg. If any existing PXL scripts call this UDTF without the new arg, they may need updating.

Verdict

  • With the small nits above (comments, an unused include, interface assertions, and ensuring test/docs reflect trigger_watermark), this is in good shape. The refactors hold together and the workflows/suites align with the code.

Would you like me to open a couple of follow-up issues (docs/cleanup: “12 vs 13 tables”; add exporter assertions; ClickHouseReadLoadWorkload placeholder gating), or push a tiny patch with the diffs above?

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.local/adaptive-write-rev3-plan.md:
- Line 22: There are unlabeled fenced code blocks (triple-backtick fences) in
the document that trigger MD040; locate the unlabeled fences and add an explicit
language tag—use the plain text tag "text" for prose/code snippets and "mermaid"
for diagram blocks—so each opening fence includes the appropriate language
identifier.

In @.local/deploy-patched-operator.sh:
- Line 11: The script leaves the TAG variable empty, causing invalid image
references like "adaptive_export:" during docker tag/save; update the script to
require or default TAG before image operations by checking the TAG variable
(e.g., in the top-level variable TAG) and either set a sensible default value or
exit with an error if it's empty, then use that validated TAG when creating
image references for the docker tag/save steps (the lines that construct
"adaptive_export:${TAG}" and related commands).

In @.local/deploy-patched-operator.sh.bak:
- Line 5: The script currently uses "set -uo pipefail" which does not enable
errexit; update the shell options so failures abort the script by adding errexit
(e) to that invocation (e.g., change the options used in the existing set
command that currently reads "set -uo pipefail" to include -e), ensuring any
non-zero command in the deployment flow stops execution immediately.

In `@src/vizier/services/adaptive_export/cmd/main.go`:
- Around line 181-184: The startup currently calls pixie.NewClient(ctx,
cfg.Pixie().APIKey(), cfg.Pixie().Host()) and fatally exits on error via
log.WithError(err).Fatal, which makes optional Pixie cloud integration
mandatory; change this to log a warning (e.g., log.WithError(err).Warn("..."))
and allow pluginClient to remain nil so startup continues in direct-mode; ensure
any later use of pluginClient (the variable returned by pixie.NewClient) checks
for nil before invoking methods. Apply the same change to the other Pixie client
creation block(s) referenced (replace Fatal with Warn and add nil-safety checks
where pluginClient is used).

In `@src/vizier/services/adaptive_export/internal/activeset/activeset.go`:
- Around line 150-183: Add a race-free combined subscribe+snapshot helper (e.g.
SubscribeAndSnapshot or SubscribeWithSnapshot) that atomically registers the
subscriber channel and returns the current membership slice plus version: take
subsMu then mu (or use a single lock ordering) to append the new buffered
channel into s.subs, then build the members slice and capture s.version before
releasing locks so no membership change can be missed between Subscribe and
Snapshot; accept the same buffer int as Subscribe, ensure returned version is
the exact s.version to allow callers to ignore deltas <= that version, and keep
Subscribe as-is for backwards compatibility.
- Around line 92-110: The Upsert method currently increments s.version for
t_end-only extensions causing false version gaps; change it so s.version (and v)
is only incremented/assigned when membership actually changes (i.e., when
!existed and a new key is inserted). Concretely, inside Upsert, keep the mutex
logic and the early-return for non-extensions, but move the s.version++ and v :=
s.version into the branch that handles new additions (the !existed path) while
still reading/updating s.members under s.mu; then release the lock and call
s.broadcast(Delta{Added: []Key{k}, Version: v}) only when a new member was
added. Ensure no version change occurs for the extension case (existed && tEnd >
prev).

In `@src/vizier/services/adaptive_export/internal/clickhouse/ddl.go`:
- Around line 17-20: Update the stale "13" socket_tracer count to "12" in the
package-level comment and the KnownTables comment so the text matches the actual
registry; locate the package comment at the top of
src/vizier/services/adaptive_export/internal/clickhouse/ddl.go and the
KnownTables comment (referencing KnownTables) and change any instances of "13"
to "12".

In `@src/vizier/services/adaptive_export/internal/clickhouse/schema.sql`:
- Around line 73-76: Update the header comment that currently reads "13 Pixie
socket_tracer tables" to "12 Pixie socket_tracer tables" so it matches the
actual number of tables defined below; ensure the surrounding comment about
namespace + pod population via px.upid_to_namespace / px.upid_to_pod_name
remains unchanged so the SQL header and any Go table registry stay in sync.
- Around line 52-70: kubescape_logs currently defines event_time as UInt64 but
uses toDateTime(event_time) in PARTITION BY and TTL which misinterprets
milliseconds/nanoseconds; match other tables by making event_time a proper
DateTime64(3, 'UTC') (or add a computed normalized column) and use that for
ORDER/BY, PARTITION BY and TTL. Update the CREATE TABLE for
forensic_db.kubescape_logs to either (A) change the event_time column type from
UInt64 to DateTime64(3, 'UTC') and replace toDateTime(event_time) usages with
the column directly, or (B) add a computed column (e.g., normalized_event_time
DateTime64(3, 'UTC') AS toDateTime(eventTimeToTime(event_time) or a
normalization expression) and change ORDER BY, PARTITION BY and TTL to use
normalized_event_time; reference kubescape_logs.event_time and the controller
function eventTimeToTime() when implementing the normalization.

In `@src/vizier/services/adaptive_export/internal/controller/controller.go`:
- Around line 152-156: The OnPrune callback is being invoked for each expired
anomaly hash even when other hashes for the same (namespace, pod) remain active;
update the pruning logic (where c.active is modified and PruneExpired is
handled) to only call OnPrune(namespace, pod) after you confirm no remaining
entries in c.active refer to that same namespace/pod (i.e., check remaining
values in c.active for matching namespace and pod or maintain a
per-(namespace,pod) reference/count); ensure this change touches the code paths
that currently call OnPrune (the PruneExpired handling and the related block
around c.active manipulation) so pods are only evicted from the rev-3 ActiveSet
when the last active hash for that pod is removed.
- Around line 200-205: The empty-result suppression cache (emptyStreak and
emptySkipUntil guarded by emptyCacheMu) currently builds keys as "pod|table"
which omits the namespace, causing different namespaces on the same node to
share suppression state; update the key construction to include namespace (e.g.
"namespace|pod|table" or otherwise include ns) wherever the key is created/used
(search for code that concatenates pod and table into a key for
emptyStreak/emptySkipUntil) so each (namespace,pod,table) is tracked separately
and adjust any helper/lookup code that reads or writes those maps accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: d675041f-bd68-4477-a051-afe2fdeff456

📥 Commits

Reviewing files that changed from the base of the PR and between 023c086 and 8f94949.

📒 Files selected for processing (34)
  • .github/workflows/perf_soc_attack.yaml
  • .local/adaptive-export-architecture.mmd
  • .local/adaptive-write-rev3-architecture.mmd
  • .local/adaptive-write-rev3-learnings.md
  • .local/adaptive-write-rev3-plan.md
  • .local/deploy-patched-operator.sh
  • .local/deploy-patched-operator.sh.bak
  • k8s/vizier/bootstrap/adaptive_export_deployment.yaml
  • src/e2e_test/perf_tool/pkg/suites/suites.go
  • src/e2e_test/protocol_loadtest/pgsql_client/main.go
  • src/e2e_test/vizier/seq_tests/client/pkg/pgsqlclient/client.go
  • src/vizier/services/adaptive_export/cmd/BUILD.bazel
  • src/vizier/services/adaptive_export/cmd/main.go
  • src/vizier/services/adaptive_export/internal/activeset/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/activeset/activeset.go
  • src/vizier/services/adaptive_export/internal/activeset/activeset_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/apply.go
  • src/vizier/services/adaptive_export/internal/clickhouse/apply_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/ddl.go
  • src/vizier/services/adaptive_export/internal/clickhouse/schema.sql
  • src/vizier/services/adaptive_export/internal/controller/controller.go
  • src/vizier/services/adaptive_export/internal/sink/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/sink/clickhouse.go
  • src/vizier/services/adaptive_export/internal/streaming/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/streaming/filter.go
  • src/vizier/services/adaptive_export/internal/streaming/filter_test.go
  • src/vizier/services/adaptive_export/internal/streaming/scanner.go
  • src/vizier/services/adaptive_export/internal/streaming/scanner_test.go
  • src/vizier/services/adaptive_export/internal/streaming/supervisor.go
  • src/vizier/services/adaptive_export/internal/streaming/writer.go
  • src/vizier/services/adaptive_export/internal/trigger/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
  • src/vizier/services/adaptive_export/internal/trigger/watermark.go
  • src/vizier/services/adaptive_export/internal/trigger/watermark_test.go


## Components

```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add fence languages for the unlabeled code blocks.

At Line 22 and Line 119, fenced blocks are missing a language tag (MD040). Add text or mermaid as appropriate to keep markdownlint clean.

Also applies to: 119-119

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 22-22: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.local/adaptive-write-rev3-plan.md at line 22, There are unlabeled fenced
code blocks (triple-backtick fences) in the document that trigger MD040; locate
the unlabeled fences and add an explicit language tag—use the plain text tag
"text" for prose/code snippets and "mermaid" for diagram blocks—so each opening
fence includes the appropriate language identifier.

Comment thread .local/deploy-patched-operator.sh Outdated
# deploy-patched-operator.sh — bazel-build adaptive_export with our two
# patches (prune-grace + 180s gRPC timeout in controller.go) and roll the
# deployment onto the new image. Idempotent.
set -uo pipefail
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enable errexit so failed build/deploy steps cannot silently continue.

At Line 5, set -uo pipefail still allows the script to continue after a non-zero command. In this deployment flow, that can roll out an old/bad image.

Proposed fix
-set -uo pipefail
+set -euo pipefail
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
set -uo pipefail
set -euo pipefail
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.local/deploy-patched-operator.sh.bak at line 5, The script currently uses
"set -uo pipefail" which does not enable errexit; update the shell options so
failures abort the script by adding errexit (e) to that invocation (e.g., change
the options used in the existing set command that currently reads "set -uo
pipefail" to include -e), ensuring any non-zero command in the deployment flow
stops execution immediately.

Comment thread src/vizier/services/adaptive_export/cmd/main.go
Comment thread src/vizier/services/adaptive_export/internal/activeset/activeset.go
Comment thread src/vizier/services/adaptive_export/internal/clickhouse/ddl.go
Comment on lines +52 to +70
CREATE TABLE IF NOT EXISTS forensic_db.kubescape_logs (
BaseRuntimeMetadata String,
CloudMetadata String,
RuleID String,
RuntimeK8sDetails String,
RuntimeProcessDetails String,
event String,
event_time UInt64,
hostname String,
level String DEFAULT '',
message String DEFAULT '',
msg String DEFAULT '',
processtree_depth String DEFAULT '',
anomaly_hash String DEFAULT ''
) ENGINE = MergeTree()
ORDER BY (event_time, hostname)
PARTITION BY toYYYYMM(toDateTime(event_time))
TTL toDateTime(event_time) + INTERVAL 30 DAY DELETE
SETTINGS index_granularity = 8192;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n -C2 'eventTimeToTime|seconds|millis|nanos|event_time' \
  src/vizier/services/adaptive_export/internal/controller/controller.go \
  src/vizier/services/adaptive_export/internal/clickhouse/schema.sql

Repository: k8sstormcenter/pixie

Length of output: 17700


Fix kubescape_logs.event_time to use consistent datetime type.

kubescape_logs.event_time is stored as UInt64 but the PARTITION BY and TTL clauses invoke toDateTime(event_time) without normalizing the unit first. The controller's eventTimeToTime() function (lines 619–627) explicitly auto-detects whether the value is in seconds, millis, or nanos based on magnitude. If a row contains milliseconds, toDateTime() will misinterpret it as seconds, producing incorrect partitioning and potentially incorrect TTL expiration.

All other Pixie observation tables use DateTime64(3, 'UTC') for event_time; kubescape_logs should match. Either convert event_time to DateTime64(3, 'UTC') at schema level, or add a normalized computed column and use that for PARTITION BY and TTL.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/clickhouse/schema.sql` around
lines 52 - 70, kubescape_logs currently defines event_time as UInt64 but uses
toDateTime(event_time) in PARTITION BY and TTL which misinterprets
milliseconds/nanoseconds; match other tables by making event_time a proper
DateTime64(3, 'UTC') (or add a computed normalized column) and use that for
ORDER/BY, PARTITION BY and TTL. Update the CREATE TABLE for
forensic_db.kubescape_logs to either (A) change the event_time column type from
UInt64 to DateTime64(3, 'UTC') and replace toDateTime(event_time) usages with
the column directly, or (B) add a computed column (e.g., normalized_event_time
DateTime64(3, 'UTC') AS toDateTime(eventTimeToTime(event_time) or a
normalization expression) and change ORDER BY, PARTITION BY and TTL to use
normalized_event_time; reference kubescape_logs.event_time and the controller
function eventTimeToTime() when implementing the normalization.

Comment thread src/vizier/services/adaptive_export/internal/clickhouse/schema.sql Outdated
@entlein entlein temporarily deployed to pr-actions-approval May 17, 2026 21:46 — with GitHub Actions Inactive
@entlein
Copy link
Copy Markdown
Author

entlein commented May 17, 2026

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@entlein entlein temporarily deployed to pr-actions-approval May 17, 2026 22:03 — with GitHub Actions Inactive
@entlein entlein temporarily deployed to pr-actions-approval May 18, 2026 09:29 — with GitHub Actions Inactive
entlein added 4 commits May 18, 2026 13:46
- controller: skip-cache key now (namespace, pod, table) so same-named pods in
  different namespaces no longer share suppression state
- pxl/queryfor: pod-only filter switched to regex match `^[^/]+/<pod>$`
  because px.upid_to_pod_name returns "<ns>/<pod>" — bare-pod equality always
  missed
- pxl/tables: BuiltinTables → unexported builtinTables; Builtins() and
  Registry.Tables() return defensive copies (with mutation regression tests)
- anomaly/hash: length-prefixed canonical encoding instead of ":"-join so
  inputs containing ":" or empty values can't collide (with regression test);
  Target NumField() guard added so adding a field fails loudly
- pixieapi.NewDirect: now returns error if PX_DISABLE_TLS=1 missing for a
  cluster.local addr instead of pxapi log.Fatal'ing at first Query
- cmd: pixie cloud client failure downgraded to Warn (was Fatal); plugin
  enablement and preset install steps are skipped when the client is nil
- ddl/schema: "13 socket_tracer tables" → "12" everywhere
- clickhouse/integration_test: separate ctx per Apply() call in idempotency
  test to avoid context expiry masking the property
- clickhouse/apply_test: assert NewApplier error in every test (was _ :=)
- e2e_test: wait for controller goroutine drain after cancel
- .arclint: exclude .local/ from flake8 (linter-level, not config-level —
  the latter is ignored when arc passes files explicitly) and add the
  helm-rendered/ exclusion to keep machine-generated YAML out of scope
- .flake8rc: same .local/ exclusion (kept in sync with .arclint)
- workflows/yaml under perf_tool/sovereign-soc: yamllint indentation
  fixes (indent-sequences=false) + colon/comma normalization
- bob-suite-attack-cm.yaml: yamllint disable rule:line-length at file
  head — the embedded Lua/perl attack one-liners are intentionally on
  one source line so the bytes the eBPF detector sees match the
  upstream expectedDetections fixture
- workflows: split long tailscale-probe lines for line-length
- .local/render-*.py: typing.Any annotations + dict-shape pins for
  mypy 1.20+ on the sweep render scripts
- .local/deploy-patched-operator.sh: TAG=rev3-cr-fixes-2 so the build
  shipped with the CR-fix work has a stable tag
- .arclint: exclude .local/ from mypy linter — same as flake8 already
  excludes. CI's mypy 1.20.2 INTERNAL ERROR's on .local/render-matrix3.py
  (local mypy 1.20.2 doesn't reproduce, so the crash is a CI-side mypy
  binary issue; rather than chase a moving target on dev-only scripts,
  drop them from the linter the same way flake8 does)
- src/common/system/BUILD.bazel: scoped_namespace_test timeout
  "default(moderate)" → "long" + flaky=True. Timed out after 180.4s on
  BPF opt (6.1.106); root-required namespace setup under qemu-bpf is
  visibly racy
- src/carnot/exec/BUILD.bazel: clickhouse_source_node_test and
  clickhouse_export_sink_node_test now flaky=True. Both fail fast
  (12-16s) on both kernel matrix entries — pulls a 636 MiB clickhouse
  image then races a server start, so two attempts is a reasonable hedge
- src/stirling/source_connectors/dynamic_tracer/BUILD.bazel:
  dynamic_trace_bpf_test moderate→long + flaky=True. Failed once on
  6.1.106 only; same kind of slow-container-start flake we've seen on
  socket_tracer's BPF tests already (handled the same way there)

# Addresses CodeRabbit comments on PR #38
# (most flagged behaviour was already fixed in 0c6caec; the rabbit was
#  looking at the pre-fix tip 0c8bd1e):
#
#   src/vizier/services/adaptive_export/internal/controller/controller.go:200-205
#     Skip-cache key now includes namespace ("ns|pod|table").
#     Implemented in 0c6caec; regression test:
#     TestEmptyResultSkip_NamespaceIsolation.
#
#   src/vizier/services/adaptive_export/internal/controller/controller.go:152-156
#     OnPrune fires only after the LAST hash for that (ns,pod) is gone.
#     Implemented in 0c6caec; regression tests:
#     TestController_OnPrune_OnlyFiresWhenLastHashOnPodGone,
#     TestController_OnPrune_DoesNotFireWhileOtherHashesActive.
@entlein entlein temporarily deployed to pr-actions-approval May 18, 2026 16:37 — with GitHub Actions Inactive
Templated PxL scripts (e.g. healthcheck/redis_data_in_namespace.pxl uses
Go {{.Namespace}}) aren't valid Python — CI's mypy 1.20.2 INTERNAL ERROR's
trying to parse them. flake8-pxl already covers PxL via its own config
(.pxl.flake8rc); mypy was duplicating that include for no real benefit.

Reverts the (\.pxl$) include I added together with the .local/ exclude
in 0d09a8c; keeps the .local/ exclude.
@entlein entlein deployed to pr-actions-approval May 18, 2026 18:04 — with GitHub Actions Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants