
Commit 067bc74

committed
stitch redesign tmp
1 parent b65e754 commit 067bc74

11 files changed

Lines changed: 2355 additions & 0 deletions


CLAUDE.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
# Before Every Response
- NEVER describe how code works without reading it first (Read/Grep). If you didn't use a tool to check, say "I haven't verified this" or check first.
- NEVER use nested/inline imports. ALL imports go at the top of the file. Check your edits for this before submitting.
- Never remove breakpoints or uncomment code that was left commented out.

# Rules
- Do what's asked, nothing more/less. NEVER create files unless absolutely necessary.
- NEVER add comments about what code used to be or what was moved/removed.
- Follow instructions precisely. If asked to implement but not integrate, don't integrate.
- NEVER use unittest mocks — only mocker fixture.
- Always write vectorized numpy — no Python loops over arrays.
- Keep notebooks simple — short function calls only, all logic in modules.
- No patchwork — design complete algorithms from first principles.
- No fat VM solutions — hard constraint.
- Never create git commits — user commits themselves.
- Never modify user's code without asking first.
- Test code end-to-end before presenting.
- Terse responses — no trailing summaries.

# Project Context
Read `pychunkedgraph/debug/stitch_test/SESSION.md` for full stitch redesign context.
pychunkedgraph/debug/stitch_test/SESSION.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Stitch Redesign — Session Context

For a new Claude session to pick up this work, tell it:

> Read pychunkedgraph/debug/stitch_test/SESSION.md and design.md to understand the stitch redesign state.

## Key files

- `.env/stitching/design.md` — high-level algorithm design doc
- `pychunkedgraph/debug/stitch_test/design.md` — detailed algorithm description with phase breakdowns
- `pychunkedgraph/debug/stitch_test/proposed.py` — the proposed stitch implementation
- `pychunkedgraph/debug/stitch_test/wave.py` — unified test runner (single/wave/multiwave experiments)
- `pychunkedgraph/debug/stitch_test/utils.py` — structure extraction, batched parallel extraction, comparison functions
- `pychunkedgraph/debug/stitch_test/compare.py` — orchestration, persistence helpers
- `pychunkedgraph/debug/stitch_test/current.py` — wrapper for current `add_edges` baseline
- `pychunkedgraph/debug/stitch_test/tables.py` — BigTable backup/restore, env setup, autoscaling
- `.env/stitching/hsmith_mec.ipynb` — test notebook

## Module dependency order (no cycles)

tables → utils → {current, proposed} → compare → wave

- `utils.py` has pure functions: extract_structure, _compare_*, _convert_for_json, batched extraction, SV-based comparison
- `compare.py` has orchestration + persistence: imports from current, proposed, utils
- Never import from compare into utils
## Current status (2026-03-23)

### What works
- Proposed algorithm implemented and structurally correct (single-file match verified)
- Single file test: proposed ~151s vs current ~205s (1.35x speedup on this VM)
- Wave 0 current baseline: 606 files, 311K roots, ~1050s wall with 512 workers
- Wave 0 proposed: completed in 638s wall (1.64x speedup); structural comparison pending (comparison bug fixed, needs re-run)

### Extraction and comparison design
- **SV-based components**: `extract_structure` resolves L2 → SVs so components are frozensets of SV IDs (stable across tables, order-independent)
- **Compressed storage**: `np.savez_compressed` with flat arrays + offsets for variable-length SV sets (see the sketch after this list)
- **Independent extraction**: each side extracted into its own subdirectory (`current/`, `proposed/`)
- **Order-independent comparison**: uses sets of frozensets per layer, not sorted lists. No shard-to-shard matching needed.
- **No table deletion**: user manages table cleanup via prefix
### Retry safety
- **`_get_all_parents_filtered`**: replaces `get_all_parents_dict_multiple` for stitching. Applies `filter_failed_node_ids` at every layer during parent chain traversal to detect and remap orphaned nodes from prior failed stitch attempts.
- **`filter_failed_node_ids`**: applied to both `l2ids` and `l2_siblings` after reading their children.
- **Two-phase writes**: `_build_entries` returns `(node_entries, parent_entries)`. Node rows are written first, then Parent pointers, so Parent pointers only ever reference rows that exist (see the sketch after this list).
- **No FormerParent**: proposed path does not write FormerParent/deprecation entries.
- **Crash recovery**: `stitch_results.json` saved immediately after stitch completes. Pass `run_id` to resume.
- **Fresh runs**: `_clear_log_dir` deletes old results before restoring table.
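A minimal sketch of the two-phase write order; `_build_entries` and its two return halves come from the notes above, while `client.write` and the function name are assumptions, not the verified pychunkedgraph API:

```python
def write_new_hierarchy(client, node_entries, parent_entries):
    # Sketch only: `node_entries` / `parent_entries` are the two halves
    # returned by _build_entries in the proposed path.

    # Phase 1: create every new node row first.
    client.write(node_entries)

    # Phase 2: write Parent pointers. Every row they reference now exists,
    # so a crash between phases leaves only orphaned rows (remapped later
    # by filter_failed_node_ids on retry), never dangling Parent pointers.
    client.write(parent_entries)
```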
### Architecture decisions
- **No neighbor CrossChunkEdge updates**: stale is OK — future proposed stitches read AtomicCrossChunkEdge (immutable) + Parent + Child.
- **No locks**: lock-free, enables true parallelism within waves.
- **No table deletion**: user manages cleanup.
### Test infrastructure
- **Entry points**: `run_current(experiment)` and `run_proposed_and_compare(experiment, run_id=None)`
- **Experiment types**: "single" (one file), "wave" (wave 0), "multiwave" (all waves)
- **Extraction**: 500K-root batches, sharded across cpu_count workers, each saving its own .npz
- **Retries**: tenacity on extraction reads (3 attempts, exponential backoff; see the sketch after this list)
- **Workers**: `min(n_files, 4 * cpu_count)` for wave processing
- **Progress**: tqdm for wave file processing
- **Autoscaling**: for wave/multiwave, sets BigTable CPU target to 25% before, reverts to 60% after (in `finally`).
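The retry policy above plausibly maps onto tenacity's standard decorator; in this sketch the function name, the read call, and the wait bounds are illustrative, not values taken from utils.py:

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=30))
def _read_batch(cg, node_ids):
    # Any transient BigTable read error raises out of here, and tenacity
    # re-invokes the call with exponentially growing waits, up to 3 attempts.
    return cg.get_children(node_ids)
```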
### Performance

**Single file (task_0_0.edges, 1024 edges)**:
- Proposed ~151s vs current ~205s (1.35x)

**Wave 0 (606 files, 311K roots)**:
- Current: ~1050s wall
- Proposed: ~638s wall (1.64x)
- Proposed per-file: mean=245s, median=272s, p95=295s, max=399s (task_0_591.edges)
### Remaining work
- Re-run wave 0 comparison with fixed SV-based extraction
- Add incremental file result saving during wave runs
- Optimize proposed further (straggler task_0_591 took 399s)
- Run multiwave test once wave 0 validates

## User preferences (critical)
- **Never describe how code works without reading it first** — use Read/Grep, or say "I haven't verified this"
- **Never use nested/inline imports** — all imports at module top level, design modules to avoid circular deps
- **Never create commits** — user does them
- **Vectorized numpy** — no Python loops where numpy works
- **Keep notebooks simple** — short function calls only, all logic in modules
- **No patchwork** — design complete algorithms from first principles
- **No fat VMs** — hard constraint
- **Max effort always**
- **No mocks** — only mocker fixture
- **Test end-to-end before presenting**
- **Never modify user's code without asking**
- **Terse responses** — no trailing summaries
- **Never delete tables** — user manages cleanup via prefix
## Dataset
- **hsmith_mec**: 7 layers, ~600k edges, 1095 total files
- Wave 0: 606 files
- Edge source: `gs://dodam_exp/hammerschmith_mec/100GVx_cutout/proofreadable_exp16_0.26/agg_chunk_ext_edges`
- Backup table: `hsmith-mec-100gvx-exp16-0.26-backup`
- BigTable project: `zetta-proofreading`, instance: `pychunkedgraph`
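Loading one wave-0 edge file from this bucket follows the same pattern compare.py below uses (the pickle load and dtype cast are taken directly from that file; the specific task file is just an example):

```python
import pickle

import numpy as np
from cloudfiles import CloudFile

from pychunkedgraph.graph import basetypes

EDGES_SRC = "gs://dodam_exp/hammerschmith_mec/100GVx_cutout/proofreadable_exp16_0.26/agg_chunk_ext_edges"

# Each .edges file is a pickled array of atomic edge pairs.
edges = pickle.loads(CloudFile(f"{EDGES_SRC}/task_0_0.edges").get())
edges = np.asarray(edges, dtype=basetypes.NODE_ID)
```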
pychunkedgraph/debug/stitch_test/__init__.py

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
from .tables import restore_test_table
from .inspect import inspect_stitch_edges, inspect_l2_cross_edges, inspect_hierarchy
from .current import run_current_stitch
from .proposed import run_proposed_stitch
from .wave import run_current, run_proposed_and_compare, list_wave_files, list_all_waves
pychunkedgraph/debug/stitch_test/compare.py

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
import json
import pickle
import secrets
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
from cloudfiles import CloudFile

from pychunkedgraph.graph import basetypes

from .current import run_current_stitch
from .proposed import run_proposed_stitch
from .tables import restore_test_table, setup_env, PREFIX, EDGES_SRC, _get_instance
from .utils import _compare_components, _compare_cross_edges, _convert_for_json

LOGS_ROOT = Path("/home/akhilesh/opt/zetta_utils/.env/pcg/.env/stitching/runs")


def generate_run_id() -> str:
    return secrets.token_hex(4)


# ─────────────────────────────────────────────────────────────────────
# Top-level API
# ─────────────────────────────────────────────────────────────────────


def run_current_baseline(experiment: str = "single", edge_file: str | None = None):
    """
    Run the current stitch path once for an experiment type.
    If the table + saved results already exist, skips and prints "reusing".
    """
    setup_env()
    if edge_file is None:
        edge_file = f"{EDGES_SRC}/task_0_0.edges"

    table_name = f"{PREFIX}hsmith_mec_current_{experiment}"
    log_dir = LOGS_ROOT / experiment / "current"
    log_dir.mkdir(parents=True, exist_ok=True)
    structure_path = log_dir / "current_structure.json"

    instance = _get_instance()
    if instance.table(table_name).exists() and structure_path.exists():
        print(f"reusing {table_name}")
        return

    print(f"restoring and running current path for '{experiment}'")
    restore_test_table(table_name)
    edges = pickle.loads(CloudFile(edge_file).get())
    edges = np.asarray(edges, dtype=basetypes.NODE_ID)
    result = run_current_stitch(table_name, edges, do_sanity_check=False)
    _save_run_result(log_dir, "current", result)
    print(f"current {experiment} done: {result['elapsed']:.1f}s")


def run_proposed_and_compare(experiment: str = "single", edge_file: str | None = None):
    """
    Run the proposed stitch path and compare against the current baseline.
    Returns (match, result_current, result_proposed).
    """
    setup_env()
    if edge_file is None:
        edge_file = f"{EDGES_SRC}/task_0_0.edges"

    run_id = generate_run_id()
    log_dir = LOGS_ROOT / experiment / run_id
    log_dir.mkdir(parents=True, exist_ok=True)
    table_proposed = f"{PREFIX}hsmith_mec_{run_id}_proposed"

    print(f"run_id: {run_id}")
    print(f"logs: {log_dir}")

    current_log_dir = LOGS_ROOT / experiment / "current"
    result_current = _load_result(current_log_dir, "current")

    restore_test_table(table_proposed)
    edges = pickle.loads(CloudFile(edge_file).get())
    edges = np.asarray(edges, dtype=basetypes.NODE_ID)
    result_proposed = run_proposed_stitch(table_proposed, edges)
    _save_run_result(log_dir, "proposed", result_proposed)

    print(f"\ncurrent: {result_current['elapsed']:.1f}s, proposed: {result_proposed['elapsed']:.1f}s")
    match = compare_stitch_results(result_current, result_proposed)

    summary = {
        "run_id": run_id,
        "experiment": experiment,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "edge_file": edge_file,
        "match": match,
        "time_current": result_current["elapsed"],
        "time_proposed": result_proposed["elapsed"],
        "proposed_perf": result_proposed.get("perf", {}),
    }
    with open(log_dir / "summary.json", "w") as f:
        json.dump(_convert_for_json(summary), f, indent=2)

    print(f"\n{'MATCH' if match else 'MISMATCH'}")
    return match, result_current, result_proposed


# ─────────────────────────────────────────────────────────────────────
# Comparison
# ─────────────────────────────────────────────────────────────────────


def compare_stitch_results(result_a: dict, result_b: dict) -> bool:
    ids_match = _compare_new_ids_per_layer(result_a, result_b)
    comp_match = _compare_components(result_a["structure"], result_b["structure"])
    cx_match = _compare_cross_edges(result_a["structure"], result_b["structure"])
    return ids_match and comp_match and cx_match


def _compare_new_ids_per_layer(result_a, result_b):
    lc_a = {int(k): v for k, v in result_a.get("layer_counts", {}).items()}
    lc_b = {int(k): v for k, v in result_b.get("layer_counts", {}).items()}
    all_layers = sorted(set(lc_a.keys()) | set(lc_b.keys()))
    match = True
    for layer in all_layers:
        if lc_a.get(layer, 0) != lc_b.get(layer, 0):
            print(f" NEW IDS MISMATCH layer {layer}: {lc_a.get(layer, 0)} vs {lc_b.get(layer, 0)}")
            match = False
    if match:
        print(f" NEW IDS MATCH: {sum(lc_a.values())} across {len(all_layers)} layers")
    return match


# ─────────────────────────────────────────────────────────────────────
# Persistence helpers
# ─────────────────────────────────────────────────────────────────────


def _save_structure(log_dir, name, structure):
    serializable = {}
    comps = structure.get("components", {})
    serializable["components"] = {
        str(layer): [sorted(c) for c in ccs] for layer, ccs in comps.items()
    }
    cx = structure.get("cross_edges", {})
    serializable["cross_edges"] = {
        str(layer): [[sorted(src), sorted(dst)] for src, dst in pairs]
        for layer, pairs in cx.items()
    }
    with open(log_dir / f"{name}_structure.json", "w") as f:
        json.dump(_convert_for_json(serializable), f, indent=2)


def _save_run_result(log_dir, name, result):
    _save_structure(log_dir, name, result["structure"])
    meta = {k: v for k, v in result.items() if k != "structure"}
    with open(log_dir / f"{name}_meta.json", "w") as f:
        json.dump(_convert_for_json(meta), f, indent=2)


def _load_structure(path):
    with open(path) as f:
        data = json.load(f)
    return {
        "components": {
            int(layer): [frozenset(c) for c in ccs]
            for layer, ccs in data.get("components", {}).items()
        },
        "cross_edges": {
            int(layer): [(frozenset(src), frozenset(dst)) for src, dst in pairs]
            for layer, pairs in data.get("cross_edges", {}).items()
        },
    }


def _load_result(log_dir, name):
    with open(log_dir / f"{name}_meta.json") as f:
        result = json.load(f)
    result["structure"] = _load_structure(log_dir / f"{name}_structure.json")
    return result
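A typical notebook driver for the two entry points in this file might look like the following; the import path assumes the file is compare.py in the stitch_test package, which is an inference from the session notes above:

```python
from pychunkedgraph.debug.stitch_test.compare import (
    run_current_baseline,
    run_proposed_and_compare,
)

# Build (or reuse) the current-path baseline, then run the proposed path
# against a fresh run_id table and compare the resulting structures.
run_current_baseline("single")
match, result_current, result_proposed = run_proposed_and_compare("single")
print(match, result_current["elapsed"], result_proposed["elapsed"])
```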
pychunkedgraph/debug/stitch_test/current.py

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
1+
import os
2+
import time
3+
4+
os.environ["PCG_PROFILER_ENABLED"] = "1"
5+
6+
import numpy as np
7+
8+
import pychunkedgraph.debug.profiler as profiler_mod
9+
from pychunkedgraph.debug.profiler import HierarchicalProfiler
10+
from pychunkedgraph.graph import ChunkedGraph, basetypes
11+
from .utils import extract_structure
12+
13+
14+
def run_current_stitch(graph_id: str, atomic_edges: np.ndarray, do_sanity_check: bool = True) -> dict:
15+
"""
16+
Run the existing add_edges stitch path on a graph copy.
17+
Same calling convention as dist/internal/chunkedgraph/operations.py.
18+
Returns dict with structural result and metadata.
19+
"""
20+
21+
class SilentProfiler(HierarchicalProfiler):
22+
def print_report(self, *a, **kw):
23+
pass
24+
25+
profiler_mod._profiler = SilentProfiler(enabled=True)
26+
27+
atomic_edges = np.asarray(atomic_edges, dtype=basetypes.NODE_ID)
28+
cg = ChunkedGraph(graph_id=graph_id)
29+
30+
print(f" [current] stitch ({len(atomic_edges)} edges)...")
31+
t0 = time.time()
32+
result = cg.add_edges(
33+
user_id="test",
34+
atomic_edges=atomic_edges,
35+
stitch_mode=True,
36+
allow_same_segment_merge=True,
37+
do_sanity_check=do_sanity_check,
38+
)
39+
elapsed = time.time() - t0
40+
new_roots = result.new_root_ids
41+
new_l2_ids = result.new_lvl2_ids
42+
print(f" [current] stitch: {elapsed:.1f}s, {len(new_roots)} roots")
43+
44+
profiler = profiler_mod._profiler
45+
perf = {}
46+
for path, times in profiler.timings.items():
47+
perf[path] = {
48+
"total_ms": sum(times) * 1000,
49+
"calls": profiler.call_counts[path],
50+
"avg_ms": (sum(times) / profiler.call_counts[path]) * 1000,
51+
}
52+
profiler_mod._profiler = HierarchicalProfiler(enabled=False)
53+
54+
t0 = time.time()
55+
structure = extract_structure(cg, new_roots)
56+
print(f" [current] structure: {time.time() - t0:.1f}s")
57+
58+
return {
59+
"structure": structure,
60+
"new_roots": new_roots.tolist(),
61+
"new_l2_ids": [int(x) for x in new_l2_ids],
62+
"operation_id": int(result.operation_id) if result.operation_id else None,
63+
"elapsed": elapsed,
64+
"graph_id": graph_id,
65+
"n_edges": len(atomic_edges),
66+
"layer_counts": {layer: len(ccs) for layer, ccs in structure["components"].items()},
67+
"perf": perf,
68+
}
