C++ search #210

Draft
ms609 wants to merge 565 commits into main from cpp-search

Conversation

@ms609
Owner

@ms609 ms609 commented Mar 19, 2026

  • other optimizations + features

Manual testing is underway; the Shiny app in particular has some usability issues.

@ms609 ms609 marked this pull request as draft March 25, 2026 14:21
ms609 added 29 commits March 25, 2026 16:38
…ilder

Build random binary tree that satisfies topological constraints by:
1. Ordering constraint splits smallest-to-largest
2. Assigning each tip to its tightest enclosing split
3. Bottom-up: randomly resolve each split's children into binary subtree
4. Wire root-level items (unconstrained tips + top-level splits)

Replaces Wagner fallback (T-214 workaround) for RANDOM_TREE strategy,
restoring uniform random topology sampling diversity under constraints.
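
The four steps above can be sketched as follows. This is a minimal illustration, not the code in this PR: the names SketchNode, SketchTree, random_join, random_constrained_tree_sketch and tips_below are hypothetical, constraint splits are assumed to be given as mutually nested-or-disjoint sets of tip indices, and random pairwise joining is used for step 3 without any claim that it samples topologies uniformly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// Hypothetical tree storage: leaves have tip >= 0, internal nodes have children.
struct SketchNode { int left = -1; int right = -1; int tip = -1; };
struct SketchTree { std::vector<SketchNode> nodes; int root = -1; };

// Resolve a list of node ids into one binary subtree by joining random pairs.
int random_join(SketchTree& t, std::vector<int> items, std::mt19937& rng) {
  assert(!items.empty());
  while (items.size() > 1) {
    std::shuffle(items.begin(), items.end(), rng);
    SketchNode n;
    n.left = items.back();  items.pop_back();
    n.right = items.back(); items.pop_back();
    t.nodes.push_back(n);
    items.push_back(static_cast<int>(t.nodes.size()) - 1);
  }
  return items.front();
}

SketchTree random_constrained_tree_sketch(int n_tip,
                                          std::vector<std::set<int>> splits,
                                          std::mt19937& rng) {
  // Step 1: order constraint splits smallest-to-largest.
  std::sort(splits.begin(), splits.end(),
            [](const std::set<int>& a, const std::set<int>& b) {
              return a.size() < b.size();
            });
  const std::size_t root_slot = splits.size();  // bucket for root-level items
  std::vector<std::vector<int>> children(splits.size() + 1);

  SketchTree t;
  t.nodes.resize(static_cast<std::size_t>(n_tip));
  // Step 2: assign each tip to its tightest (first, hence smallest) split.
  for (int tip = 0; tip < n_tip; ++tip) {
    t.nodes[tip].tip = tip;
    std::size_t owner = root_slot;
    for (std::size_t s = 0; s < splits.size(); ++s) {
      if (splits[s].count(tip)) { owner = s; break; }
    }
    children[owner].push_back(tip);
  }
  // Step 3: bottom-up, resolve each split and hand it to its enclosing split.
  for (std::size_t s = 0; s < splits.size(); ++s) {
    const int sub = random_join(t, children[s], rng);
    std::size_t owner = root_slot;
    for (std::size_t p = s + 1; p < splits.size(); ++p) {
      if (std::includes(splits[p].begin(), splits[p].end(),
                        splits[s].begin(), splits[s].end())) {
        owner = p;
        break;
      }
    }
    children[owner].push_back(sub);
  }
  // Step 4: wire unconstrained tips and top-level splits under the root.
  t.root = random_join(t, children[root_slot], rng);
  return t;
}

// Collect the tips descended from a node (used to check the constraints).
std::set<int> tips_below(const SketchTree& t, int node) {
  const SketchNode& n = t.nodes[node];
  if (n.tip >= 0) return {n.tip};
  std::set<int> out = tips_below(t, n.left);
  const std::set<int> r = tips_below(t, n.right);
  out.insert(r.begin(), r.end());
  return out;
}
```

Every constraint split should then appear as the tip set of some node, which tips_below can verify for a given seed.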

Tests: 916 constraint + 24 random-constrained + 373 simplify/driven pass.
- Clarify MaddisonSlatkin() @examples (show logp intermediate)
- Point FixedDraws overflow error to StepInformation(approx='mc')
- Note MC fallback in MaddisonSlatkin Rd documentation

From uncommitted work on feature/madslatkin-profiling (PR #211).
Cherry-picked from feature/parallel-temper (6dc28a2). Replaces
static extract_divided_steps() copies with shared extract_char_steps()
in TBR, SPR, and drift. NA blocks now use three-pass correction
formula instead of raw local_cost.

ts_temper.cpp already correct (via PR #227 PCSA merge).
- T-196: cherry-picked NA+IW fix from feature/parallel-temper
- T-198-201: closed (PT ruled out; PCSA landed via PR #227)
- Removed PT section from to-do.md
- Deleted feature/parallel-temper and feature/pt-eval (remote+local)
- PT findings preserved in .positai/expertise/pt-evaluation.md
Three issues in impose_one_pass / impose_constraint:

1. Bail-out threshold n_tip/4 was too aggressive. For n_tip=5 the
   threshold was 1, so any split requiring 2+ moves caused an immediate
   bail-out before making any repairs. Raised to n_tip.

2. impose_one_pass returned 0 both for 'no violations' and 'bailed out',
   so impose_constraint couldn't distinguish the two. Now returns -1 on
   bail-out, allowing the caller to know the repair may be incomplete.

3. Documented the root-child limitation: spr_clip() doesn't fully handle
   root children, so impose_one_pass skips them. The fuse path's
   post-repair verification guard (T-214) catches these cases.

Tests: 940 constraint + 308 simplify/driven/sector pass.
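
The return-code change in points 1 and 2 can be illustrated with a small sketch. It is an assumption-laden model rather than the real impose_one_pass: the function name, the RepairStatus enum and the idea of passing a per-split list of required moves are all hypothetical, chosen only to show why a distinct -1 lets the caller tell 'clean' from 'gave up'.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sentinel mirroring the -1 bail-out value described above.
constexpr int kBailedOut = -1;

// Model of one repair pass: moves_needed[i] is the number of moves split i
// would require (0 = already satisfied). Returns the number of splits
// repaired, 0 if there were no violations, or kBailedOut if any split
// needs more moves than max_moves.
int impose_one_pass_sketch(const std::vector<int>& moves_needed, int max_moves) {
  int repaired = 0;
  for (int m : moves_needed) {
    if (m == 0) continue;                    // no violation for this split
    if (m > max_moves) return kBailedOut;    // too costly: give up on the pass
    ++repaired;
  }
  return repaired;
}

// Hypothetical caller-side interpretation of the pass result.
enum class RepairStatus { Clean, Repaired, Incomplete };

RepairStatus classify_pass(int pass_result) {
  if (pass_result == kBailedOut) return RepairStatus::Incomplete;
  return pass_result == 0 ? RepairStatus::Clean : RepairStatus::Repaired;
}
```

With n_tip = 5, the old threshold n_tip/4 evaluates to 1, so a split needing 2 moves bails out; with the threshold raised to n_tip the same split is repaired.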
spr_clip() can't detach root children (root is its own parent, so the
bypass logic fails). All search callers skip root children, but
impose_constraint needs them for constraint repair.

New topology_spr() helper in ts_constraint.cpp handles the root-child
case by absorbing the sibling into root and repurposing it as the
insertion node. No changes to spr_clip or any search caller.

Also adds the build_postorder call that was missing between
individual moves within impose_one_pass: each topology_spr is now
followed by a postorder rebuild so edge enumeration stays valid.

Tests: 942 constraint (90 impose-constraint, incl 2 new root-child
tests) + 308 simplify/driven/sector pass.
T-208/T-211: random_constrained_tree() and impose_constraint() fixes
Benchmarked perturbStopFactor across 10 morphobank/inapplicable datasets
(23-213 tips). Key findings:

- PSF=2 gives 2.4-6.9x speedup on converged searches with zero score loss
- Complementary to targetHits: on hard landscapes where few replicates
  hit the best score, PSF fires first; on easy landscapes targetHits
  fires first (PSF is irrelevant)
- PSF=5 provides smaller speedups and is too conservative for large trees

Changed default from 0 (disabled) to 2 in SearchControl() and compat
wrapper. Updated docs and fixed timeout test (needs PSF=0 to test the
timeout path, not convergence).
T-187: Perturbation-count stopping rule
Infrastructure for indirect scoring optimization:

1. FlatBlock struct (24 bytes/block vs 288 bytes in CharBlock) packs
   hot-loop metadata (offset, n_states, active_mask, has_inapplicable)
   for cache-friendly access. Populated at build_dataset() time.

2. Flat indirect scoring functions (EW and NA-aware variants) that use
   FlatBlock and skip upweight_mask/weight overhead. Available as
   fitch_indirect_{bounded,cached}_flat and fitch_na_indirect_
   {bounded,cached}_flat. NOT wired into search dispatch — see below.

3. Software prefetch in TBR rerooting inner loop: prefetch vroot_cache
   entry 2 iterations ahead. At 180+ tips (vroot_cache ~140 KB, L2),
   this hides ~10 cycle L2 latency. Negligible overhead at small sizes
   where vroot_cache fits in L1.
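
As an illustration of items 1 and 3, the sketch below packs the four named fields into 24 bytes and prefetches two iterations ahead in a simple loop. The field widths, field order, the FlatBlockSketch and sum_with_prefetch names, and the stand-in cache of 64-bit words are assumptions; the actual FlatBlock layout and the TBR rerooting loop are not shown in this commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout: 8 + 4 + 4 + 1 bytes, padded to 24 by the 8-byte alignment.
struct alignas(8) FlatBlockSketch {
  std::uint64_t active_mask;       // which character slots are in use
  std::uint32_t offset;            // start of this block's state words
  std::uint32_t n_states;          // number of states in the block
  std::uint8_t has_inapplicable;   // non-zero if the block needs NA handling
};
static_assert(sizeof(FlatBlockSketch) == 24, "sketch assumes 24 bytes/block");

// Portable wrapper: a hint only, so it compiles to nothing elsewhere.
inline void prefetch_read(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
  __builtin_prefetch(p, 0, 3);
#else
  (void)p;
#endif
}

// Visit cache entries in an irregular order, requesting the entry needed
// two iterations ahead so its load overlaps with the current work.
std::uint64_t sum_with_prefetch(const std::vector<std::uint64_t>& cache,
                                const std::vector<std::size_t>& order) {
  constexpr std::size_t kAhead = 2;
  std::uint64_t total = 0;
  for (std::size_t i = 0; i < order.size(); ++i) {
    if (i + kAhead < order.size()) {
      prefetch_read(&cache[order[i + kAhead]]);
    }
    total += cache[order[i]];
  }
  return total;
}
```

The prefetch never changes the result; it only affects timing, which is why any benefit has to be established by benchmarking at the tree sizes of interest.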

Benchmarking notes (Agnarsson 62t, Zhu 75t, Dikow 88t, 10 seeds each):
Flat dispatch (ternary or function pointer) showed no measurable benefit
at these sizes — hardware prefetching of the sequential CharBlock array
is already effective, and the dispatch overhead (extra branch or indirect
call) marginally increases code complexity in the hot path. System-level
timing variance on the test machine is ±15-30%, masking any sub-10% gain.

The flat functions are retained as available infrastructure for large-tree
optimization (180+ tips) where CharBlock cache traffic may become
significant. They can be wired in via function pointers when a 180+ tip
benchmark is available for validation.

All 2877 ts-* tests pass with identical scores.
TNT finds better XPIWE trees than TreeSearch on Vinther2008
(3.79283 vs 3.80000). TreeSearch optimal tree has no characters
with both missing data and h>=2, so XPIWE=IW on that tree,
but the search should explore differently.
…to WORDLIST

Pre-existing issues blocking GHA on cpp-search HEAD.
Regenerated via roxygen2::roxygenise(load_code = load_installed).
Hamilton HPC benchmark (mbank_X30754, 180t, EPYC 7702, 5 seeds):
- AC=1: 400ms/rep, 40% within-replicate hit rate
- AC=3: 1370ms/rep, 21% hit rate, no significant score gain (p>0.5)
- AC=0 vs AC=3: also no significant difference
AC=1 saves ~1s/rep (~6% of 17s median) with no quality loss.

Also adds hamilton-hpc project skill for remote benchmarking docs.
TNT is a 32-bit i386 binary with no SIMD and a 64KB LUT popcount.
TreeSearch has a ~4x throughput advantage (128-bit SSE2).
TNT's 3-5x faster convergence is strategic, not an implementation effect.
Added T-249..T-253 investigation tasks to to-do.md.
Same approach as TreeDist::popcnt64: emit popcnt instruction via inline
asm on GCC/Clang x86-64, __popcnt64 on MSVC. Software Hamming weight
fallback for non-x86-64 platforms. CRAN-compatible (no compile flag change).

Old: __builtin_popcountll → compiler emits ~10-instruction shift-mask
New: single popcnt instruction (92 occurrences in compiled DLL)
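
A minimal sketch of the three-way dispatch described above, assuming hypothetical function names (popcnt64_soft, popcnt64_sketch). It performs no runtime CPU detection, so the inline-asm path assumes the processor supports popcnt; whether the real code adds such a check is not stated here.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#endif

// Software Hamming weight: the classic shift-and-mask reduction.
inline int popcnt64_soft(std::uint64_t x) {
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
  return static_cast<int>((x * 0x0101010101010101ULL) >> 56);
}

inline int popcnt64_sketch(std::uint64_t x) {
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
  // Emit the instruction directly, so no -mpopcnt compile flag is needed.
  std::uint64_t r;
  __asm__("popcnt %1, %0" : "=r"(r) : "r"(x) : "cc");
  return static_cast<int>(r);
#elif defined(_MSC_VER) && defined(_M_X64)
  return static_cast<int>(__popcnt64(x));
#else
  return popcnt64_soft(x);  // non-x86-64 platforms
#endif
}
```

Both paths should agree on every input, which makes the software version a convenient cross-check in tests.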

Also updated T-251 description to focus on candidates-per-improvement.
The TNT download page labels the Windows build as '[32 bits]'.
The ~4x throughput advantage finding applies only to the Windows
32-bit binary and should not be generalized to Hamilton (Linux 64-bit)
benchmarks.
… identified

Trajectory comparison on 3 gap datasets (Geisler2001, Zhu2013, Wortley2006)
at 30s budgets, 3 seeds each. Key findings:

1. Drift consumes 16-23% of wall time but gains <1% of score improvement
   (405-1498 ms/step vs 0.4-44 ms/step for other phases). 30-170x less
   efficient than the next-worst phase (ratchet).

2. TNT evaluates 1.5-3.6x more rearrangements/second than TreeSearch
   despite 32-bit scalar architecture vs TreeSearch's SSE2. Per-evaluation
   overhead in data structure management negates the SIMD advantage.

3. TNT's xmult does extensive intra-replicate sectorial search (~67% of
   trajectory entries are SECT), while TreeSearch does one XSS+RSS+CSS
   pass per outer cycle (6-10% of time).

4. TNT achieves 10-16 steps better per-replicate median scores.

Recommendations: eliminate drift from default preset (save ~20% time),
increase sectorial search rounds, reduce per-evaluation overhead.

Files: bench_trajectory.R (comparison script), trajectory_results.rds
(raw data), tnt_trajectory_analysis.md (full write-up).
ms609 added 30 commits March 28, 2026 17:41
…eline

5 large-tree datasets (131-206 tips), 3 configs, 2 budgets, 10 seeds = 300 runs.
Builds from feature/tbr-batch for pruneReinsertNni parameter.
… decision

Earlier comment described Stage 1 benchmark showing -14.7 steps improvement,
which was misleading — Stage 4 multi-dataset testing (131-206t) found the
per-rep overhead was too high (0 replicates at 206t/60s), so pruneReinsertCycles
was set to 0. Clarify the rationale and decision.
F-008: Fix constrained drift constraint staleness (T-279)
F-T-245: TBR 4-wide candidate batching (EW flat path)
Stage 4 results analysed (G-001): syab07205/206t starvation at 60s from
full-TBR polish per PR cycle (~7s x 5 = 35s overhead). Agent E implemented
pruneReinsertNni fix on feature/tbr-batch; Stage 5 scripts uploaded and
submitted to Hamilton (SLURM 16622224, ~4-6h).
Hamilton SLURM 16622421 (7h, EPYC 7702). 5 large-tree datasets
(131-206t), 20 seeds, 60/120s budgets, EW scoring.

pr_nni: wins 7/10 expected-best conditions. Huge benefit on
project3701 (146t, -178 median at 60s). Modest at 173-180t.
Slight regression at 206t (+12-34 EB).

pr_tbr: harmful (1/9 wins; total starvation at 206t/60s).

Decision: not enabled in large preset. Available via SearchControl().
Tier guidance: 5 (smoke), 10 (screening), 20 (comparison), 30 (definitive).
Calibrated from T-289f Stage 5 empirical significance results.
Cross-reference added to AGENTS.md.
Add ClipOrder enum, TBRPassRecord struct, and per-pass diagnostic
counters to tbr_search() (guarded behind TBRParams::diagnostics=true).
Add ts_tbr_diagnostics() Rcpp bridge returning per-pass data frame.
Add order_clips() helper implementing RANDOM/INV_WEIGHT/TIPS_FIRST/BUCKET
strategies (Phase 2 infrastructure, disabled by default).
Add diag_clip_ordering.R to characterise baseline behaviour.
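
For orientation, a clip-ordering helper of this kind might look like the sketch below. Only RANDOM and TIPS_FIRST are shown, because the definitions of INV_WEIGHT and BUCKET are not given in this message; ClipSketch, ClipOrderSketch and order_clips_sketch are hypothetical names, and the real order_clips() signature may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Hypothetical subset of the ordering strategies named above.
enum class ClipOrderSketch { RANDOM, TIPS_FIRST };

// Hypothetical clip candidate: the clipped node and its subtree size in tips.
struct ClipSketch { int node; int n_tips; };

// Shuffle the candidates, then optionally move single-tip clips to the front
// while keeping the shuffled order within each group.
void order_clips_sketch(std::vector<ClipSketch>& clips, ClipOrderSketch order,
                        std::mt19937& rng) {
  std::shuffle(clips.begin(), clips.end(), rng);
  if (order == ClipOrderSketch::TIPS_FIRST) {
    std::stable_partition(clips.begin(), clips.end(),
                          [](const ClipSketch& c) { return c.n_tips == 1; });
  }
}
```

Shuffling before partitioning keeps the order random within the tip and non-tip groups, so only the priority between groups changes.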

Diagnostic results (10 seeds × 4 datasets, random Wagner starts):
  Tip-clip enrichment in productive passes: 0.43–0.76×
  Tip clips (~51% of all clips) account for only 22–38% of accepted moves.
  Medium-small clips (2..sqrt(n)) appear most productive.

CONCLUSION (Phase 4): the small/tip-first hypothesis is FALSIFIED.
All three proposed variants (INV_WEIGHT, TIPS_FIRST, BUCKET) favour
tip clips, which are the LEAST productive clip type. Phase 2–3 skipped.
Branch will be closed after coordination notes are updated.
Phase 1 diagnostic completed 2026-03-29. Hypothesis falsified:
tip clips are UNDER-represented in TBR acceptances (0.43-0.76x
enrichment across 4 datasets). Medium-small clips most productive.
All three ordering variants (inv-weight, tips-first, bucket) favour
tips — counterproductive. Branch feature/weighted-clip-order closed.
See completed-tasks.md entry PA-001 and AGENTS.md item 12.
5 datasets (62-180t), 20 seeds, EW/IW10/IW3. IW hypothesis weak signal
(closed). Real finding: XSS benefit scales with tree size. At 180t:
TAEB delta -6.8 to -9.8 EW steps (12-19% overhead). At ≤88t: zero
TAEB benefit. No preset change needed.
Stage 5 benchmark (SLURM 16622483, EPYC 7702, 5 datasets 131-206t,
10 seeds, 60s+120s) showed pr_nni (NNI full-tree polish) fixes the
Stage 4 showstopper (0 reps at 206t/60s) while improving 131-180t:

  project3701 (146t): -178 steps at 60s, -128 at 120s
  project804  (173t): -9 / -2 steps
  mbank_X30754(180t): -4 / -7 steps
  syab07205   (206t): +17.5 at 60s, neutral at 120s

Enable in large preset: pruneReinsertCycles=5L, pruneReinsertNni=TRUE.
Update AGENTS.md and completed-tasks.md. Results in
dev/benchmarks/t289f_pr_nni_polish.csv.
…_search

When params.nni_full is true but a ConstraintData is active, guard
falls through to TBR (which enforces constraints). One-line change
mirroring the nni_wagner guard in ts_driven.cpp. Only affects users
who combine pruneReinsertNni=TRUE with topological constraints; no
preset does this.
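
The guard amounts to a one-line condition, sketched below with stand-in types. ConstraintDataSketch, PolishParamsSketch, PolishMethod and choose_polish are hypothetical, and treating a constraint as 'active' when it holds at least one split is an assumption about how the real check is expressed.

```cpp
#include <cassert>

// Stand-in types: only the fields needed for the guard are modelled.
struct ConstraintDataSketch { int n_splits = 0; };
struct PolishParamsSketch { bool nni_full = false; };

enum class PolishMethod { NNI, TBR };

// Use NNI polish only when requested AND no constraint is active;
// otherwise fall through to TBR, which enforces constraints.
PolishMethod choose_polish(const PolishParamsSketch& params,
                           const ConstraintDataSketch* cd) {
  const bool constrained = (cd != nullptr) && (cd->n_splits > 0);
  if (params.nni_full && !constrained) return PolishMethod::NNI;
  return PolishMethod::TBR;
}
```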

Also: S-COORD round 46 (task queue, PR status), to-do cleanup.
Agents now check remote-jobs.md at /assign time (new step 4) for
retrievable results before claiming tasks. Prevents SLURM results
from being silently lost across conversation boundaries.
C++ instrumentation of tbr_search() with post-acceptance sector-masked
TBR on clip subtree. Hit rate ~35% regardless of scoring mode (no
IW-specific benefit), but NET HARMFUL: disrupts global TBR trajectory.
mbank_X30754 EW: +17 to +34 steps TAEB at 30-120s. Validates existing
pipeline design (XSS as separate post-convergence phase). Closed.
Phase 1 (a159311) added diagnostic instrumentation and the TIPS_FIRST,
INV_WEIGHT, BUCKET, ANTI_TIP, LARGE_FIRST ordering variants to ts_tbr.cpp.
Phase 2 completes the implementation:

Bug fix: clip_order was only propagated to the initial TBR and final TBR
polish (~10% of replicate time). The ratchet and all sectorial TBR calls
defaulted to RANDOM, making the ordering variants effectively inert for
the dominant phase (ratchet ~76%).

Fix: add clip_order field to RatchetParams and SectorParams, propagate
from SearchControl through ts_driven.cpp into every TBR call site in
ts_ratchet.cpp and ts_sector.cpp (6 sites + search_sector signature).

Empirical validation (5 seeds, 30s, default config):
  Agnarsson2004 (62t, default preset): TIPS_FIRST -2%, INV_WEIGHT neutral
  Zhu2013       (75t, thorough preset): TIPS_FIRST +13%, INV_WEIGHT +9%
  Dikow2009     (88t, thorough preset): TIPS_FIRST +8%, INV_WEIGHT +3%

Theoretical model (Poisson bucket, corrected): TIPS_FIRST saves ~48%
per productive TBR pass at 88t; practical throughput gain is ~8-13%
because null passes (ordering-invariant, exhaust all clips) dilute savings.

Benefit is dataset-size dependent:
  < ~65t: tip enrichment is low (Agnarsson2004: 0.43); TIPS_FIRST neutral
  65-120t (thorough): tip enrichment moderate; TIPS_FIRST +8-13%

No preset defaults changed yet — pending GHA 10-seed validation.
bench_clip_ordering.R contains the full benchmark driver.
The SearchControl.Rd usage section was generated from an old installed
build (missing clipOrder and many parameters added since). The codoc
check correctly flagged the mismatch.

- Added @param clipOrder documentation in R/SearchControl.R
- Regenerated man/SearchControl.Rd with correct \usage and \item{clipOrder}
 TBR clip-ordering strategy (SearchControl clipOrder)