feat: early dedup of optimization candidates before benchmark loop#1884

Open
KRRT7 wants to merge 5 commits into main from early-dedup

Conversation


@KRRT7 KRRT7 commented Mar 22, 2026

Summary

  • Dedup candidates in CandidateProcessor when each batch arrives (initial, line profiler, repair, refinement, adaptive) instead of only catching duplicates one-by-one during the benchmark loop
  • Persistent seen_normalized set tracks queued candidates across batches, so duplicates between sources (e.g. initial optimization vs line profiler) are caught even before they're benchmarked
  • Gives accurate candidate counts in progress display ("Candidate X/10" not "X/13")
  • Simplifies handle_duplicate_candidate/register_new_candidate to accept original_flat_code: str instead of full CodeOptimizationContext
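The batch-level dedup described above can be sketched roughly as follows. The names `CandidateProcessor`, `dedup_candidates`, `seen_normalized`, and `original_flat_code` come from this PR, but the signatures, the `Candidate` stand-in, and the whitespace-based `normalize` helper are illustrative assumptions, not the real implementation (which uses language-specific AST normalizers):

```python
# Hypothetical sketch of the early dedup this PR adds; not the real codeflash code.
from dataclasses import dataclass, field


@dataclass
class Candidate:
    optimization_id: str
    source_code: str


def normalize(code: str) -> str:
    # Stand-in for the language-specific AST normalizer.
    return "\n".join(line.strip() for line in code.splitlines() if line.strip())


@dataclass
class CandidateProcessor:
    original_flat_code: str
    seen_normalized: set[str] = field(default_factory=set)

    def dedup_candidates(self, batch: list[Candidate]) -> list[Candidate]:
        """Filter a batch before it reaches the benchmark loop."""
        unique: list[Candidate] = []
        original_norm = normalize(self.original_flat_code)
        for cand in batch:
            norm = normalize(cand.source_code)
            if norm == original_norm:
                continue  # identical to the original: nothing to benchmark
            if norm in self.seen_normalized:
                continue  # duplicate of a previously queued candidate (any batch)
            self.seen_normalized.add(norm)
            unique.append(cand)
        return unique
```

Because `seen_normalized` lives on the processor rather than being rebuilt per batch, a line-profiler candidate that matches an earlier initial-optimization candidate is dropped before it ever inflates the progress count.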

Motivation

Observed in CI run on PR #2485 (codeflash-internal): 3 of 13 candidates were duplicates across sources (initial optimization vs line profiler). While the existing per-candidate dedup correctly skipped benchmarking them, the candidate count was inflated and minor per-candidate overhead was wasted.

Test plan

  • 14 unit tests in tests/test_early_dedup.py covering:
    • Unique candidates pass through
    • Identical-to-original removal
    • Intra-batch and cross-batch dedup
    • Persistent seen_normalized catches dups across batches
    • Result copying for duplicates (success + failure paths)
    • Shorter source code tracking
  • All 1226 existing tests pass (one pre-existing failure, unrelated to this change)


claude bot commented Mar 22, 2026

Claude finished @KRRT7's task in 2m 26s


  • Triage PR size — LARGE (113 lines production code changed)
  • Run lint/typecheck (prek) — passes clean
  • Resolve stale threads — no unresolved threads
  • Review diff
  • Duplicate detection
  • Post summary

PR Review Summary

Prek Checks

uv run prek run --from-ref origin/main passes clean. All 14 new unit tests pass.

Code Review

Bug: is_correct.get() type violation — still present from previous review

File: codeflash/models/models.py:553

self.is_correct[candidate.optimization_id] = self.is_correct.get(past_opt_id)

is_correct is declared dict[str, bool] (line 522), but .get() without a default returns bool | None. This stores None into the dict when past_opt_id was registered but never benchmarked — a type contract violation that could trip up future code expecting bool. This was flagged in the previous review and is still unaddressed.

Fix:

self.is_correct[candidate.optimization_id] = self.is_correct.get(past_opt_id, False)


Nit: unused import pytest in test file — still present from previous review

File: tests/test_early_dedup.py:8

import pytest is imported but never used (no pytest.raises, pytest.mark, etc.). Flagged in the previous review and not yet removed.


Duplicate Detection

No duplicates detected. dedup_candidates is new logic centralized in CandidateProcessor; the normalizer dispatch (normalize_fn) correctly delegates to language-specific implementations, so there's no parallel reimplementation in languages/python/ or languages/javascript/.

Test Coverage

14 new unit tests in tests/test_early_dedup.py cover all dedup paths: unique pass-through, identical-to-original, intra-batch, cross-batch, result copying, and shorter-source tracking. All pass. The aiservice.py and function_optimizer.py coverage is low (14%), but that's expected for integration-level files that require a full orchestration environment.


Overall: The implementation is correct and the design is sound. Two issues from the previous review cycle remain open — the is_correct.get() type violation (minor bug) and the unused import pytest (nit). Both are straightforward one-line fixes.
· Branch: early-dedup

Dedup candidates in CandidateProcessor when each batch arrives (initial,
line profiler, repair, refinement, adaptive) instead of only catching
duplicates one-by-one during the benchmark loop.

Changes:
- Add dedup_candidates() to CandidateProcessor with persistent
  seen_normalized set that tracks queued candidates across batches
- Simplify handle_duplicate_candidate/register_new_candidate to accept
  original_flat_code string instead of full CodeOptimizationContext
- 14 unit tests covering all dedup paths
…alization

Candidates that pass through dedup_candidates already have their
normalized code computed — cache it and pass to process_single_candidate
to skip redundant AST parsing.
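The redundant-parse savings described in that commit can be illustrated with a minimal sketch, assuming the normalizer round-trips code through the AST to erase formatting-only differences. `normalize_code` and the `lru_cache` stand in for the cached-normalization hand-off to `process_single_candidate`; they are not the actual codeflash functions:

```python
# Hedged sketch: AST round-trip normalization with a cache, so code that was
# normalized during dedup is not re-parsed in the benchmark loop.
import ast
from functools import lru_cache


@lru_cache(maxsize=None)
def normalize_code(code: str) -> str:
    # Parsing and unparsing strips comments and normalizes whitespace,
    # so two formatting-only variants compare equal after this step.
    return ast.unparse(ast.parse(code))
```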

KRRT7 commented Mar 22, 2026

@claude fix the issues identified in #1884 (comment)


claude bot commented Mar 22, 2026

Claude finished @KRRT7's task in 1m 13s


Fixed issues from PR review

  • Gather context
  • Fix bug: handle_duplicate_candidate dict access raises KeyError → use .get()
  • Minor: Add comment for intra-batch duplicate silent drop behavior
  • Nit: Add comment explaining dead-code defensive fallbacks in process_single_candidate
  • Run prek to verify — passes clean
  • Commit and push

Changes made

codeflash/models/models.py — bug fix

speedup_ratios, is_correct, and optimized_runtimes now use .get(past_opt_id) instead of direct dict access. Guards against the edge case where a candidate was registered via register_new_candidate but never benchmarked (e.g. unhandled exception in process_single_candidate before record_* was called).

codeflash/languages/function_optimizer.py:307 — comment added

Explains that intra-batch duplicates are intentionally dropped with no result back-fill, since no benchmark results exist yet to copy.

codeflash/languages/function_optimizer.py:1150 — comment added

Explains that the normalized_code == normalized_original and normalized_code in eval_ctx.ast_code_to_id checks are defensive fallbacks that should not fire in normal operation (since dedup_candidates already filters these before the benchmark loop).
· Branch: early-dedup

…comments

- Use .get() for speedup_ratios/is_correct/optimized_runtimes when copying
  results from past_opt_id, guarding against the case where a candidate was
  registered but never benchmarked due to an unhandled exception
- Add comment explaining intra-batch duplicates are intentionally dropped
  with no result back-fill (no results exist yet to copy)
- Add comment explaining the normalized_code checks in process_single_candidate
  are defensive fallbacks that should not fire in normal operation

Co-authored-by: Kevin Turcios <KRRT7@users.noreply.github.com>

KRRT7 commented Mar 22, 2026

@claude review this PR


claude bot commented Mar 22, 2026

Claude finished @KRRT7's task in 2m 43s


PR Review

  • Gather context
  • Read changed files
  • Analyze implementation
  • Check test coverage
  • Post review findings

Overall Assessment

The design is sound and well-implemented. The core invariant — that _handle_empty_queue is only reached when the queue is empty (i.e., all prior candidates have been benchmarked sequentially) — means seen_normalized and ast_code_to_id cover all dedup scenarios without races. Test coverage is thorough.

Two issues worth addressing before merge:


Bug: is_correct.get() introduces a type violation

File: codeflash/models/models.py:553

self.is_correct[candidate.optimization_id] = self.is_correct.get(past_opt_id)

is_correct is declared as dict[str, bool] (line 522), but .get() returns bool | None. If past_opt_id was registered but never benchmarked, this stores None into the dict. The consequence is subtle: codeflash/languages/function_optimizer.py:2894 does if is_correct which works with None (falsy), but it's a type contract violation that could trip up future code expecting bool.

The fix is a one-line change — pass False as the default so the annotation holds:

self.is_correct[candidate.optimization_id] = self.is_correct.get(past_opt_id, False)
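The difference the review is pointing at can be shown in isolation — `dict.get` without a default silently yields `None` for a missing key, which then leaks into a dict annotated `dict[str, bool]` (the dict and keys below are illustrative, not the real model state):

```python
# Minimal illustration of the type-contract issue: .get() without a default
# returns None for a missing key, violating a dict[str, bool] annotation.
is_correct: dict[str, bool] = {"benchmarked": True}

leaked = is_correct.get("never_benchmarked")       # None — not a bool
safe = is_correct.get("never_benchmarked", False)  # False — honors the annotation
```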



Nit: import pytest is unused in tests/test_early_dedup.py

File: tests/test_early_dedup.py:8

import pytest

pytest is imported but never referenced in the file (no pytest.raises, pytest.mark, etc.). With select = ["ALL"] and no F401 in the ignore list in pyproject.toml, this would normally be flagged by ruff. Worth removing to keep the file clean.



Minor: misleading dedup log message for a specific edge case

File: codeflash/languages/function_optimizer.py:307-311

When an initial-batch candidate fails before register_new_candidate is called (e.g., did_update=False at line 1185, or syntax error at line 1189), it is in seen_normalized but not in ast_code_to_id. If an async batch later has an identical candidate, it hits the seen_normalized branch and is counted in removed_duplicate with the log message "duplicates" — not "already-benchmarked duplicates". This is correct behavior (the dup would also fail), but the comment on line 308 ("Intra-batch duplicate") is slightly misleading in this cross-batch scenario. Low priority, no code change needed.


Design observations (no action needed)

  • seen_normalized persists after candidates are benchmarked: This causes async batch duplicates of already-benchmarked initial candidates to be caught by seen_normalized first (before ast_code_to_id), falling through to the silent-drop path instead of the result-copying path. In practice this can't happen because by the time the queue empties and async batches are fetched, all initial candidates are in ast_code_to_id. The order of checks is therefore safe.

  • candidate_len denominator grows mid-run: total_candidates passed to process_single_candidate is processor.candidate_len which grows as async batches arrive. Users will see "Candidate 1/5" then "Candidate 6/8". This is intentional and better than the previous inflated-by-dupes count.

  • Intra-batch seen_normalized check correctly precedes ast_code_to_id check: The ordering matters — ast_code_to_id catches "already benchmarked" dups and copies results; seen_normalized catches "same batch, not yet benchmarked" dups and drops them. The check order correctly handles both.
    · Branch: early-dedup

KRRT7 added 2 commits March 22, 2026 07:10
…hang

_get_valid_candidates had no try/except around OptimizedCandidate
construction. A Pydantic ValidationError from one bad candidate would
crash the entire batch, propagate as an unhandled exception, and block
on executor.shutdown(wait=True) in the finally block.
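The guard this commit describes can be sketched as follows. The real code constructs pydantic `OptimizedCandidate` models; here a plain class raising `ValueError` stands in (pydantic's `ValidationError` subclasses `ValueError`, so the same `except` shape applies), and the function name mirrors the `_get_valid_candidates` mentioned above without claiming its real signature:

```python
# Hedged sketch: per-candidate try/except so one malformed payload is skipped
# instead of crashing the whole batch and blocking executor shutdown.
class OptimizedCandidate:
    def __init__(self, optimization_id: str, source_code: str):
        if not optimization_id or not source_code:
            raise ValueError("invalid candidate payload")
        self.optimization_id = optimization_id
        self.source_code = source_code


def get_valid_candidates(payloads: list[dict]) -> list[OptimizedCandidate]:
    valid: list[OptimizedCandidate] = []
    for payload in payloads:
        try:
            valid.append(OptimizedCandidate(**payload))
        except (TypeError, ValueError):
            # One bad candidate no longer takes down the batch.
            continue
    return valid
```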
