From d53dfb44197fc2400a516e2baf8c16eed115106d Mon Sep 17 00:00:00 2001
From: Gabrielle Gauthier-Melancon <gabrielle.gm@servicenow.com>
Date: Tue, 12 May 2026 14:53:53 -0400
Subject: [PATCH 1/3] Improve speech fidelity prompt for interruption tags
 interpretation

---
 configs/prompts/judge.yaml | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
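
Reviewer note: the "scope of text before this tag" rule added below can be sketched in code. This is only an illustration under assumptions, not part of the patch: `possibly_unspoken_regions` and `CUT_OFF_TAGS` are hypothetical names, and the sketch assumes the "previous resumption point" is the position right after the preceding cut-off tag. It ignores `[user interrupts]`, which the prompt does not list among the four cut-off tags.

```python
import re

# The four cut-off tags named in the prompt text of this patch.
CUT_OFF_TAGS = (
    "[likely cut off by user]",
    "[likely cut off by assistant]",
    "[speaker likely cut itself off]",
    "[likely interruption]",
)
_TAG_RE = re.compile("|".join(re.escape(tag) for tag in CUT_OFF_TAGS))


def possibly_unspoken_regions(turn_text: str) -> list[str]:
    """Return each region that a cut-off tag marks as possibly not spoken.

    A region runs from the start of the turn (or from just after the previous
    cut-off tag, assumed here to be the resumption point) up to the tag.
    Text after the last tag is not included.
    """
    regions: list[str] = []
    start = 0
    for match in _TAG_RE.finditer(turn_text):
        region = turn_text[start : match.start()].strip()
        if region:
            regions.append(region)
        start = match.end()
    return regions


turn = "Your confirmation code is ABC123. The total is $45. [likely cut off by user] Sure, go ahead."
regions = possibly_unspoken_regions(turn)
```

In this example the whole two-sentence span before the tag, including both entities, is one region; the text after the tag is treated as spoken.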

diff --git a/configs/prompts/judge.yaml b/configs/prompts/judge.yaml
index e6a17249..a4076424 100644
--- a/configs/prompts/judge.yaml
+++ b/configs/prompts/judge.yaml
@@ -14,9 +14,11 @@ _shared:
     - `[user interrupts]` — The user started speaking while the assistant was still talking. As a prefix on user text, it marks the start of overlapping user speech. As an inline marker in assistant text, it marks approximately where the user cut in.
     - `[likely cut off by user]` — Appears in assistant text. The assistant's speech was probably cut off by the user starting to speak. Text before this tag may not have been fully spoken. Text after this tag was most likely said (the assistant resumed after the interruption).
     - `[likely cut off by assistant]` — Appears in user text. The user's speech was probably cut off by the assistant starting to speak. Text before this tag may not have been fully spoken.
-    - `[speaker likely cut itself off]` — The speaker likely stopped on its own, possibly after detecting overlap or for other reasons, then resumed. Text before this tag may not have been fully spoken. Text after is what the speaker said after resuming.
+    - `[speaker likely cut itself off]` — The speaker likely stopped on its own, possibly after detecting overlap or for other reasons. If the tag appears mid-turn, text after it is what the speaker said after resuming. If the tag appears at the end of the turn, the speaker did not resume in this turn. Text before this tag may not have been fully spoken.
     - `[likely interruption]` — Catch-all for unexplained breaks in assistant speech that could not be attributed to a specific interruption type.
 
+    **Scope of "text before this tag":** For all four cut-off tags (`[likely cut off by user]`, `[likely cut off by assistant]`, `[speaker likely cut itself off]`, `[likely interruption]`), "text before this tag" means **all text from the start of the turn (or from the previous resumption point) up to the tag** — not just the few adjacent words. This region can span multiple sentences and contain multiple entities (confirmation codes, dollar amounts, names, etc.). Any or all of it may have been silently dropped from the audio. The tag is the only signal you have about where speech was actually cut; do not assume the cut-off was small.
+
 judge:
   conciseness:
     user_prompt: |
@@ -653,7 +655,7 @@ judge:
 
         The tags tell you that certain portions of the intended text were likely never spoken, because the speaker was interrupted or cut themselves off. Do NOT penalize for missing words that fall in a region the tags indicate was not spoken.
 
-        **Key principle:** If a tag indicates that a section of text was likely not spoken aloud (due to interruption or cut-off), do NOT penalize for those words being missing from the audio. Only evaluate fidelity for words that were reasonably expected to have been spoken.
+        **Key principle:** If a tag indicates that a section of text was likely not spoken aloud (due to interruption or cut-off), do NOT penalize for those words being missing from the audio — **including entity words** (confirmation codes, dollar amounts, names, etc.). A turn where the only missing content falls inside a cut-off region is rating = 1, even if that missing content contains critical entities.
 
         ## Evaluation Criteria
 
@@ -683,6 +685,7 @@ judge:
         - Slight pacing or prosody differences
         - Non-spoken tags: [slow], [firm], [annoyed], and all interruption tags listed above
         - Words in regions flagged by interruption tags as likely not spoken
+        - **Adjacent-turn drift:** turn boundaries are derived from imperfect timing/event heuristics, so a sentence may land in the wrong adjacent turn (turn N's intended text actually spoken in turn N-1 or N+1, or vice versa). If content from turn N appears to have been spoken in an adjacent turn, treat it as fidelity-preserved rather than missing. Only mark missing if the content does not appear in this turn **or** in either neighboring turn's audio.
 
         ## Rating Scale (per turn)
         - **1 (High Fidelity)**: All entities are spoken correctly. Non-entity words are faithfully reproduced with no meaningful omissions or additions.

From bd8a7a067156304c39bfffe325a7f894aed78fa5 Mon Sep 17 00:00:00 2001
From: Gabrielle Gauthier-Melancon <gabrielle.gm@servicenow.com>
Date: Tue, 12 May 2026 16:27:30 -0400
Subject: [PATCH 2/3] Categorize agent_speech_fidelity errors into 4 failure
 modes

Surface entity_error, truncation, garbled_hallucination, and
insertion_hallucination as per-record rate sub-metrics on
agent_speech_fidelity, so failure analysis can attribute score drops to
a specific TTS pathology rather than to a single opaque score. The judge
now emits a failure_modes list for each low-fidelity turn;
SpeechFidelityBaseMetric extracts it, and a new
build_per_category_rate_sub_metrics helper (also used by conciseness)
turns those tags into flagged_turns/rated_turns rates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 configs/prompts/judge.yaml                    |  21 +++-
 .../metrics/accuracy/agent_speech_fidelity.py |  30 ++++-
 src/eva/metrics/experience/conciseness.py     |  26 ++--
 src/eva/metrics/speech_fidelity_base.py       |  22 ++++
 src/eva/metrics/utils.py                      |  45 +++++++
 tests/fixtures/metric_signatures.json         |  10 +-
 tests/unit/metrics/test_speech_fidelity.py    | 115 ++++++++++++++++++
 7 files changed, 244 insertions(+), 25 deletions(-)
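
Reviewer note: the rate computation described above can be sketched standalone. This is a simplified illustration, not the helper added in this patch: `category_rates` is a hypothetical name, and it returns plain dicts instead of `MetricScore` objects so that it runs without the `eva` package.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping, Sequence


def category_rates(
    categories: Sequence[str],
    rated_turn_ids: Iterable[int],
    per_turn_categories: Mapping[int, list[str] | None],
) -> dict[str, dict]:
    """Compute flagged_turns / rated_turns for each known category.

    Tags that are not in ``categories`` are ignored; an empty set of rated
    turns yields an empty result.
    """
    rated = list(rated_turn_ids)
    if not rated:
        return {}
    rates: dict[str, dict] = {}
    for category in categories:
        # A turn is flagged when the judge listed this category for it.
        flagged = [tid for tid in rated if category in (per_turn_categories.get(tid) or [])]
        rates[f"{category}_rate"] = {
            "score": len(flagged) / len(rated),
            "count": len(flagged),
            "num_rated": len(rated),
            "turn_ids": flagged,
        }
    return rates


MODES = ("entity_error", "truncation", "garbled_hallucination", "insertion_hallucination")
rates = category_rates(
    MODES,
    rated_turn_ids=[0, 1],
    per_turn_categories={0: ["entity_error", "truncation", "something_new"], 1: []},
)
```

With two rated turns and one of them flagged, `entity_error_rate` and `truncation_rate` come out at 0.5 and the other two modes at 0.0, which mirrors the expectations in the unit tests of this patch. The invented `something_new` tag produces no rate.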

diff --git a/configs/prompts/judge.yaml b/configs/prompts/judge.yaml
index a4076424..8fbb2073 100644
--- a/configs/prompts/judge.yaml
+++ b/configs/prompts/judge.yaml
@@ -691,15 +691,32 @@ judge:
         - **1 (High Fidelity)**: All entities are spoken correctly. Non-entity words are faithfully reproduced with no meaningful omissions or additions.
         - **0 (Low Fidelity)**: One or more entity errors, OR significant non-entity word errors that change the meaning of the turn.
 
+        ## Failure Modes (only when rating = 0)
+        For each low-fidelity turn, tag every failure mode that applies. A turn may have multiple failure modes. Leave the list empty when rating = 1.
+
+        - **entity_error** — A TTS-critical entity (confirmation code, dollar amount, name, date, time, flight/seat number, reference ID, etc.) was rendered incorrectly in the audio. Use this whenever the intended text shows an entity at that position, regardless of *how* it went wrong: a wrong digit/letter, a missing character in a spelled-out code, a value swapped for a different one, OR sounds that are garbled / slurred / unintelligible in place of an entity. The intended text tells you whether a given position is an entity — if it is, any defect there is `entity_error`, not `garbled_hallucination`.
+
+        - **truncation** — Expected content from the intended text is missing in the audio, and the missing region is **NOT** covered by an interruption tag (`[likely cut off ...]`, `[user interrupts]`, `[speaker likely cut itself off]`, `[likely interruption]`). Apply only when the speaker silently dropped content that the tags do not explain. Missing content before a tag is fidelity-preserved by design and should NOT be flagged here.
+
+        - **garbled_hallucination** — The audio contains speech-like sounds in place of expected **non-entity** words, but the sounds are distorted, slurred, or unintelligible enough that the listener cannot reliably parse what was said. The TTS produced noise/non-words rather than a clean rendering of the intended text. Use this category only for non-entity regions — if the garbled region corresponds to an entity in the intended text, use `entity_error` instead.
+
+        - **insertion_hallucination** — The audio contains words or phrases that were NOT in the intended text. The TTS added content on its own — extra sentences, repeated phrases, or filler that the script did not contain.
+
+        Tagging guidance:
+        - If a turn has both an entity error and missing/dropped content outside tagged regions, list both `entity_error` and `truncation`.
+        - Garbled audio at an entity position is `entity_error`, not `garbled_hallucination` — check the intended text first to decide whether the affected region is an entity.
+        - Do NOT use `truncation` for content lost to interruption tags, even if the interruption tags are not at the exact location where the truncation occurred; that is expected behavior, not an error.
+
         ## Response Format
         Respond with a JSON object. Each turn entry must include the turn_id matching the turn number shown in the Intended Turns above:
         {{
           "turns": [
             {{
               "turn_id": <int: the turn number from the Intended Turns>,
-              "transcript": <string: your transcription of the audio for this turn, use only the audio for this not the intended text>
+              "transcript": <string: your transcription of the audio for this turn, use only the audio for this not the intended text>,
               "explanation": "<string: 1-3 sentence analysis of fidelity for this turn, citing specific intended vs actual mismatches, noting any regions skipped due to interruption flags>",
-              "rating": <0 or 1>
+              "rating": <0 or 1>,
+              "failure_modes": <list of strings: zero or more of "entity_error", "truncation", "garbled_hallucination", "insertion_hallucination". Must be [] when rating = 1.>
             }}
           ]
         }}
diff --git a/src/eva/metrics/accuracy/agent_speech_fidelity.py b/src/eva/metrics/accuracy/agent_speech_fidelity.py
index d84179f7..721d13cc 100644
--- a/src/eva/metrics/accuracy/agent_speech_fidelity.py
+++ b/src/eva/metrics/accuracy/agent_speech_fidelity.py
@@ -1,7 +1,17 @@
 """Agent speech fidelity metric using audio + LLM judge (Gemini)."""
 
+from eva.metrics.base import MetricContext
 from eva.metrics.registry import register_metric
 from eva.metrics.speech_fidelity_base import SpeechFidelityBaseMetric
+from eva.metrics.utils import build_per_category_rate_sub_metrics
+from eva.models.results import MetricScore
+
+_SPEECH_FIDELITY_FAILURE_MODES = (
+    "entity_error",
+    "truncation",
+    "garbled_hallucination",
+    "insertion_hallucination",
+)
 
 
 @register_metric
@@ -14,9 +24,27 @@ class AgentSpeechFidelityMetric(SpeechFidelityBaseMetric):
     """
 
     name = "agent_speech_fidelity"
-    version = "v0.1"
+    version = "v0.2"
     description = "Audio-based evaluation of agent speech fidelity to the intended text"
     category = "accuracy"
     role = "assistant"
     rating_scale = (0, 1)
     pass_at_k_threshold = 0.95
+
+    def build_sub_metrics(
+        self,
+        context: MetricContext,
+        per_turn_ratings: dict[int, int | None],
+        per_turn_failure_modes: dict[int, list[str]],
+    ) -> dict[str, MetricScore] | None:
+        """Surface one sub-metric per failure mode: rate = flagged turns / rated turns."""
+        rated_turn_ids = [tid for tid, r in per_turn_ratings.items() if r is not None]
+        return (
+            build_per_category_rate_sub_metrics(
+                parent_name=self.name,
+                categories=_SPEECH_FIDELITY_FAILURE_MODES,
+                rated_turn_ids=rated_turn_ids,
+                per_turn_categories=per_turn_failure_modes,
+            )
+            or None
+        )
diff --git a/src/eva/metrics/experience/conciseness.py b/src/eva/metrics/experience/conciseness.py
index 2b4bc196..5ac7eab3 100644
--- a/src/eva/metrics/experience/conciseness.py
+++ b/src/eva/metrics/experience/conciseness.py
@@ -4,7 +4,7 @@
 
 from eva.metrics.base import MetricContext, PerTurnConversationJudgeMetric
 from eva.metrics.registry import register_metric
-from eva.metrics.utils import make_rate_sub_metric
+from eva.metrics.utils import build_per_category_rate_sub_metrics
 from eva.models.results import MetricScore
 
 _CONCISENESS_FAILURE_MODES = (
@@ -70,21 +70,13 @@ def build_sub_metrics(
     ) -> dict[str, MetricScore] | None:
         """Surface one sub-metric per failure mode, rate = flagged turns / rated turns."""
         rated_turn_ids = [tid for tid, r in per_turn_ratings.items() if r is not None]
-        num_rated = len(rated_turn_ids)
-        if num_rated == 0:
-            return None
-
-        sub_metrics: dict[str, MetricScore] = {}
-        for mode in _CONCISENESS_FAILURE_MODES:
-            flagged_ids = [
-                tid for tid in rated_turn_ids if mode in (per_turn_extra.get(tid, {}).get("failure_modes") or [])
-            ]
-            sub_key = f"{mode}_rate"
-            sub_metrics[sub_key] = make_rate_sub_metric(
+        per_turn_failure_modes = {tid: extra.get("failure_modes") or [] for tid, extra in per_turn_extra.items()}
+        return (
+            build_per_category_rate_sub_metrics(
                 parent_name=self.name,
-                key=sub_key,
-                numerator=len(flagged_ids),
-                denominator=num_rated,
-                details={"count": len(flagged_ids), "num_rated": num_rated, "turn_ids": flagged_ids},
+                categories=_CONCISENESS_FAILURE_MODES,
+                rated_turn_ids=rated_turn_ids,
+                per_turn_categories=per_turn_failure_modes,
             )
-        return sub_metrics
+            or None
+        )
diff --git a/src/eva/metrics/speech_fidelity_base.py b/src/eva/metrics/speech_fidelity_base.py
index c6774967..556bd30e 100644
--- a/src/eva/metrics/speech_fidelity_base.py
+++ b/src/eva/metrics/speech_fidelity_base.py
@@ -77,6 +77,7 @@ async def compute(self, context: MetricContext) -> MetricScore:
             per_turn_explanations: dict[int, str] = {}
             per_turn_transcripts: dict[int, str] = {}
             per_turn_normalized: dict[int, float] = {}
+            per_turn_failure_modes: dict[int, list[str]] = {}
             tts_turn_ids = sorted(intended_turns.keys())
             min_rating, max_rating = self.rating_scale
             valid_ratings_range = list(range(min_rating, max_rating + 1))
@@ -108,17 +109,22 @@ async def compute(self, context: MetricContext) -> MetricScore:
                 rating = response_item.get("rating")
                 transcript = response_item.get("transcript")
                 explanation = response_item.get("explanation", "")
+                failure_modes = response_item.get("failure_modes") or []
+                if not isinstance(failure_modes, list):
+                    failure_modes = []
 
                 if rating not in valid_ratings_range:
                     self.logger.warning(f"[{context.record_id}] Invalid rating {rating} for turn {turn_id}")
                     per_turn_ratings[turn_id] = None
                     per_turn_explanations[turn_id] = f"Invalid rating: {rating}"
+                    per_turn_failure_modes[turn_id] = failure_modes
                     continue
 
                 per_turn_ratings[turn_id] = rating
                 per_turn_explanations[turn_id] = explanation
                 per_turn_transcripts[turn_id] = transcript
                 per_turn_normalized[turn_id] = normalize_rating(rating, min_rating, max_rating)
+                per_turn_failure_modes[turn_id] = failure_modes
 
             aggregated_score = aggregate_per_turn_scores(list(per_turn_normalized.values()), self.aggregation)
 
@@ -132,18 +138,22 @@ async def compute(self, context: MetricContext) -> MetricScore:
                 "audio_trimmed": self.trim_silence,
                 "per_turn_ratings": per_turn_ratings,
                 "per_turn_explanations": per_turn_explanations,
+                "per_turn_failure_modes": per_turn_failure_modes,
                 "judge_prompt": prompt,
                 "judge_raw_response": response_text,
             }
             if min_rating != 0 or max_rating != 1:
                 details["per_turn_normalized"] = per_turn_normalized
 
+            sub_metrics = self.build_sub_metrics(context, per_turn_ratings, per_turn_failure_modes)
+
             return MetricScore(
                 name=self.name,
                 score=round(avg_rating, 3),
                 normalized_score=round(aggregated_score, 3) if aggregated_score is not None else 0,
                 details=details,
                 error="Aggregation failed" if aggregated_score is None else None,
+                sub_metrics=sub_metrics or None,
             )
 
         except Exception as e:
@@ -389,3 +399,15 @@ def _get_intended_turns(self, context: MetricContext) -> dict[int, str]:
     def _format_intended_turns(intended_turns: dict[int, str]) -> str:
         """Format intended turns dictionary as numbered list."""
         return "\n".join(f"Turn {turn_id}: {text}" for turn_id, text in intended_turns.items())
+
+    def build_sub_metrics(
+        self,
+        context: MetricContext,
+        per_turn_ratings: dict[int, int | None],
+        per_turn_failure_modes: dict[int, list[str]],
+    ) -> dict[str, MetricScore] | None:
+        """Return sub-metrics derived from per-turn data, or None.
+
+        Default returns None so the parent metric has no sub-metrics.
+        """
+        return None
diff --git a/src/eva/metrics/utils.py b/src/eva/metrics/utils.py
index 01415c4a..9ffad52c 100644
--- a/src/eva/metrics/utils.py
+++ b/src/eva/metrics/utils.py
@@ -448,6 +448,51 @@ def build_binary_flag_sub_metrics(
     return sub_metrics
 
 
+def build_per_category_rate_sub_metrics(
+    parent_name: str,
+    categories: tuple[str, ...],
+    rated_turn_ids: list[int],
+    per_turn_categories: dict[int, list[str] | None],
+    key_suffix: str = "_rate",
+) -> dict[str, MetricScore]:
+    """Build per-category rate sub-metrics from per-turn category tags.
+
+    For each known category, compute ``flagged_turns / rated_turns`` across the
+    given rated turns and emit a sub-metric named ``f"{parent_name}.{category}{key_suffix}"``.
+    Categories absent from ``per_turn_categories[turn_id]`` count as not flagged.
+    Categories the judge invents but that aren't in ``categories`` are silently
+    ignored (callers preserve them in ``details`` if they want them visible).
+
+    Args:
+        parent_name: Parent metric name (prefix in the sub-metric name).
+        categories: Ordered tuple of known category keys to emit a rate for.
+        rated_turn_ids: Turns to use as the denominator (typically those with a
+            non-null rating).
+        per_turn_categories: ``{turn_id: list of category strings the judge tagged}``.
+            Missing or ``None`` values are treated as no categories flagged.
+        key_suffix: Suffix appended to each category key (default ``"_rate"``).
+
+    Returns:
+        Dict of sub-metrics keyed by ``f"{category}{key_suffix}"``. Empty when
+        ``rated_turn_ids`` is empty.
+    """
+    sub_metrics: dict[str, MetricScore] = {}
+    num_rated = len(rated_turn_ids)
+    if num_rated == 0:
+        return sub_metrics
+    for category in categories:
+        flagged_ids = [tid for tid in rated_turn_ids if category in (per_turn_categories.get(tid) or [])]
+        sub_key = f"{category}{key_suffix}"
+        sub_metrics[sub_key] = make_rate_sub_metric(
+            parent_name=parent_name,
+            key=sub_key,
+            numerator=len(flagged_ids),
+            denominator=num_rated,
+            details={"count": len(flagged_ids), "num_rated": num_rated, "turn_ids": flagged_ids},
+        )
+    return sub_metrics
+
+
 def aggregate_per_turn_scores(scores: list[float | None], aggregation: str) -> float | None:
     """Aggregate per-turn scores using specified method.
 
diff --git a/tests/fixtures/metric_signatures.json b/tests/fixtures/metric_signatures.json
index fc79f392..83b66967 100644
--- a/tests/fixtures/metric_signatures.json
+++ b/tests/fixtures/metric_signatures.json
@@ -1,13 +1,13 @@
 {
   "AgentSpeechFidelityMetric": {
     "name": "agent_speech_fidelity",
-    "prompt_hash": "864be78919d2",
-    "source_hash": "77743114e9b0",
-    "version": "v0.1"
+    "prompt_hash": "c34b4ccf458f",
+    "source_hash": "2daa628603bd",
+    "version": "v0.2"
   },
   "AgentSpeechFidelityS2SMetric": {
     "name": "agent_speech_fidelity",
-    "prompt_hash": "864be78919d2",
+    "prompt_hash": "c34b4ccf458f",
     "source_hash": "5b3deb4968cd",
     "version": "v0.1"
   },
@@ -20,7 +20,7 @@
   "ConcisenessJudgeMetric": {
     "name": "conciseness",
     "prompt_hash": "5d033338d36a",
-    "source_hash": "cd0ea09a9613",
+    "source_hash": "cd73d4caaf6f",
     "version": "v0.1"
   },
   "ConversationCorrectlyFinishedMetric": {
diff --git a/tests/unit/metrics/test_speech_fidelity.py b/tests/unit/metrics/test_speech_fidelity.py
index 9628a569..13ee24ff 100644
--- a/tests/unit/metrics/test_speech_fidelity.py
+++ b/tests/unit/metrics/test_speech_fidelity.py
@@ -208,6 +208,121 @@ async def test_no_per_turn_normalized_in_details(self, agent_metric):
         assert "per_turn_normalized" not in result.details
 
 
+class TestAgentFailureModeSubMetrics:
+    """Verify failure_modes produce per-mode rate sub-metrics on the agent metric."""
+
+    @pytest.mark.asyncio
+    async def test_failure_modes_produce_rate_sub_metrics(self, agent_metric):
+        """Two rated turns, one flagged with two modes → 0.5 rates for each, 0.0 for others."""
+        response = make_judge_response(
+            [
+                {
+                    "turn_id": 0,
+                    "rating": 0,
+                    "explanation": "Wrong code and dropped sentence",
+                    "failure_modes": ["entity_error", "truncation"],
+                },
+                {"turn_id": 1, "rating": 1, "explanation": "Good", "failure_modes": []},
+            ]
+        )
+        agent_metric.llm_client.generate_text.return_value = (response, None)
+        with patch.object(agent_metric, "load_role_audio", return_value=MagicMock()):
+            with patch.object(agent_metric, "encode_audio_segment", return_value="base64audio"):
+                context = _default_context()
+                result = await agent_metric.compute(context)
+
+        assert result.sub_metrics is not None
+        assert set(result.sub_metrics.keys()) == {
+            "entity_error_rate",
+            "truncation_rate",
+            "garbled_hallucination_rate",
+            "insertion_hallucination_rate",
+        }
+        assert result.sub_metrics["entity_error_rate"].score == 0.5
+        assert result.sub_metrics["truncation_rate"].score == 0.5
+        assert result.sub_metrics["garbled_hallucination_rate"].score == 0.0
+        assert result.sub_metrics["insertion_hallucination_rate"].score == 0.0
+        assert result.sub_metrics["entity_error_rate"].name == "agent_speech_fidelity.entity_error_rate"
+        assert result.sub_metrics["entity_error_rate"].details == {
+            "count": 1,
+            "num_rated": 2,
+            "turn_ids": [0],
+        }
+        assert result.details["per_turn_failure_modes"] == {0: ["entity_error", "truncation"], 1: []}
+
+    @pytest.mark.asyncio
+    async def test_no_failure_modes_returned_all_zero(self, agent_metric):
+        """Missing failure_modes in judge response → all rates 0.0, still surfaced."""
+        response = make_judge_response(
+            [
+                {"turn_id": 0, "rating": 1, "explanation": "Good"},
+                {"turn_id": 1, "rating": 1, "explanation": "Good"},
+            ]
+        )
+        agent_metric.llm_client.generate_text.return_value = (response, None)
+        with patch.object(agent_metric, "load_role_audio", return_value=MagicMock()):
+            with patch.object(agent_metric, "encode_audio_segment", return_value="base64audio"):
+                context = _default_context()
+                result = await agent_metric.compute(context)
+
+        assert result.sub_metrics is not None
+        for key in (
+            "entity_error_rate",
+            "truncation_rate",
+            "garbled_hallucination_rate",
+            "insertion_hallucination_rate",
+        ):
+            assert result.sub_metrics[key].score == 0.0
+        assert result.details["per_turn_failure_modes"] == {0: [], 1: []}
+
+    @pytest.mark.asyncio
+    async def test_unknown_mode_ignored(self, agent_metric):
+        """Modes the judge invents are stored in details but do not produce sub-metrics."""
+        response = make_judge_response(
+            [
+                {
+                    "turn_id": 0,
+                    "rating": 0,
+                    "explanation": "Made up mode",
+                    "failure_modes": ["something_new", "entity_error"],
+                },
+                {"turn_id": 1, "rating": 1, "explanation": "Good", "failure_modes": []},
+            ]
+        )
+        agent_metric.llm_client.generate_text.return_value = (response, None)
+        with patch.object(agent_metric, "load_role_audio", return_value=MagicMock()):
+            with patch.object(agent_metric, "encode_audio_segment", return_value="base64audio"):
+                context = _default_context()
+                result = await agent_metric.compute(context)
+
+        # Unknown mode preserved in details, no extra sub-metric created
+        assert "something_new" in result.details["per_turn_failure_modes"][0]
+        assert set(result.sub_metrics.keys()) == {
+            "entity_error_rate",
+            "truncation_rate",
+            "garbled_hallucination_rate",
+            "insertion_hallucination_rate",
+        }
+        assert result.sub_metrics["entity_error_rate"].score == 0.5
+
+    @pytest.mark.asyncio
+    async def test_user_metric_has_no_sub_metrics(self, user_metric):
+        """user_speech_fidelity does not override build_sub_metrics, so no sub-metrics are surfaced."""
+        response = make_judge_response(
+            [
+                {"turn_id": 0, "rating": 3, "explanation": "Good"},
+                {"turn_id": 1, "rating": 1, "explanation": "Bad", "failure_modes": ["entity_error"]},
+            ]
+        )
+        user_metric.llm_client.generate_text.return_value = (response, None)
+        with patch.object(user_metric, "load_role_audio", return_value=MagicMock()):
+            with patch.object(user_metric, "encode_audio_segment", return_value="base64audio"):
+                context = _default_context()
+                result = await user_metric.compute(context)
+
+        assert result.sub_metrics is None
+
+
 class TestUserSpeechFidelityCompute:
     """Test user speech fidelity compute with 1-3 ratings."""
 

From 41aff9bed70a4544eba63b02ca390a767b388c43 Mon Sep 17 00:00:00 2001
From: Gabrielle Gauthier-Melancon <gabrielle.gm@servicenow.com>
Date: Fri, 15 May 2026 22:39:25 -0400
Subject: [PATCH 3/3] Clarify that interruption tags may be misplaced

---
 configs/prompts/judge.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configs/prompts/judge.yaml b/configs/prompts/judge.yaml
index 8fbb2073..c5fe8f8c 100644
--- a/configs/prompts/judge.yaml
+++ b/configs/prompts/judge.yaml
@@ -17,7 +17,7 @@ _shared:
     - `[speaker likely cut itself off]` — The speaker likely stopped on its own, possibly after detecting overlap or for other reasons. If the tag appears mid-turn, text after it is what the speaker said after resuming. If the tag appears at the end of the turn, the speaker did not resume in this turn. Text before this tag may not have been fully spoken.
     - `[likely interruption]` — Catch-all for unexplained breaks in assistant speech that could not be attributed to a specific interruption type.
 
-    **Scope of "text before this tag":** For all four cut-off tags (`[likely cut off by user]`, `[likely cut off by assistant]`, `[speaker likely cut itself off]`, `[likely interruption]`), "text before this tag" means **all text from the start of the turn (or from the previous resumption point) up to the tag** — not just the few adjacent words. This region can span multiple sentences and contain multiple entities (confirmation codes, dollar amounts, names, etc.). Any or all of it may have been silently dropped from the audio. The tag is the only signal you have about where speech was actually cut; do not assume the cut-off was small.
+    **Scope of "text before this tag":** For all four cut-off tags (`[likely cut off by user]`, `[likely cut off by assistant]`, `[speaker likely cut itself off]`, `[likely interruption]`), "text before this tag" means **all text from the start of the turn (or from the previous resumption point) up to the tag**, not just the few adjacent words. This region can span multiple sentences and contain multiple entities (confirmation codes, dollar amounts, names, etc.). Any or all of it may have been silently dropped from the audio. The tag only signals that an interruption occurred; it may not be placed exactly where speech was cut, so do not assume the cut-off was small.
 
 judge:
   conciseness: