
Add NTIA advisory board analysis tasks #314

Open
ScuttleBot wants to merge 1 commit into main from tasks/meeting-advisory

Conversation

@ScuttleBot

Adds 5 new meeting analysis tasks based on the NTIA CSMAC advisory board transcript (May 30, 2012 meeting on spectrum sharing in the 1755-1850 MHz band).

Tasks

  1. task_meeting_advisory_attendees — Create structured attendee list with roles, organizations, attendance mode (in-person vs phone), and speaking participation
  2. task_meeting_advisory_stakeholders — Identify stakeholder groups, their interests, positions on sharing vs relocation, and map key tensions/agreements
  3. task_meeting_advisory_technical — Extract technical discussions including frequency bands, federal systems, interference challenges, five working group assignments, and the commercial parameters debate
  4. task_meeting_advisory_timeline — Extract all timelines, deadlines, and milestones (historical references, working group deadlines, transition timeframes)
  5. task_meeting_advisory_acronyms — Build comprehensive acronym glossary with expansions, context, and categorization

All tasks use the same transcript asset: assets/meetings/2012-05-30-meeting-transcript-ntia-csmac.md

Closes #191, Closes #192, Closes #193, Closes #194, Closes #195

@kilo-code-bot
Contributor

kilo-code-bot bot commented Apr 14, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid set of task definitions. The grading functions are well-structured, the regex patterns cover expected variations in agent output, and the LLM judge rubrics are clearly calibrated with sensible weight distributions. The fallback alternative filename logic in each grader is a nice touch.

Files Reviewed (6 files)
  • tasks/manifest.yaml
  • tasks/task_meeting_advisory_acronyms.md
  • tasks/task_meeting_advisory_attendees.md
  • tasks/task_meeting_advisory_stakeholders.md
  • tasks/task_meeting_advisory_technical.md
  • tasks/task_meeting_advisory_timeline.md

Reviewed by claude-4.6-sonnet-20260217 · 178,509 tokens

@ScuttleBot
Author

🧪 PR Test Started

Instance: 96.30.205.187 (Vultr vc2-2c-4gb, ATL)
Instance ID: 91b97dfc-a964-4efd-bfbc-6edeac1ba2a1

Models Being Tested

  • openrouter/anthropic/claude-opus-4.6
  • openrouter/openai/gpt-5.4
  • openrouter/google/gemini-3.1-pro-preview

Tasks Being Tested

  • task_meeting_advisory_attendees
  • task_meeting_advisory_stakeholders
  • task_meeting_advisory_technical
  • task_meeting_advisory_timeline
  • task_meeting_advisory_acronyms

Timeline

  • Started: 2026-04-14 14:46 UTC
  • All 3 models running in parallel
  • Estimated completion: ~30-45 minutes

Automated test by ScuttleBot 🦀

@ScuttleBot
Author

🧪 PR Test Started (Run 2)

Instance: 66.42.84.134 (Vultr vc2-2c-4gb, ATL)
Instance ID: 2e4bc5aa-e565-4701-95d8-ee8718494369

Models Being Tested

  • openrouter/anthropic/claude-opus-4.6
  • openrouter/openai/gpt-5.4
  • openrouter/google/gemini-3-pro

Tasks Being Tested

  • task_meeting_advisory_attendees
  • task_meeting_advisory_stakeholders
  • task_meeting_advisory_technical
  • task_meeting_advisory_timeline
  • task_meeting_advisory_acronyms

Timeline

  • Started: 2026-04-15 12:08 UTC
  • All 3 models running in parallel
  • Estimated completion: ~30-45 minutes

Automated test by ScuttleBot 🦀

@ScuttleBot
Author

🧪 PR Test Results — NTIA Advisory Board Tasks

Instance: 66.42.84.134 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/meeting-advisory
Duration: ~23 minutes (all 3 models in parallel)

Score Summary

| Task | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| attendees | 89% | 84% | 85% |
| stakeholders | ⚠️ 12% | 90% | 88% |
| technical | 85% | ⚠️ 27% | 83% |
| timeline | 96% | 92% | 91% |
| acronyms | 91% | 94% | 94% |
| **Overall** | 74.6% | 77.4% | 88.2% |

Token Efficiency

| Model | Total Tokens | API Requests | Tokens/Task | Score/1K Tokens |
| --- | --- | --- | --- | --- |
| Opus 4.6 | 647,735 | 21 | 129,547 | 0.0058 |
| GPT-5.4 | 661,063 | 24 | 132,213 | 0.0059 |
| Gemini 3.1 Pro | 402,879 | 17 | 80,576 | 0.0109 |

Notable Issues

1. Opus 12% on stakeholders (wrong output file)
The agent produced technical_discussions.md focused on technical topics and working group structure instead of the expected stakeholder_analysis.md. Automated checks scored 0/9 criteria. LLM judge gave 0.29 because the file tangentially mentioned some government entities. This looks like a prompt confusion issue — the agent may have conflated this task with the technical task.

2. GPT-5.4 27% on technical (automated grader file mismatch)
The agent wrote a 26,877-byte technical_discussions.md and the LLM judge scored 0.68 (solid). But automated checks scored 0.0 — likely the grader couldn't find the expected file or the content didn't match regex patterns. The task prompt asks for technical_report.md but the grader checks for technical_discussions.md as a fallback. Worth investigating whether the grader alternatives list covers enough filenames.

3. Opus 0.0 LLM judge on technical despite 100% automated
Opus got perfect automated scores on technical but the LLM judge returned 0. The judge's raw response was empty/malformed. This inflated the automated portion to 84.8% overall but the judge portion contributed nothing. Possible judge timeout or context issue with concurrent judging.

Manifest Issue

The branch adds 11 entries to manifest.yaml but only 5 task files exist. The 6 missing entries (task_meeting_executive_summary, task_meeting_sentiment_analysis, task_meeting_follow_up_email, task_meeting_blog_post, task_meeting_tldr, task_meeting_searchable_index) generate ERROR log lines during loading. These should be removed from the manifest until the task files are ready, or the task files should be included in this PR.
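A quick consistency check could catch this kind of manifest drift before a test run spins up. A minimal sketch, assuming manifest entries reference tasks by a `task_*` name and each task file lives at `tasks/<name>.md` (the real manifest.yaml schema may differ):

```python
import re
from pathlib import Path


def find_orphan_manifest_entries(manifest_text: str, tasks_dir: Path) -> list:
    """Return manifest task names with no matching <name>.md file in tasks_dir.

    Assumes task names in the manifest look like 'task_meeting_advisory_attendees';
    a stricter check would parse the YAML schema directly.
    """
    names = sorted(set(re.findall(r"\btask_[a-z0-9_]+\b", manifest_text)))
    return [n for n in names if not (tasks_dir / f"{n}.md").is_file()]
```

Run against this branch, it should list the six orphaned entries so they can be pruned (or flagged in CI) before loading produces ERROR lines.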

Recommendation: Merge with minor fixes

The 5 advisory board tasks are solid:

  • Good difficulty range: 83-96% for the best model across tasks, with enough variance to differentiate models
  • Gemini's consistent 85-94% across all 5 tasks shows the tasks are well-calibrated
  • The filename mismatches that tanked Opus/GPT scores are agent behavioral issues, not task design flaws
  • Grading functions work correctly; LLM judge rubrics are sensible and provide meaningful differentiation

Suggested fixes before merge:

  1. Remove the 6 manifest entries for non-existent task files
  2. Consider adding a few more alternative filenames to the grader fallback lists (e.g., stakeholders.md, technical_report.md, technical_analysis.md) since models tend to improvise filenames
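The fallback widening in fix 2 could look something like this; a sketch only, with an illustrative candidate list rather than the graders' actual alternatives:

```python
from pathlib import Path

# Illustrative candidates for the technical task; the graders' real
# alternatives lists may differ.
TECHNICAL_CANDIDATES = [
    "technical_report.md",       # the filename the task prompt asks for
    "technical_discussions.md",  # the fallback already present
    "technical_analysis.md",     # names models tend to improvise
]


def locate_output(workdir: Path, candidates) -> Path:
    """Return the first candidate output file that exists under workdir."""
    for name in candidates:
        path = workdir / name
        if path.is_file():
            return path
    # Last resort: accept a single improvised filename matching the topic,
    # but bail out if the match is ambiguous.
    matches = sorted(workdir.glob("*technical*.md"))
    return matches[0] if len(matches) == 1 else None
```

Keeping the explicit candidate list first preserves deterministic grading; the glob is only a tiebreaker for one-off improvised names like the 26,877-byte technical_discussions.md case above.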

Automated test by ScuttleBot 🦀 | Instance destroyed after testing
