
Add US pension funds analysis tasks #324

Open
ScuttleBot wants to merge 1 commit into main from tasks/csv-pension

Conversation

@ScuttleBot

Adds 3 new data-analysis tasks based on the US pension fund CSV dataset (assets/csvs/us_pension_by_state.csv):

  1. task_csv_pension_ranking — Rank states by total pension payments, payee counts, and identify geographic patterns
  2. task_csv_pension_liability — Calculate average payouts, deferred counts, and projected future liabilities
  3. task_csv_pension_risk — Assess risk via deferred-to-payee ratios, concentration analysis, district hotspots, and tier classification
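
As an illustration of the risk task's core metric, here is a minimal sketch of the deferred-to-payee ratio computation. The column names (state, payees, deferred) and the sample figures are assumptions for illustration, not the actual contents of assets/csvs/us_pension_by_state.csv:

```python
import csv
import io

# Assumed schema and made-up sample figures; the real
# assets/csvs/us_pension_by_state.csv may differ.
sample = """state,payees,deferred
DC,1000,450
NJ,20000,7000
OH,90000,9000
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Deferred-to-payee ratio per state, ranked highest first.
ratios = sorted(
    ((r["state"], int(r["deferred"]) / int(r["payees"])) for r in rows),
    key=lambda item: item[1],
    reverse=True,
)
for state, ratio in ratios:
    print(f"{state}: {ratio:.2f}")
```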

All tasks include hybrid grading (automated Python checks + LLM judge rubric), expected values derived from the actual data, and comprehensive grading criteria.
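
The 60/40 hybrid weighting can be sketched as follows. This is an illustration of the scheme described above, not the repo's actual grading code:

```python
from typing import Optional

# Illustrative only: the review describes a 60% automated / 40% LLM-judge
# split; the real weighting function is an assumption here.
def hybrid_score(automated: float, judge: Optional[float],
                 auto_weight: float = 0.6) -> float:
    """Combine automated-check and LLM-judge scores (both in [0, 1])."""
    if judge is None:
        # A judge that fails to return a parseable result forfeits
        # its whole 40% share of the score.
        return automated * auto_weight
    return auto_weight * automated + (1 - auto_weight) * judge

print(hybrid_score(1.0, None))            # 0.6
print(round(hybrid_score(1.0, 0.94), 3))  # 0.976
```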

Closes #244, Closes #247, Closes #257

@kilo-code-bot
Contributor

kilo-code-bot bot commented Apr 14, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid work on these three pension analysis tasks. The grading functions are well-structured with proper early-return guards, the CSV asset path follows the established convention (assets/csvs/us_pension_by_state.csv), and the hybrid grading approach (60% automated / 40% LLM judge) provides good coverage for cases where the regex checks are intentionally lenient.

Files Reviewed (4 files)
  • tasks/manifest.yaml
  • tasks/task_csv_pension_ranking.md
  • tasks/task_csv_pension_liability.md
  • tasks/task_csv_pension_risk.md



Reviewed by claude-sonnet-4.6 · 165,457 tokens

@ScuttleBot
Author

🧪 Test Started

Instance: 155.138.235.245 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/csv-pension @ f8df529

Models Being Tested

  1. openrouter/anthropic/claude-opus-4.6
  2. openrouter/openai/gpt-5.4
  3. openrouter/google/gemini-3-pro

Tasks Being Tested

  • task_csv_pension_ranking — State ranking by pension payments
  • task_csv_pension_liability — Liability/exposure analysis
  • task_csv_pension_risk — Risk assessment with tier classification

Execution Plan

All 3 models running in parallel via background processes.
Suite filter: --suite task_csv_pension_ranking,task_csv_pension_liability,task_csv_pension_risk

Estimated completion: ~15-20 minutes (3 tasks × 180s timeout each, parallel)

Automated PR test by ScuttleBot 🦀

@ScuttleBot
Author

🧪 PR Test Results — US Pension Funds Analysis Tasks

Instance: 155.138.235.245 (vc2-2c-4gb, ATL)
Branch: tasks/csv-pension @ f8df529
Run time: ~22 minutes total (models run in parallel)

Score Summary

Task                          Claude Opus 4.6   GPT-5.4    Gemini 3.1 Pro
task_csv_pension_ranking      85%               50% ⚠️     86%
task_csv_pension_liability    60% ⚠️            90%        98% 🏆
task_csv_pension_risk         68%               78%        82%
Overall                       70.9%             72.7%      88.7% 🏆

Cost & Efficiency

Metric         Claude Opus 4.6   GPT-5.4   Gemini 3.1 Pro
Total tokens   430,146           355,940   309,149
Total cost     $1.08             $0.44     $0.56
API requests   14                16        16
Score/dollar   1.98              4.98      4.75

Detailed Breakdown

task_csv_pension_ranking — All models scored 100% on automated checks except bottom_states (0%) and grand_total (50%)

Automated checks (all 3 models):

  • report_created, top_10_listed, ohio_first, pennsylvania_second, florida_third, payee_count_ranking, geographic_analysis — all passed
  • bottom_states: all models scored 0.0 — all excluded territories, which is arguably correct behavior given the prompt says "excluding territories"
  • ⚠️ grand_total: all scored 0.5 — partial match on formatting

LLM judge divergence:

  • Claude: Judge gave 0.99/1.0 (excellent)
  • GPT: Judge gave 0.0/1.0 — judge claimed "parts 2-4 not provided" (transcript chunking issue, not model failure)
  • Gemini: Judge gave 0.90/1.0

⚠️ Note: GPT's ranking score (50%) is artificially depressed by a judge transcript rendering bug, not actual task performance. Automated checks show GPT performed identically to Claude on this task.

task_csv_pension_liability — Gemini near-perfect, GPT strong, Claude hurt by judge failure

Automated checks: All 3 models scored 100% on ALL automated checks (9/9 passed for each).

LLM judge:

  • Claude: Judge failed entirely ("no parseable response after all attempts") → scored 60% using only automated checks
  • GPT: Judge scored 0.75 — "calculations correct, but limited discussion of dimension interactions, financial insight weak"
  • Gemini: Judge scored 0.94 — "all calculations correct, three dimensions clearly presented, good financial insight"

⚠️ Note: Claude's 60% is entirely due to judge failure. All automated checks passed 100%. Real score is likely 85-95%.

task_csv_pension_risk — Most differentiating task

Automated checks:

Checks run for each model: report_created, ratio_ranking, dc_highest_ratio, nj_second_ratio, concentration_risk, high_risk_states, risk_tiers, summary_recommendations, district_hotspots, oh13_top_district.

LLM judge — Claude got hammered:

  • Claude: 0.19/1.0 — "calculated wrong metrics, missed deferred-to-payee RATIO, no tier classification attempted"
  • GPT: 0.75/1.0 — "Strong technical execution, DC/NJ outliers correct, tier framework defined"
  • Gemini: 0.71/1.0 — "Key ratios correct, all analytical levels present, recommendations generic"

⚠️ Interesting conflict: Claude passed ALL 10 automated checks (including ratio_ranking and dc_highest_ratio) but the LLM judge said it "completely missed the deferred-to-payee RATIO." This suggests a disagreement between what the automated grader detected vs what the judge saw in the full report.

Issues Found

🔴 Judge Reliability Issues

  1. GPT ranking: Judge received incomplete transcript chunks → scored 0/1.0 despite model performing correctly
  2. Claude liability: Judge failed to produce parseable output after all attempts
  3. Claude risk: Automated checks (100%) vs LLM judge (19%) show major disagreement

Recommendation: Investigate transcript chunking for the LLM judge. The current hybrid scoring amplifies judge failures disproportionately.
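
One possible mitigation, sketched here as an assumption rather than a description of the current grader: fall back to the automated-only score when the judge fails, so a judge outage does not read as a 40-point penalty.

```python
# Assumed fix, not current repo behavior: a failed judge triggers an
# automated-only fallback instead of zeroing out the judge's share.
def robust_hybrid(automated, judge, auto_weight=0.6):
    if judge is None:
        return automated  # automated-only fallback, unscaled
    return auto_weight * automated + (1 - auto_weight) * judge

print(robust_hybrid(1.0, None))  # 1.0 instead of a forfeited 0.6
```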

🟡 bottom_states Automated Check

All 3 models scored 0.0 on bottom_states for the ranking task. The prompt says "excluding territories and entries with $0" but the expected values include territories (Northern Mariana Islands, American Samoa, Armed Forces Pacific). Models correctly excluded territories per instructions, then got penalized for it.

Recommendation: Either update the prompt to explicitly include territories in bottom rankings, or update the grader to accept territory-excluded results.
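
The second option could look like the sketch below. The function name and signature are hypothetical; only the territory names come from the report.

```python
# Hypothetical grader tweak: give full credit whether or not the three
# territories named in the report were excluded from the bottom list.
TERRITORIES = {
    "Northern Mariana Islands",
    "American Samoa",
    "Armed Forces Pacific",
}

def check_bottom_states(found, expected):
    """Full credit if `found` matches `expected` exactly, or matches
    `expected` with territories filtered out (per the prompt's wording)."""
    territory_free = [s for s in expected if s not in TERRITORIES]
    return 1.0 if found in (expected, territory_free) else 0.0
```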

🟡 grand_total Partial Match

All models scored 0.5 on grand_total. Likely a formatting sensitivity issue in the regex checks.
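
A more tolerant matcher might compare the numeric value rather than the exact string. This is a sketch; the actual expected value and the current regex are not shown in this report.

```python
import re

def matches_total(text, expected):
    """Find any dollar amount regardless of comma or space grouping
    and compare it numerically to the expected total."""
    for m in re.finditer(r"\$?\s*([\d][\d,\s]*\.?\d*)", text):
        digits = re.sub(r"[,\s]", "", m.group(1))
        try:
            if abs(float(digits) - expected) < 0.01:
                return True
        except ValueError:
            pass
    return False

print(matches_total("Grand total: $1,234,567.89", 1234567.89))  # True
print(matches_total("Grand total: 1234567.89 USD", 1234567.89)) # True
```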

🟢 CSV Asset File

The us_pension_by_state.csv asset file loaded correctly for all tasks on all models. workspace_files config works as expected.

Verdict

Aspect              Assessment
Task quality        ✅ Good — tasks test meaningful data analysis skills with multi-dimensional thinking
Difficulty          ✅ Appropriate — 50-98% range across frontier models, no task trivially easy or impossibly hard
Automated grading   ⚠️ Needs fixes — bottom_states check penalizes correct behavior, grand_total too format-sensitive
LLM judge grading   🔴 Unreliable — 3/9 judge evaluations failed or produced questionable results
Differentiation     ✅ Good — risk task especially differentiates model capabilities

Recommendation: Merge with fixes

The tasks themselves are well-designed and test real analytical capability. Before merging:

  1. Fix bottom_states grader — accept territory-excluded results OR clarify prompt
  2. Fix grand_total grader — make regex more tolerant of formatting variants
  3. Consider judge prompt chunking — large transcripts are being split in ways the judge can't handle
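
Point 3 could be addressed with overlap-aware chunking so no chunk starts mid-section. The budget and overlap values below are illustrative assumptions:

```python
# Sketch of judge-side transcript chunking with overlap; sizes are
# illustrative, not tuned for any particular judge model.
def chunk_transcript(text, budget=4000, overlap=200):
    chunks, start = [], 0
    while start < len(text):
        end = min(start + budget, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across chunks
    return chunks

parts = chunk_transcript("x" * 10000)
print(len(parts))  # 3
```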

The pension risk task is particularly strong — it requires multi-level analysis (state → district), ratio calculations, and framework construction that clearly differentiates models.


Tested by ScuttleBot 🦀 | Instance destroyed after test
