Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid work on these three pension analysis tasks. The grading functions are well-structured with proper early-return guards, and the CSV asset path follows the established convention.

Files Reviewed (4 files)

Reviewed by claude-sonnet-4.6 · 165,457 tokens
🧪 Test Started

Instance:

Models Being Tested
Tasks Being Tested
Execution Plan

All 3 models running in parallel via background processes. Estimated completion: ~15-20 minutes (3 tasks × 180s timeout each, parallel).

Automated PR test by ScuttleBot 🦀
🧪 PR Test Results — US Pension Funds Analysis Tasks

Instance:

Score Summary
Cost & Efficiency
Detailed Breakdown

task_csv_pension_ranking — All models scored 100% on automated checks except where marked ❌ below:
| Check | Claude | GPT | Gemini |
|---|---|---|---|
| report_created | ✅ | ✅ | ✅ |
| ratio_ranking | ✅ | ✅ | ✅ |
| dc_highest_ratio | ✅ | ✅ | ✅ |
| nj_second_ratio | ✅ | ✅ | ✅ |
| concentration_risk | ✅ | ✅ | ✅ |
| high_risk_states | ✅ | ✅ | ✅ |
| risk_tiers | ✅ | ✅ | ✅ |
| summary_recommendations | ✅ | ✅ | ✅ |
| district_hotspots | ✅ | ❌ | ✅ |
| oh13_top_district | ✅ | ❌ | ❌ |
LLM judge — Claude got hammered:
- Claude: 0.19/1.0 — "calculated wrong metrics, missed deferred-to-payee RATIO, no tier classification attempted"
- GPT: 0.75/1.0 — "Strong technical execution, DC/NJ outliers correct, tier framework defined"
- Gemini: 0.71/1.0 — "Key ratios correct, all analytical levels present, recommendations generic"
Claude passed the automated checks (including ratio_ranking and dc_highest_ratio), but the LLM judge said it "completely missed the deferred-to-payee RATIO." This suggests a disagreement between what the automated grader detected and what the judge saw in the full report.
Issues Found
🔴 Judge Reliability Issues
- GPT ranking: Judge received incomplete transcript chunks → scored 0/1.0 despite model performing correctly
- Claude liability: Judge failed to produce parseable output after all attempts
- Claude risk: Automated checks (100%) vs LLM judge (19%) show major disagreement
Recommendation: Investigate transcript chunking for the LLM judge. The current hybrid scoring amplifies judge failures disproportionately.
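One simple way to stop a failed judge run from zeroing out a hybrid score is to fall back to the automated score when the judge output is unparseable. A minimal sketch, assuming a 50/50 weighting and function names that are illustrative, not the repo's actual scorer:

```python
def hybrid_score(auto_score, judge_score, judge_weight=0.5):
    """Combine automated-check and LLM-judge scores (both 0.0-1.0).

    If the judge failed to produce a parseable score (None), fall back
    to the automated score alone instead of treating the judge as 0.
    """
    if judge_score is None:
        return auto_score
    return (1 - judge_weight) * auto_score + judge_weight * judge_score

# Automated checks passed (1.0) but the judge output was unparseable:
print(hybrid_score(1.0, None))   # → 1.0 (fallback, not 0.5)
print(hybrid_score(1.0, 0.19))   # → 0.595
```

With a guard like this, the "Claude liability" failure mode (judge produced no parseable output) degrades gracefully instead of dominating the final score.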
🟡 bottom_states Automated Check
All 3 models scored 0.0 on bottom_states for the ranking task. The prompt says "excluding territories and entries with $0" but the expected values include territories (Northern Mariana Islands, American Samoa, Armed Forces Pacific). Models correctly excluded territories per instructions, then got penalized for it.
Recommendation: Either update the prompt to explicitly include territories in bottom rankings, or update the grader to accept territory-excluded results.
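If the grader route is chosen, a tweak along these lines could accept both readings. The territory list, helper name, and expected values are illustrative, not the repo's actual grading code:

```python
# Hypothetical grader helper -- territory names are examples, not the
# definitive exclusion list used by the real task.
TERRITORIES = {
    "Northern Mariana Islands", "American Samoa", "Armed Forces Pacific",
    "Guam", "Puerto Rico", "U.S. Virgin Islands",
}

def check_bottom_states(reported, expected_with_territories):
    """Pass if the reported bottom-N list matches either the
    territory-included expectation or the territory-excluded one."""
    n = len(reported)
    expected_without = [s for s in expected_with_territories
                       if s not in TERRITORIES]
    return (reported == expected_with_territories[:n]
            or reported == expected_without[:n])
```

This way a model that followed the prompt's "excluding territories" instruction passes, and so does one that matched the current expected values.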
🟡 grand_total Partial Match
All models scored 0.5 on grand_total. Likely a formatting sensitivity issue in the regex checks.
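A more tolerant check could normalize numbers out of the report text before comparing, rather than matching one exact format. A sketch, assuming the grader compares against a known numeric total (the example value here is a placeholder, not the dataset's real grand total):

```python
import re

def check_grand_total(report_text, expected, rel_tol=0.001):
    """Accept '$5,120,000,000', '5120000000', '$5.12B', etc."""
    # Pull every number (optional $, commas, decimals, B/M suffix).
    pattern = r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*([BbMm])?"
    for m in re.finditer(pattern, report_text):
        value = float(m.group(1).replace(",", ""))
        suffix = (m.group(2) or "").upper()
        value *= {"B": 1e9, "M": 1e6}.get(suffix, 1)
        if abs(value - expected) <= rel_tol * expected:
            return True
    return False

print(check_grand_total("Grand total: $5.12B", 5_120_000_000))  # → True
```

Scanning all numbers in the text and comparing within a relative tolerance sidesteps most formatting variants in one pass.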
🟢 CSV Asset File
The us_pension_by_state.csv asset file loaded correctly for all tasks on all models. workspace_files config works as expected.
Verdict
| Aspect | Assessment |
|---|---|
| Task quality | ✅ Good — tasks test meaningful data analysis skills with multi-dimensional thinking |
| Difficulty | ✅ Appropriate — 50-98% range across frontier models, no task trivially easy or impossibly hard |
| Automated grading | 🟡 Fixable — bottom_states check penalizes correct behavior, grand_total too format-sensitive |
| LLM judge grading | 🔴 Unreliable — 3/9 judge evaluations failed or produced questionable results |
| Differentiation | ✅ Good — risk task especially differentiates model capabilities |
Recommendation: Merge with fixes
The tasks themselves are well-designed and test real analytical capability. Before merging:
- Fix bottom_states grader — accept territory-excluded results OR clarify prompt
- Fix grand_total grader — make regex more tolerant of formatting variants
- Consider judge prompt chunking — large transcripts are being split in ways the judge can't handle
The pension risk task is particularly strong — it requires multi-level analysis (state → district), ratio calculations, and framework construction that clearly differentiates models.
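As an illustration of the deferred-to-payee ratio calculation the judge keyed on, computed here on toy data (the states, member counts, and column meanings are made up for illustration, not the actual CSV schema):

```python
# Toy stand-in for the pension CSV -- values are invented so that DC
# ranks first and NJ second, mirroring the task's expected ordering.
rows = [
    {"state": "DC", "deferred": 120_000, "payees": 40_000},
    {"state": "NJ", "deferred": 310_000, "payees": 155_000},
    {"state": "OH", "deferred": 95_000, "payees": 80_000},
]

# Deferred-to-payee ratio per state, then rank descending.
for r in rows:
    r["ratio"] = r["deferred"] / r["payees"]
ranked = sorted(rows, key=lambda r: r["ratio"], reverse=True)

print([(r["state"], round(r["ratio"], 2)) for r in ranked])
# → [('DC', 3.0), ('NJ', 2.0), ('OH', 1.19)]
```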
Tested by ScuttleBot 🦀 | Instance destroyed after test
Adds 3 new data-analysis tasks based on the US pension fund CSV dataset (assets/csvs/us_pension_by_state.csv). All tasks include hybrid grading (automated Python checks + LLM judge rubric), expected values derived from the actual data, and comprehensive grading criteria.
Closes #244, Closes #247, Closes #257