Add research and standalone tasks #327

Open
ScuttleBot wants to merge 1 commit into main from tasks/research-standalone

@ScuttleBot

Add 12 research and standalone tasks to PinchBench.

Tasks Added

  1. task_codebase_navigation — Navigate unfamiliar codebase to find where auth is handled (Closes [task-proposal] Codebase Navigation #143)
  2. task_deep_research — Research WebAssembly outside the browser with primary source citations (Closes [task-proposal] Deep Research #145)
  3. task_competitive_research — Compare GitHub Copilot, Cursor, and Kilo Code (Closes [task-proposal] Competitive Research #146)
  4. task_oss_alternative_research — Find open source alternatives to Notion for self-hosting (Closes [task-proposal] Open Source Alternative Research #147)
  5. task_video_transcript_extraction — Extract YouTube transcript and create structured summary (Closes [task-proposal] Video Transcript Extraction #157)
  6. task_browser_automation — Write Playwright e2e test for a shopping cart HTML page (Closes [task-proposal] Browser Automation #158)
  7. task_pricing_research — Compare managed PostgreSQL pricing across 5 providers (Closes [task-proposal] Pricing Research #163)
  8. task_it_procurement — Research developer laptops for a 50-person startup (Closes [task-proposal] IT Equipment Procurement #164)
  9. task_eu_regulation_research — EU AI Act compliance briefing for AI developer tools (Closes [task-proposal] EU Regulation Research #165)
  10. task_byok_best_practices — Best practices guide for BYOK in AI inference apps (Closes [task-proposal] BYOK Best Practices #166)
  11. task_cron_organizer — Convert natural language to cron expressions (Closes [task-proposal] Cron Job Organizer #167)
  12. task_subway_navigation — Plan NYC subway route from text-based map (Closes [task-proposal] Subway Navigation #168)

Categories

| Category | Tasks | Grading |
| --- | --- | --- |
| Research | deep_research, competitive_research, oss_alternative_research, pricing_research, it_procurement, eu_regulation_research, byok_best_practices, video_transcript_extraction | llm_judge |
| Developer | codebase_navigation, browser_automation | hybrid |
| Productivity | cron_organizer | automated |
| Navigation | subway_navigation | llm_judge |

Notes

  • Research tasks use timeout: 300s to allow for web search and report composition
  • browser_automation includes an embedded HTML shopping cart asset (shop.html)
  • subway_navigation uses a text-based subway map (subway_map.md) instead of an image for broad agent compatibility
  • cron_organizer has fully automated grading with exact cron expression matching
  • All tasks include detailed grading criteria and rubrics
  • Lint passes: python3 scripts/lint_manifest.py → OK (65 tasks)
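
As a rough illustration of what "exact cron expression matching" could look like for cron_organizer, here is a minimal sketch. It is not the grader shipped in this PR: the function names, the dictionary layout, and the sample prompt keys are all hypothetical, and collapsing whitespace before comparing is an assumption about how strict "exact" is meant to be.

```python
def normalize_cron(expr: str) -> str:
    """Collapse runs of whitespace so field spacing does not affect the comparison."""
    return " ".join(expr.split())


def grade_cron(answers: dict[str, str], expected: dict[str, str]) -> float:
    """Return the fraction of expected entries whose cron expression matches exactly."""
    if not expected:
        return 0.0
    correct = sum(
        1
        for key, want in expected.items()
        if key in answers and normalize_cron(answers[key]) == normalize_cron(want)
    )
    return correct / len(expected)


# Hypothetical sample data: keys and expressions are illustrative only.
expected = {"weekday_9am": "0 9 * * 1-5", "every_15_min": "*/15 * * * *"}
answers = {"weekday_9am": "0  9 * * 1-5", "every_15_min": "0,15,30,45 * * * *"}

score = grade_cron(answers, expected)
print(score)
```

Note that string-level matching of this kind scores `0,15,30,45 * * * *` as wrong even though it fires at the same times as `*/15 * * * *`. Whether that strictness is intended is a design choice for the task author.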

@kilo-code-bot
Contributor

kilo-code-bot bot commented Apr 14, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

This PR is well-structured — each task file has clear prompts, grading criteria, automated checks, and LLM judge rubrics. The manifest entries are consistent with the existing pattern.

Files Reviewed (13 files)
  • tasks/manifest.yaml
  • tasks/task_browser_automation.md
  • tasks/task_byok_best_practices.md
  • tasks/task_codebase_navigation.md
  • tasks/task_competitive_research.md
  • tasks/task_cron_organizer.md
  • tasks/task_deep_research.md
  • tasks/task_eu_regulation_research.md
  • tasks/task_it_procurement.md
  • tasks/task_oss_alternative_research.md
  • tasks/task_pricing_research.md
  • tasks/task_subway_navigation.md
  • tasks/task_video_transcript_extraction.md

Reviewed by claude-4.6-sonnet-20260217 · 146,378 tokens

@ScuttleBot
Author

🧪 Test Started

Branch: tasks/research-standalone
Triggered by: ScuttleBot (automated PR testing)
Time: 2026-04-15 14:03 UTC

Instances

| Instance | IP | Model |
| --- | --- | --- |
| pr327-test-1 | 66.42.84.134 | openrouter/anthropic/claude-opus-4.6 |
| pr327-test-2 | 144.202.21.233 | openrouter/openai/gpt-5.4 |
| pr327-test-3 | 155.138.235.245 | openrouter/google/gemini-3.1-pro-preview |

Tasks (12 new)

task_codebase_navigation,task_deep_research,task_competitive_research,task_oss_alternative_research,task_video_transcript_extraction,task_browser_automation,task_pricing_research,task_it_procurement,task_eu_regulation_research,task_byok_best_practices,task_cron_organizer,task_subway_navigation

Plan

  • Running all 3 models in parallel on separate Vultr instances (vc2-2c-4gb, ATL)
  • Using --suite filter to run only the 12 new PR tasks
  • Using --no-upload (unofficial test run)
  • ETA: ~45-60 minutes (research tasks have 300s timeouts)

Will post results summary when all runs complete.
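
For context on the ETA, a back-of-the-envelope check using figures from the PR description: the Research category lists 8 tasks, each with a 300s timeout. The sketch below assumes tasks run one after another on each instance, which the plan does not state, and it leaves out the 4 non-research tasks because their timeouts are not given.

```python
# Figures taken from the PR description; sequential execution is an assumption.
RESEARCH_TASKS = 8        # tasks listed under the Research category
RESEARCH_TIMEOUT_S = 300  # per-task timeout for research tasks

# Upper bound on time spent in research tasks if every one hits its timeout.
worst_case_minutes = RESEARCH_TASKS * RESEARCH_TIMEOUT_S / 60
print(f"Research tasks, worst case: {worst_case_minutes:.0f} min")
```

Under these assumptions the research tasks alone account for up to 40 minutes per instance. The three instances run in parallel, so the overall wall-clock time is set by the slowest of them.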