feat(evaluation): add DatasetClient and dataset management service provider#490

Closed
jariy17 wants to merge 18 commits into aws:main from jariy17:feat/dataset-management-sdk

@jariy17
Contributor

@jariy17 jariy17 commented May 21, 2026

Summary

  • Adds DatasetClient, a high-level wrapper for Bedrock AgentCore dataset management operations (create/get/list/delete datasets and versions, upload/download via presigned URLs, polling helpers).
  • Adds DatasetManagementServiceProvider (formerly ServiceDatasetProvider) that loads evaluation scenarios from a managed dataset, alongside the existing FileDatasetProvider which now also supports JSONL files.
  • Refactors Scenario classes so each owns its schema_type, and moves _parse_scenario to module level for reuse between file- and service-backed providers.
  • Streams JSONL downloads, extracts the region from the agent ARN (so a separate BEDROCK_TEST_REGION variable is no longer needed), and consolidates test fixtures.
  • Adds unit and integration tests for the new client and provider, plus a runner integ test that exercises a real agent.
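
For reviewers, a minimal sketch of what "extracts region from agent ARN" can look like. It relies only on the standard ARN layout (`arn:partition:service:region:account-id:resource`). The function name `region_from_arn` and the example ARN are illustrative assumptions, not the code in this PR.

```python
def region_from_arn(arn: str) -> str:
    """Return the region component of an AWS ARN.

    Hypothetical helper: ARNs have the form
    arn:partition:service:region:account-id:resource,
    so the region is the fourth colon-separated field.
    """
    # Split at most 5 times so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    region = parts[3]
    if not region:
        # Some global resources (for example S3 buckets) carry no region.
        raise ValueError(f"ARN has no region component: {arn!r}")
    return region
```

Deriving the region this way keeps the agent ARN as the single source of truth, so the test configuration cannot point at a region that disagrees with the agent.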

Test plan

  • uv run pytest tests/bedrock_agentcore/evaluation passes
  • uv run pytest tests_integ/evaluation passes against a configured AWS account
  • Lint/format checks pass (ruff, line-length)
  • Manual: create dataset → upload JSONL → run evaluation via DatasetManagementServiceProvider end-to-end

jariy17 added 18 commits May 21, 2026 10:25
Add Dataset Management SDK support with:
- DatasetClient: pass-through client for all 11 dataset APIs
  (create/get/list/update/delete datasets, versions, examples)
  with 6 _and_wait helpers for async operations
- ServiceDatasetProvider: fetches datasets from the service and
  returns SDK Dataset objects compatible with OnDemandEvaluationDatasetRunner
  and BatchEvaluationRunner
- Unit tests (20 tests) and integration tests (17 tests, verified against
  live AWS)
Switch ServiceDatasetProvider from list_dataset_examples pagination
to downloading the JSONL file via the presigned downloadUrl from
GetDataset. Single HTTP request is simpler and faster for large datasets.
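
A rough sketch of turning the downloaded JSONL payload into records, one per line. The helper name `iter_jsonl_records` is an assumption for illustration. It accepts any iterable of byte lines, which is what `requests.Response.iter_lines()` yields by default, so it works both for a fully downloaded body and for a streamed one.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_jsonl_records(lines: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-blank line (hypothetical helper)."""
    for line_number, raw in enumerate(lines, start=1):
        # Decode explicitly as UTF-8 instead of relying on a guessed charset.
        text = raw.decode("utf-8").strip()
        if not text:
            continue  # tolerate blank lines and keep-alive chunks
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
        yield record
```

With `requests`, the input could come from `requests.get(url, stream=True, timeout=60)` followed by `response.iter_lines()`, where `url` is the presigned download URL. Because the function is a generator, only one line is held in memory at a time.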
- helpers.py: get_or_create_agent_runtime(), make_agent_invoker()
  with retry logic and warmup for cold start handling
- test_runners_with_service_dataset.py: OnDemandRunner + ServiceDatasetProvider
  end-to-end test (skipped until a working deployed agent is available —
  current account has 30s init timeout that prevents cold starts)
Update test_runners_with_service_dataset.py to use env-var config:
- INTEG_AGENT_RUNTIME_ARN: skips if not set
- BEDROCK_TEST_REGION: region matching the agent
- Verified end-to-end: ServiceDatasetProvider → OnDemandRunner → real agent invocation → COMPLETED
- ServiceDatasetProvider: accept client in __init__ (eliminates region_name)
- ServiceDatasetProvider: validate schemaType against supported runner schemas
- ServiceDatasetProvider: proper error message on download failure
- Remove helpers.py (not needed)
- Add unit tests for unsupported schema and download failure cases
- ServiceDatasetProvider: import DatasetClient at top, default in __init__
- ScenarioExecutor: add schema_type field, override in Predefined/Simulated
- ServiceDatasetProvider: collect supported schemas dynamically from executors
- delete_dataset_and_wait: add DELETE_FAILED as failed status
…dule level

- Add schema_type field to Scenario base, PredefinedScenario, SimulatedScenario
- Remove schema_type from ScenarioExecutor (doesn't belong there)
- Move _parse_scenario from FileDatasetProvider to module-level function
- SUPPORTED_SCHEMA_TYPES derived from Scenario classes directly
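
One way to picture this refactor, as a sketch only: each scenario class carries its own `schema_type`, the supported set is derived from the classes, and a module-level parser dispatches on it. Plain dataclasses are used here to keep the example dependency-free. The field names (`scenario_id`, `turns`, `goal`), the schema type strings, and the public name `parse_scenario` are all assumptions, not the definitions in this PR.

```python
from dataclasses import dataclass
from typing import Any, ClassVar, Dict, List


@dataclass
class Scenario:
    scenario_id: str
    # Each subclass overrides this class-level tag (values are hypothetical).
    schema_type: ClassVar[str] = ""


@dataclass
class PredefinedScenario(Scenario):
    turns: List[str]
    schema_type: ClassVar[str] = "PREDEFINED"


@dataclass
class SimulatedScenario(Scenario):
    goal: str
    schema_type: ClassVar[str] = "SIMULATED"


# Derived from the classes themselves, so adding a Scenario subclass to this
# tuple is the only change needed to support a new schema type.
_SCENARIO_CLASSES = {cls.schema_type: cls for cls in (PredefinedScenario, SimulatedScenario)}
SUPPORTED_SCHEMA_TYPES = frozenset(_SCENARIO_CLASSES)


def parse_scenario(raw: Dict[str, Any], schema_type: str) -> Scenario:
    """Module-level parser shared by file- and service-backed providers."""
    try:
        cls = _SCENARIO_CLASSES[schema_type]
    except KeyError:
        raise ValueError(
            f"unsupported schema type {schema_type!r}; "
            f"expected one of {sorted(SUPPORTED_SCHEMA_TYPES)}"
        ) from None
    return cls(**raw)
```

Keeping the tag on the scenario class rather than on the executor means a provider can validate a dataset's schema type without importing any executor code.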
- Add timeout=60 to requests.get() for presigned URL download (#1)
- Use r.content.decode("utf-8") instead of r.text for explicit encoding (#3)
- Replace model_fields introspection with plain constant set (#8)
- Guard __getattr__ against recursion when _cp_client not initialized (#5)
- Remove dead _mock_client function from tests (aws#11)
- Stream JSONL via iter_lines() instead of loading entire file into memory (#2)
- Consolidate repetitive test mock setup with pytest fixtures (aws#12)
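
For context on the `__getattr__` recursion guard mentioned in this list, a minimal sketch of the pattern. The class name `PassThroughClient` is hypothetical. Only the `_cp_client` attribute name comes from the commit message.

```python
class PassThroughClient:
    """Forwards unknown attributes to a wrapped client (illustrative only)."""

    def __init__(self, cp_client):
        self._cp_client = cp_client

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails. If
        # _cp_client itself is missing (for example when __init__ never ran,
        # as during copy or unpickling), reading self._cp_client below would
        # call __getattr__ again and recurse until RecursionError.
        if name == "_cp_client":
            raise AttributeError(name)
        return getattr(self._cp_client, name)
```

With the guard in place, an uninitialized instance fails fast with `AttributeError` rather than `RecursionError`, which is what `copy` and `pickle` expect when they probe for optional attributes.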
…entServiceProvider

Addresses PR review feedback: the name makes explicit which service
the provider loads datasets from (Dataset Management Service).
Dispatch on file extension: paths ending in .jsonl are read line-by-line
(one scenario per line). All other paths keep the existing
{"scenarios": [...]} JSON shape.

Adds 8 unit tests covering predefined/simulated/mixed JSONL content,
blank-line tolerance, malformed lines, and extension dispatch.
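
The extension dispatch described above could be sketched roughly as follows. The function name `load_raw_scenarios` is an assumption, and the sketch returns plain dicts rather than parsed Scenario objects to stay self-contained.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_raw_scenarios(path: str) -> List[Dict[str, Any]]:
    """Load scenario dicts from a .jsonl file or a {"scenarios": [...]} JSON file."""
    file_path = Path(path)
    if file_path.suffix == ".jsonl":
        scenarios = []
        with file_path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # blank lines are tolerated
                try:
                    scenarios.append(json.loads(line))
                except json.JSONDecodeError as exc:
                    raise ValueError(
                        f"{path}:{line_number}: malformed JSONL line: {exc}"
                    ) from exc
        return scenarios
    # Every other extension keeps the original single-document JSON shape.
    with file_path.open(encoding="utf-8") as handle:
        return json.load(handle)["scenarios"]
```

Reporting the line number on a malformed line makes large JSONL files much easier to debug than a bare `JSONDecodeError`.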
…d_wait

A version-specific delete (datasetVersion provided) does not remove the
dataset itself — it transitions the dataset to UPDATING and back to ACTIVE.
The previous waiter polled for ResourceNotFoundException and timed out.

Branch on whether datasetVersion is passed:
- Without datasetVersion: poll until ResourceNotFoundException; fail on DELETE_FAILED
- With datasetVersion: poll until ACTIVE; fail on UPDATE_FAILED; return the dataset dict

Add unit tests for both version-delete paths (success + UPDATE_FAILED)
and an integ test that creates two versions, deletes the oldest via
delete_dataset_and_wait, and verifies the dataset stays ACTIVE.
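
The two waiter branches can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: `ResourceNotFoundError` stands in for the service's ResourceNotFoundException, the `"status"` response key is assumed, and `get_dataset` is a caller-supplied callable so the example runs without AWS access.

```python
import time
from typing import Any, Callable, Dict, Optional


class ResourceNotFoundError(Exception):
    """Hypothetical stand-in for the service's ResourceNotFoundException."""


class WaitFailedError(RuntimeError):
    """Raised when the dataset reaches a terminal failure status."""


def wait_after_delete(
    get_dataset: Callable[[], Dict[str, Any]],
    version_delete: bool,
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Dict[str, Any]]:
    """Poll until a delete settles.

    Full delete: returns None once the dataset is gone.
    Version delete: returns the dataset dict once it is ACTIVE again.
    """
    failed_status = "UPDATE_FAILED" if version_delete else "DELETE_FAILED"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            dataset = get_dataset()
        except ResourceNotFoundError:
            if not version_delete:
                return None  # the whole dataset was removed, which is success
            raise  # a version delete should never remove the dataset itself
        status = dataset.get("status")  # assumed response key
        if status == failed_status:
            raise WaitFailedError(f"dataset entered {failed_status}")
        if version_delete and status == "ACTIVE":
            return dataset
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dataset still {status!r} after {timeout_s}s")
        sleep(poll_interval_s)
```

Injecting `sleep` keeps unit tests fast, since a test can pass a no-op and drive the loop with a scripted sequence of responses.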
@jariy17 jariy17 requested a review from a team May 21, 2026 14:26
@codecov-commenter

Codecov Report

❌ Patch coverage is 97.11538% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@c311682). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| src/bedrock_agentcore/evaluation/dataset_client.py | 96.36% | 1 Missing and 1 partial ⚠️ |
| ...k_agentcore/evaluation/runner/dataset_providers.py | 97.77% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #490   +/-   ##
=======================================
  Coverage        ?   89.49%           
=======================================
  Files           ?       84           
  Lines           ?     7732           
  Branches        ?     1157           
=======================================
  Hits            ?     6920           
  Misses          ?      515           
  Partials        ?      297           
| Flag | Coverage Δ |
| unittests | 89.49% <97.11%> (?) |


@jariy17
Contributor Author

jariy17 commented May 21, 2026

Re-opening as an in-repo PR to fix the fork-PR token permission issue (breaking-change check failed only on the comment-post step due to GitHub's read-only token policy for fork PRs).
