feat(evaluation): add DatasetClient and dataset management service provider #491

Merged
jariy17 merged 19 commits into main from feat/dataset-management-sdk on May 22, 2026

Conversation

Contributor

@jariy17 jariy17 commented May 21, 2026

Summary

  • Adds DatasetClient, a high-level wrapper for Bedrock AgentCore dataset management operations (create/get/list/delete datasets and versions, upload/download via presigned URLs, polling helpers).
  • Adds DatasetManagementServiceProvider (formerly ServiceDatasetProvider) that loads evaluation scenarios from a managed dataset, alongside the existing FileDatasetProvider which now also supports JSONL files.
  • Refactors Scenario classes so each owns its schema_type, and moves _parse_scenario to module level for reuse between file- and service-backed providers.
  • Streams JSONL downloads, extracts region from agent ARN (no separate BEDROCK_TEST_REGION), and consolidates test fixtures.
  • Adds unit and integration tests for the new client and provider, plus a runner integration test that exercises a real agent.
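For reviewers who want to see what "extracts region from agent ARN" amounts to, here is a minimal sketch. It relies only on the standard ARN layout (`arn:partition:service:region:account-id:resource`); the function name `region_from_arn` and the example ARN are illustrative assumptions, not necessarily what this PR uses.

```python
def region_from_arn(arn: str) -> str:
    """Return the region field of an AWS ARN.

    Hypothetical helper for illustration; the PR's own helper may differ.
    ARN layout: arn:partition:service:region:account-id:resource
    """
    # Split at most 5 times so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    region = parts[3]
    if not region:
        # Some global services leave the region field empty.
        raise ValueError(f"ARN has no region component: {arn!r}")
    return region
```

Deriving the region this way means the integration tests need only the agent ARN, so the ARN and the region cannot drift apart.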

Test plan

  • uv run pytest tests/bedrock_agentcore/evaluation passes
  • uv run pytest tests_integ/evaluation passes against a configured AWS account
  • Lint/format checks pass (ruff, line-length)
  • Manual: create dataset → upload JSONL → run evaluation via DatasetManagementServiceProvider end-to-end

jariy17 added 18 commits May 21, 2026 10:25
Add Dataset Management SDK support with:
- DatasetClient: pass-through client for all 11 dataset APIs
  (create/get/list/update/delete datasets, versions, examples)
  with 6 _and_wait helpers for async operations
- ServiceDatasetProvider: fetches datasets from the service and
  returns SDK Dataset objects compatible with OnDemandEvaluationDatasetRunner
  and BatchEvaluationRunner
- Unit tests (20 tests) and integration tests (17 tests, verified against
  live AWS)
Switch ServiceDatasetProvider from list_dataset_examples pagination
to downloading the JSONL file via the presigned downloadUrl from
GetDataset. A single HTTP request is simpler and faster for large datasets.
- helpers.py: get_or_create_agent_runtime(), make_agent_invoker()
  with retry logic and warmup for cold start handling
- test_runners_with_service_dataset.py: OnDemandRunner + ServiceDatasetProvider
  end-to-end test (skipped until a working deployed agent is available —
  current account has 30s init timeout that prevents cold starts)
Update test_runners_with_service_dataset.py to use env-var config:
- INTEG_AGENT_RUNTIME_ARN: skips if not set
- BEDROCK_TEST_REGION: region matching the agent
- Verified end-to-end: ServiceDatasetProvider → OnDemandRunner → real agent invocation → COMPLETED
- ServiceDatasetProvider: accept client in __init__ (eliminates region_name)
- ServiceDatasetProvider: validate schemaType against supported runner schemas
- ServiceDatasetProvider: proper error message on download failure
- Remove helpers.py (not needed)
- Add unit tests for unsupported schema and download failure cases
- ServiceDatasetProvider: import DatasetClient at top, default in __init__
- ScenarioExecutor: add schema_type field, override in Predefined/Simulated
- ServiceDatasetProvider: collect supported schemas dynamically from executors
- delete_dataset_and_wait: add DELETE_FAILED as failed status
…dule level

- Add schema_type field to Scenario base, PredefinedScenario, SimulatedScenario
- Remove schema_type from ScenarioExecutor (doesn't belong there)
- Move _parse_scenario from FileDatasetProvider to module-level function
- SUPPORTED_SCHEMA_TYPES derived from Scenario classes directly
- Add timeout=60 to requests.get() for presigned URL download (#1)
- Use r.content.decode("utf-8") instead of r.text for explicit encoding (#3)
- Replace model_fields introspection with plain constant set (#8)
- Guard __getattr__ against recursion when _cp_client not initialized (#5)
- Remove dead _mock_client function from tests (#11)
- Stream JSONL via iter_lines() instead of loading entire file into memory (#2)
- Consolidate repetitive test mock setup with pytest fixtures (#12)
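The `__getattr__` recursion guard mentioned in the list above can be sketched as follows. This is an illustration of the general pattern, not the PR's code: `PassThroughClient` is a hypothetical stand-in, and only the attribute name `_cp_client` comes from the commit message.

```python
class PassThroughClient:
    """Hypothetical wrapper that forwards unknown attributes to an inner client."""

    def __init__(self, cp_client):
        self._cp_client = cp_client

    def __getattr__(self, name):
        # __getattr__ is called only when normal attribute lookup fails.
        # If _cp_client itself is missing (for example, __init__ raised, or
        # the instance was created via __new__ during copy/unpickle), reading
        # self._cp_client below would re-enter __getattr__ without end.
        if name == "_cp_client":
            raise AttributeError(name)
        # Forward everything else to the wrapped client.
        return getattr(self._cp_client, name)
```

With the guard in place, a half-initialized instance raises a plain `AttributeError` instead of a `RecursionError`, which is what `copy` and `pickle` expect when they probe for optional attributes.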
…entServiceProvider

Addresses PR review feedback: the name makes explicit which service
the provider loads datasets from (Dataset Management Service).
Dispatch on file extension: paths ending in .jsonl are read line-by-line
(one scenario per line). All other paths keep the existing
{"scenarios": [...]} JSON shape.

Adds 8 unit tests covering predefined/simulated/mixed JSONL content,
blank-line tolerance, malformed lines, and extension dispatch.
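A rough sketch of the extension dispatch described above, for readers skimming the commit list. The function name `load_raw_scenarios` is hypothetical, and raising `ValueError` on a malformed line is an assumption; the commit only says malformed lines are covered by tests, not how they are handled.

```python
import json
from typing import Any


def load_raw_scenarios(path: str) -> list[dict[str, Any]]:
    """Load raw scenario dicts from a .jsonl or {"scenarios": [...]} JSON file.

    Hypothetical helper for illustration only.
    """
    if path.endswith(".jsonl"):
        scenarios = []
        with open(path, encoding="utf-8") as f:
            # One scenario per line; blank lines are skipped.
            for line_no, line in enumerate(f, start=1):
                if not line.strip():
                    continue
                try:
                    scenarios.append(json.loads(line))
                except json.JSONDecodeError as exc:
                    # Assumption: report the offending line rather than skip it.
                    raise ValueError(f"{path}:{line_no}: invalid JSON: {exc.msg}") from exc
        return scenarios
    # Every other path keeps the existing {"scenarios": [...]} shape.
    with open(path, encoding="utf-8") as f:
        return json.load(f)["scenarios"]
```

Each returned dict would then go through the module-level `_parse_scenario`, so file-backed and service-backed providers share one parsing path.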
…d_wait

A version-specific delete (datasetVersion provided) does not remove the
dataset itself — it transitions the dataset to UPDATING and back to ACTIVE.
The previous waiter polled for ResourceNotFoundException and timed out.

Branch on whether datasetVersion is passed:
- Without datasetVersion: poll until ResourceNotFoundException (fail on DELETE_FAILED)
- With datasetVersion: poll until ACTIVE (fail on UPDATE_FAILED), then return the dataset dict

Add unit tests for both version-delete paths (success + UPDATE_FAILED)
and an integ test that creates two versions, deletes the oldest via
delete_dataset_and_wait, and verifies the dataset stays ACTIVE.
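The branching waiter described in this commit could look roughly like the sketch below. It is not the PR's implementation. `wait_after_delete`, the `get_dataset` callable, the `"status"` key, the local `ResourceNotFound` exception (standing in for the service's ResourceNotFoundException), and the polling parameters are all assumptions made to keep the example self-contained.

```python
import time


class ResourceNotFound(Exception):
    """Local stand-in for the service's ResourceNotFoundException (assumption)."""


def wait_after_delete(get_dataset, dataset_version=None,
                      poll_interval=1.0, max_attempts=30):
    """Poll after a delete request until it finishes or fails.

    get_dataset: zero-argument callable returning a dict with a "status" key
    (assumed shape) or raising ResourceNotFound once the dataset is gone.
    """
    version_delete = dataset_version is not None
    failed_status = "UPDATE_FAILED" if version_delete else "DELETE_FAILED"
    for _ in range(max_attempts):
        try:
            dataset = get_dataset()
        except ResourceNotFound:
            if not version_delete:
                return None  # whole-dataset delete: "not found" means success
            raise  # version delete: the dataset itself should still exist
        status = dataset.get("status")
        if status == failed_status:
            raise RuntimeError(f"delete failed with status {status}")
        if version_delete and status == "ACTIVE":
            return dataset  # back to ACTIVE after the UPDATING transition
        time.sleep(poll_interval)
    raise TimeoutError("timed out waiting for delete to finish")
```

The key point is that success means something different in each branch: "not found" for a whole-dataset delete, and ACTIVE for a version-only delete, which is why the earlier single-path waiter timed out.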
@jariy17 jariy17 requested a review from a team May 21, 2026 14:38
Contributor

github-actions Bot commented May 21, 2026

✅ No Breaking Changes Detected

No public API breaking changes found in this PR.

Contributor

@Hweinstock Hweinstock left a comment

LGTM!

Comment thread src/bedrock_agentcore/evaluation/runner/dataset_providers.py
@jariy17 jariy17 enabled auto-merge (squash) May 22, 2026 13:41

2 participants