
Add waterdata.get_samples_summary for per-location sample inventory #262

Merged
thodson-usgs merged 4 commits into DOI-USGS:main from thodson-usgs:add-samples-summary
May 5, 2026

Conversation

@thodson-usgs
Collaborator

Closes #261.

Summary

Adds waterdata.get_samples_summary(monitoringLocationIdentifier=...) — a wrapper around the Samples database /summary/{monitoringLocationIdentifier} endpoint. The endpoint returns one row per (characteristic group, characteristic, user-supplied characteristic) combination with result and activity counts plus first / most recent activity dates, which makes it convenient for taking inventory of what discrete-sample data exists at a site before pulling the underlying observations with get_samples.

This mirrors the R package's summarize_waterdata_samples (read_waterdata_samples.R) feature requested in #261.

The Samples summary endpoint accepts only a single monitoring location per request, so the parameter is typed as str (not str | list[str]).
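
Because the endpoint takes one location per request, a caller who wants an inventory for several sites has to loop. A minimal sketch of that pattern follows; `summarize_sites` and its `fetch` parameter are hypothetical names used for illustration and are not part of `dataretrieval`.

```python
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def summarize_sites(site_ids: Iterable[str], fetch: Callable[[str], T]) -> Dict[str, T]:
    """Call `fetch` once per site and return the results keyed by site ID.

    `fetch` is any callable that takes a single identifier string. It is
    passed in so the loop can be exercised without network access.
    """
    if isinstance(site_ids, str):
        # A bare string would otherwise be iterated character by character.
        site_ids = [site_ids]
    return {site_id: fetch(site_id) for site_id in site_ids}
```

With the function added in this PR, `fetch` could be something like `lambda s: get_samples_summary(monitoringLocationIdentifier=s)[0]`, which would give one DataFrame per site.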

Live API example

from dataretrieval.waterdata import get_samples_summary

df, md = get_samples_summary(monitoringLocationIdentifier="USGS-04183500")

print(md.url)
# https://api.waterdata.usgs.gov/samples-data/summary/USGS-04183500?mimeType=text%2Fcsv

print(df.columns.tolist())
# ['monitoringLocationIdentifier', 'characteristicGroup', 'characteristic',
#  'characteristicUserSupplied', 'resultCount', 'activityCount',
#  'firstActivity', 'mostRecentActivity']

print(len(df))
# 110

Test plan

  • New test_mock_get_samples_summary covers the happy path against a recorded response (tests/data/samples_summary.txt): URL composition, column names, single-location filter.
  • Live verification: get_samples_summary(monitoringLocationIdentifier="USGS-04183500") returns 110 rows with the expected schema.
  • Full tests/waterdata_test.py suite (27 tests) passes.
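
One way to check column names without making the test brittle is a subset comparison: the expected columns must be present, but extra columns are tolerated. The sketch below uses only the standard library and the column names from the live example above; `missing_columns` is a hypothetical helper, and the actual test in `tests/waterdata_test.py` may be written differently.

```python
import csv
import io

# Column names as printed in the PR description's live example.
EXPECTED_COLUMNS = {
    "monitoringLocationIdentifier",
    "characteristicGroup",
    "characteristic",
    "characteristicUserSupplied",
    "resultCount",
    "activityCount",
    "firstActivity",
    "mostRecentActivity",
}


def missing_columns(csv_text: str) -> set:
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # fieldnames is None for empty input, so fall back to an empty list.
    return EXPECTED_COLUMNS - set(reader.fieldnames or [])
```

A test would then assert `missing_columns(response_text) == set()`, which keeps passing if the server later adds a column.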

thodson-usgs and others added 2 commits May 5, 2026 12:59
Wraps the Samples database /summary/{monitoringLocationIdentifier}
endpoint, mirroring the R package's summarize_waterdata_samples. Returns
per-characteristic result and activity counts plus first / most recent
activity dates for a single monitoring location — useful for taking
inventory of what discrete-sample data exists at a site before pulling
observations with get_samples.

The Samples summary endpoint accepts only a single monitoring location
per request, so the function takes a string (not a list).

Closes DOI-USGS#261.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- URL-encode the path-segment monitoringLocationIdentifier so values
  containing /, ?, # or whitespace cannot break URL composition.
- Log the resolved request URL via PreparedRequest, matching get_samples.
- Loosen the test column assertion from exact-list to subset so a
  non-breaking server-side column addition does not flake the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a new public waterdata.get_samples_summary() helper to the waterdata module so users can inspect the discrete-sample inventory available for a single monitoring location before requesting full sample records. It fits the module’s role as the Python wrapper around modern USGS Water Data APIs and mirrors the corresponding R-package capability requested in #261.

Changes:

  • Added waterdata.get_samples_summary(monitoringLocationIdentifier=...) to wrap the Samples /summary/{monitoringLocationIdentifier} CSV endpoint.
  • Exported the new helper from dataretrieval.waterdata and added a recorded-response unit test plus fixture data.
  • Documented the new API addition in NEWS.md.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| dataretrieval/waterdata/api.py | Adds the new Samples summary API wrapper and docstring. |
| dataretrieval/waterdata/__init__.py | Re-exports the new helper as part of the public waterdata API. |
| tests/waterdata_test.py | Adds a mock-based test for URL composition, metadata, and returned columns. |
| tests/data/samples_summary.txt | Provides recorded CSV fixture data for the new test. |
| NEWS.md | Announces the new get_samples_summary capability for users. |

Comment thread dataretrieval/waterdata/api.py
Comment thread tests/waterdata_test.py
thodson-usgs and others added 2 commits May 5, 2026 13:10
…ng for this endpoint

Adapted the wording from R's summarize_waterdata_samples (in the develop
branch of DOI-USGS/dataRetrieval) to match the Python module's docstring
style. Picked up the variety-of-agencies example IDs from the R doc.

Two claims from the R doc were corrected rather than copied:

- The R doc says "Location identifiers should be separated with commas"
  with a multi-ID example. That contradicts the function's own one-site
  check and is wrong for the summary service (which accepts exactly one
  ID). Dropped.
- The R doc says "Location numbers without an agency prefix are assumed
  to have the prefix USGS." That's not true for this endpoint at the API
  level — bare IDs return an empty result with a different column shape.
  Documented the actual behavior instead.

Also switched the example to USGS-04074950 (the site used by the R doc's
example) so the two repos line up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
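
Since bare IDs are not given a USGS prefix by this endpoint, a caller may want to check for a prefix before making a request. The sketch below assumes identifiers follow the AGENCY-ID shape seen in the examples in this PR (for instance USGS-04074950); `has_agency_prefix` is a hypothetical helper, not something the PR adds, and the real identifier rules may be broader than this check.

```python
def has_agency_prefix(site_id: str) -> bool:
    """Return True if `site_id` looks like AGENCY-ID (assumed format)."""
    # partition splits on the first hyphen; all three parts must be non-empty.
    agency, sep, local_id = site_id.partition("-")
    return bool(agency and sep and local_id)
```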
- Reject non-str monitoringLocationIdentifier with a TypeError that
  explains the constraint, instead of letting urllib.parse.quote raise
  a low-level TypeError. This matches R's summarize_waterdata_samples,
  which guards with `if (length(monitoringLocationIdentifier) > 1) stop(...)`.
- Restore characteristicUserSupplied in the column-subset assertion;
  /simplify's "loosen exact-list to subset" was applied too aggressively
  and dropped a real schema column that disambiguates grouping.
- Add a regression test that a list input raises the new TypeError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs requested a review from ldecicco-USGS May 5, 2026 18:16
Collaborator

@ldecicco-USGS ldecicco-USGS left a comment

Looks good. I'm not entirely sure I follow the "quote" discussion, but it seems to imply that it's the easiest way to test for one monitoring location ID.

@thodson-usgs
Collaborator Author

@ldecicco-USGS Thanks for the review! Quick clarification on the quote discussion since the Copilot phrasing was a bit indirect: urllib.parse.quote(monitoringLocationIdentifier, safe='') is just URL-path-segment escaping — it percent-encodes any /, ?, #, or whitespace so user input can't break URL composition. It doesn't validate the shape of the input. The single-site enforcement is the separate isinstance(..., str) guard a few lines above, which raises TypeError("...accepts exactly one monitoring location per request...") for a list (the symptom Copilot was actually pointing at). The two are independent — quote is for URL safety, the type check is for the API constraint.
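
The two independent steps described above can be sketched as follows. This is an illustration rather than the merged code: `build_summary_url` is a hypothetical name, the error message is paraphrased, and the base URL is taken from the request URL shown in the PR description.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the PR description's example output.
BASE_URL = "https://api.waterdata.usgs.gov/samples-data/summary"


def build_summary_url(monitoringLocationIdentifier: str) -> str:
    """Compose the summary URL for one site (hypothetical helper)."""
    # API constraint: exactly one location per request, so reject lists and
    # other non-str inputs before they reach quote().
    if not isinstance(monitoringLocationIdentifier, str):
        raise TypeError(
            "the summary endpoint accepts exactly one monitoring location "
            "per request; pass a single str"
        )
    # URL safety: safe="" also percent-encodes "/", which quote() would
    # otherwise leave untouched.
    segment = quote(monitoringLocationIdentifier, safe="")
    query = urlencode({"mimeType": "text/csv"})
    return f"{BASE_URL}/{segment}?{query}"
```

For a typical ID such as USGS-04183500 the quoting step changes nothing, since letters, digits, and hyphens are never escaped. It only matters for input that contains reserved characters.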

@thodson-usgs thodson-usgs merged commit 6df40f5 into DOI-USGS:main May 5, 2026
8 checks passed

Development

Successfully merging this pull request may close these issues.

Sample summary information

3 participants