Skip to content

Sbp 364 dataset#51

Open
vtnphan wants to merge 8 commits into
mainfrom
sbp-364-dataset
Open

Sbp 364 dataset#51
vtnphan wants to merge 8 commits into
mainfrom
sbp-364-dataset

Conversation

@vtnphan
Copy link
Copy Markdown
Collaborator

@vtnphan vtnphan commented May 28, 2026

Pull Request

Summary

SBP-XXX Add interaction screening dataset upload endpoint. This introduces a dedicated samplesheet builder and API endpoint for the interaction screening workflow, which requires a structured CSV with sequence file paths and group assignments (query/targetg1/g2).

Changes

  • New endpoint POST /datasets/interaction-screening/upload in routes/workflows.py — accepts a list of sequences with id/group and a runId, creates a named Seqera dataset, and uploads the generated samplesheet
  • New service function upload_interaction_screening_dataset() in services/datasets.py — builds the samplesheet CSV (id, sequence, group, type columns) and uploads it to Seqera; sequence file paths are resolved under /g/data/yz52/sbp-service/input/interaction_screening/<unique-run-path>/; groups map queryg1, targetg2
  • New helper build_unique_dataset_name() in services/datasets.py — generates a timestamped, slug-safe name with a 4-char random suffix; replaces the previous millisecond-timestamp approach
  • New schemas SequenceItem (id: str, group: Literal["query", "target"]) and InteractionScreeningDatasetUploadRequest (sequences, runId) in schemas/workflows.py
  • Breaking DatasetUploadRequest — removed datasetName and datasetDescription fields; dataset naming is now handled automatically by build_unique_dataset_name
  • Breaking create_seqera_dataset() signature changed from (name: str | None, description: str | None) to (name: str = "dataset"); description is no longer sent to the Seqera API
  • Security /datasets/upload now requires require_workflow_execution_role (previously unprotected)
  • Tests added coverage for the new service function, route, and schemas in test_services_datasets.py, test_additional_coverage.py, and test_schemas.py

How to Test

  1. Start the backend locally with valid Seqera env vars (SEQERA_API_URL, SEQERA_ACCESS_TOKEN, WORK_SPACE)

  2. Call the new endpoint:

curl -X POST http://localhost:8000/api/workflows/datasets/interaction-screening/upload \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "runId": "my-test-run",
    "sequences": [
      {"id": "seq_A", "group": "query"},
      {"id": "seq_B", "group": "target"}
    ]
  }'
  1. Verify the response contains a datasetId and "success": true

  2. In Seqera Platform, confirm a dataset was created with a uniquely named samplesheet CSV containing the correct g1/g2 groups and file paths under interaction_screening/

  3. Run tests:

.venv/bin/python -m pytest tests/test_services_datasets.py tests/test_additional_coverage.py tests/test_schemas.py -v

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added or updated documentation where necessary
  • I have run linting and unit tests locally
  • The code follows the project's style guidelines

@vtnphan vtnphan marked this pull request as ready for review May 28, 2026 06:54
@vtnphan vtnphan requested a review from marius-mather May 28, 2026 06:54
Copy link
Copy Markdown
Collaborator

@marius-mather marius-mather left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few suggestions for making the code a bit cleaner

Comment thread app/routes/workflows.py

await asyncio.sleep(2)

sequences = [{"id": s.id, "group": s.group} for s in payload.sequences]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need to convert this to dicts, just keep it as the pydantic type. That will make the type clearer on upload_interaction_screening_dataset too

Comment thread app/routes/workflows.py
dataset = await create_seqera_dataset(
name=payload.datasetName, description=payload.datasetDescription
)
dataset = await create_seqera_dataset()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supply a name explicitly, even if it's just dataset

Comment thread app/routes/workflows.py
@router.post(
"/datasets/upload",
response_model=DatasetUploadResponse,
dependencies=[Depends(require_workflow_execution_role)],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all the endpoints under workflows require the workflow execution role? If so, apply it to the router

Comment thread app/services/datasets.py
return str(value)


def build_unique_dataset_name(name: str) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a docstring with an example of what the name looks like

Comment thread app/services/datasets.py
for s in sequences
]

output = io.StringIO()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good habit to call output.close() after you're done with the file (or you can use with io.iStringIO() as output: to automatically close the file after the with: block

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants