Skip to content

Storage - Fix Flaky Stress Tests#48359

Open
browndav-msft wants to merge 11 commits intoAzure:mainfrom
browndav-msft:fix-flaky-tests
Open

Storage - Fix Flaky Stress Tests#48359
browndav-msft wants to merge 11 commits intoAzure:mainfrom
browndav-msft:fix-flaky-tests

Conversation

@browndav-msft
Copy link
Member

This is a fix to fix the issues we've been having with the stress tests.

- read functions had FAIL_FAST which would throw an error when  the stream had reached then end and we wanted to read from the stream again. So we removed from  both reads.
- refactor code so that the exit criteria is a tthe beginning
- refactor the emitContentInfo for dry
- changed emitValue to tryEmitValue
- remove Sinks.EmitFailureHandler.FAIL_FAST so that multiple closes does not cause an error to be thrown
- opentelemetry-runtime-telemetry-java8 from 2.24.0-alpha -> 2.15.0-alpha
- opentelemetry-logback-appender-1.0 from 2.24.0-alpha -> 2.15.0-alpha
@github-actions github-actions bot added Azure.Core azure-core Storage Storage Service (Queues, Blobs, Files) labels Mar 10, 2026
@ibrandes ibrandes marked this pull request as ready for review March 10, 2026 23:35
Copilot AI review requested due to automatic review settings March 10, 2026 23:35
@ibrandes ibrandes changed the title Fix flaky tests Storage - Fix Flaky Stress Tests Mar 10, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to reduce flakiness in Storage stress tests by making cleanup more resilient, making CRC telemetry streams tolerate re-subscription/double-close behaviors, and aligning dependencies with the chosen OpenTelemetry runtime metrics version.

Changes:

  • Replace unconditional deletes with deleteIfExists() across multiple stress scenarios to avoid cleanup failures when resources are already gone.
  • Add retry/timeout-based global cleanup logic in scenario base classes and add retry logic to async runs.
  • Adjust CRC stream emission behavior to avoid failures on repeated terminal events; downgrade OpenTelemetry instrumentation dependencies to 2.15.0-alpha.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 20 comments.

Show a summary per file
File Description
sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/TelemetryHelper.java Adjust JVM runtime metrics registration and make timeout/cancellation detection null-safe.
sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcOutputStream.java Switch sink emission to tryEmitValue to tolerate double-close.
sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcInputStream.java Refactor EOF emission, add resubscription state reset, and switch to tryEmitValue.
sdk/storage/azure-storage-stress/pom.xml Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha.
sdk/storage/azure-storage-file-share-stress/src/main/java/com/azure/storage/file/share/stress/UploadFromFile.java Use deleteIfExists() during per-test cleanup.
sdk/storage/azure-storage-file-share-stress/src/main/java/com/azure/storage/file/share/stress/ShareScenarioBase.java Add retrying global cleanup + async retry behavior and new logging.
sdk/storage/azure-storage-file-share-stress/pom.xml Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha.
sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/UploadFromFile.java Use deleteIfExists() during per-test cleanup.
sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/Upload.java Use deleteIfExists() during per-test cleanup.
sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/DataLakeScenarioBase.java Add retrying global cleanup + async retry behavior and new logging.
sdk/storage/azure-storage-file-datalake-stress/pom.xml Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/UploadPages.java Use deleteIfExists() and swallow delete errors during cleanup.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/Upload.java Use deleteIfExists() during per-test cleanup.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/StageBlock.java Use deleteIfExists() during per-test cleanup.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/PageBlobScenarioBase.java Add retrying global cleanup + async retry behavior and new logging.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/PageBlobOutputStream.java Use deleteIfExists() and swallow delete errors during cleanup.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/CommitBlockList.java Use deleteIfExists() during per-test cleanup.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlockBlobUpload.java Use deleteIfExists() during per-test cleanup.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlockBlobOutputStream.java Use deleteIfExists() during per-test cleanup.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlobScenarioBase.java Add retrying global cleanup + async retry behavior and structured logging.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/AppendBlock.java Use deleteIfExists() and swallow delete errors during cleanup.
sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/AppendBlobOutputStream.java Use deleteIfExists() during per-test cleanup.
sdk/storage/azure-storage-blob-stress/pom.xml Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha.
sdk/parents/azure-client-sdk-parent/pom.xml Downgrade io.clientcore:linting-extensions used by checkstyle plugin from beta.2 to beta.1.

Comment on lines +117 to +121
private Mono<Void> deleteAllFilesInShare() {
return asyncNoFaultShareClient.getDirectoryClient("").listFilesAndDirectories()
.flatMap(fileRef ->
asyncNoFaultShareClient.getFileClient(fileRef.getName()).delete())
.then()
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanupShareWithRetry/deleteAllFilesInShare currently treats every ShareFileItem as a file (getFileClient(...).delete()) and doesn't handle directories. Since listFilesAndDirectories() returns both files and directories, this will fail for directories and likely prevent share deletion, defeating the retry cleanup intent. Please branch on fileRef.isDirectory() and delete directories (ideally recursively, deleting children before the directory) as well as files.

Copilot uses AI. Check for mistakes.
@browndav-msft browndav-msft requested a review from ibrandes March 11, 2026 17:04
Copy link
Member

@ibrandes ibrandes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i resolved all comments that i don't think need to be addressed right now, but there are still a couple lingering ones (make sure you expand hidden conversations). not sure how important the ones about globalCleanupAsync are, but i think we should address the ones about doOnError and deleteAllFilesInFileSystem, thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Azure.Core azure-core Storage Storage Service (Queues, Blobs, Files)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants