Update the upload benchmark workflow to work with the BenchmarkLauncher #584

@R-Palazzo

Problem Description

Once #570 is resolved, we will have all the functionality needed to launch, manage, and terminate benchmarks. The goal of this issue is to update the logic for uploading internal benchmark results.

One of the main problems we want to address is the case where some instances run into Out Of Memory errors. Right now, when that happens, we have to manually add results for the missing jobs in order for the workflow to succeed. We want to remove this manual step and make sure benchmark results are uploaded after a defined amount of time, even if some instances did not finish or failed unexpectedly.

Expected behavior

  • When launching our internal benchmarks, we should save the BenchmarkLauncher instance used to launch the benchmark.
  • Update the logic used by the upload benchmark workflow. The idea is:
    • Load the saved BenchmarkConfig daily.
    • Check the status of each instance (Running / Completed / Stopped).
    • If all instances have completed, upload the results.
    • Otherwise, if some instances are still running but we have reached the deadline for uploading the results (for example, the timeout plus 1 extra day), stop all remaining instances. Then update the results files: for the missing jobs, the value in the Error column must be Instance Error.
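
To make the daily check above concrete, here is a minimal sketch of the decision logic and of how the missing jobs could be filled in. It is an illustration only, not the BenchmarkLauncher API: `Instance`, `Status`, `decide_action`, `fill_missing_jobs`, the `jobs` attribute, and the `Synthesizer` / `Dataset` column names are all hypothetical. Only the three statuses, the "timeout + 1 extra day" deadline, and the `Instance Error` value in the Error column come from this issue. The sketch also assumes that we keep waiting until the deadline whenever at least one instance has not completed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    RUNNING = "Running"
    COMPLETED = "Completed"
    STOPPED = "Stopped"


@dataclass
class Instance:
    """Hypothetical stand-in for one launched benchmark instance."""

    name: str
    status: Status
    # (synthesizer, dataset) pairs this instance was asked to run (assumed layout).
    jobs: list = field(default_factory=list)


def decide_action(instances, launched_at, timeout, now, grace=timedelta(days=1)):
    """Return 'upload', 'stop_and_upload', or 'wait' for the daily check."""
    if all(inst.status is Status.COMPLETED for inst in instances):
        return "upload"
    # Deadline = launch time + benchmark timeout + extra grace period.
    if now >= launched_at + timeout + grace:
        return "stop_and_upload"
    return "wait"


def fill_missing_jobs(results, instances):
    """Return the result rows plus one 'Instance Error' row per missing job."""
    done = {(row["Synthesizer"], row["Dataset"]) for row in results}
    filled = list(results)
    for inst in instances:
        for synthesizer, dataset in inst.jobs:
            if (synthesizer, dataset) not in done:
                filled.append(
                    {
                        "Synthesizer": synthesizer,
                        "Dataset": dataset,
                        "Error": "Instance Error",
                    }
                )
    return filled


if __name__ == "__main__":
    instances = [
        Instance("instance-a", Status.COMPLETED, [("SynthA", "dataset_1")]),
        Instance("instance-b", Status.RUNNING, [("SynthA", "dataset_2")]),
    ]
    action = decide_action(
        instances,
        launched_at=datetime(2025, 1, 1),
        timeout=timedelta(days=2),
        now=datetime(2025, 1, 4),
    )
    print(action)
    rows = [{"Synthesizer": "SynthA", "Dataset": "dataset_1", "Error": None}]
    print(fill_missing_jobs(rows, instances))
```

In the real workflow, the `stop_and_upload` branch would first terminate the remaining instances through whatever the launcher exposes for that, and only then patch the results files and upload them.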

Additional context

In the future, we might consider saving the logs when deleting the remaining instances so we can parse them and identify the exact error that killed the kernel or disrupted the run. This is out of scope for this issue.

Labels

feature request, internal
