Problem Description
Once #570 is resolved, we will have all the functionality needed to launch, manage, and terminate benchmarks. The goal of this issue is to update the logic for uploading internal benchmark results.
One of the main problems we want to address is the case where some instances run into out-of-memory (OOM) errors. Right now, when that happens, we have to manually add results for the missing jobs in order for the workflow to succeed. We want to remove this manual step and make sure benchmark results are uploaded after a defined amount of time, even if some instances did not finish or failed unexpectedly.
Expected behavior
- When launching our internal benchmarks, we should save the BenchmarkLauncher instance used to launch the benchmark.
- Update the logic used by the upload benchmark workflow. The idea is:
  - Load the saved BenchmarkConfig daily.
  - Check the status of each instance (Running / Completed / Stopped).
  - If all instances have completed, upload the results.
  - Otherwise, if some instances are still running but we have reached the deadline for uploading the results (for example, the timeout plus 1 extra day), stop all remaining instances. Update the results files and, for the missing jobs, set the Error column to Instance Error.
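The daily check described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: `Status`, `Instance`, `daily_check`, the `grace` parameter, and the dictionary-based results layout are all hypothetical names introduced here, and the real workflow would call whatever API #570 provides to query and stop instances. The sketch also assumes that results are uploaded once the deadline has passed, and that the workflow simply waits when the deadline has not been reached yet.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    RUNNING = "Running"
    COMPLETED = "Completed"
    STOPPED = "Stopped"


@dataclass
class Instance:
    # Hypothetical stand-in for one benchmark instance and the jobs assigned to it.
    name: str
    status: Status
    jobs: list


def daily_check(instances, results, launched_at, timeout, now, grace=timedelta(days=1)):
    """Decide what the upload workflow should do on a given day.

    Returns an (action, results) pair where action is "upload" or "wait".
    `results` maps a job identifier to its result row (a dict).
    """
    # Case 1: every instance finished, so the results can be uploaded as-is.
    if all(inst.status is Status.COMPLETED for inst in instances):
        return "upload", results

    # Case 2: deadline (timeout plus grace period) not reached yet, check again tomorrow.
    if now < launched_at + timeout + grace:
        return "wait", results

    # Case 3: deadline reached. Stop what is still running and fill in missing jobs.
    filled = dict(results)
    for inst in instances:
        if inst.status is Status.RUNNING:
            # Assumption: the real code would terminate the instance here.
            inst.status = Status.STOPPED
        for job in inst.jobs:
            # Only jobs without a result row get the placeholder error.
            filled.setdefault(job, {"Error": "Instance Error"})
    return "upload", filled
```

For example, with a two-day timeout, a run launched on day 0 with one instance still running would return "wait" on day 1 and "upload" on day 3, with the unfinished jobs marked as Instance Error.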
Additional context
In the future, we might consider saving the logs when deleting the remaining instances so we can parse them and identify the exact error that killed the kernel or disrupted the run. This is out of scope for this issue.