[tx] Tinker CI + bugfixes #1616
Conversation
Code Review
This pull request introduces a custom LoRA loading endpoint `/skyrl/v1/load_lora_adapter` to bypass upstream vLLM bugs and implements a port striding mechanism (`SERVER_PORT_STRIDE`) to prevent port conflicts between inference servers. It also adds CI configuration and E2E test scripts for the Tinker backend. Feedback includes adding a check for LoRA support in the new endpoint to avoid potential attribute errors, and ensuring consistent return formats in the inference client.
```python
models = request.app.state.openai_serving_models
async with models.lora_resolver_lock[lora_name]:
```
The `_skyrl_load_lora_adapter` endpoint assumes that LoRA is enabled on the server. If the server is started without LoRA support (e.g., `--enable-lora` is not passed to vLLM), `models.lora_requests` and `models.lora_resolver_lock` may not be initialized, leading to an `AttributeError` or `TypeError`. A check should be added to ensure LoRA is enabled before accessing these attributes:
```python
if not hasattr(models, "lora_requests") or models.lora_requests is None:
    raise HTTPException(
        status_code=400,
        detail="LoRA is not enabled on this server.",
    )
async with models.lora_resolver_lock[lora_name]:
```

```diff
  body = await resp.json()
  raise_for_status(resp, body)
- return server_url, {"status": resp.status, "body": await resp.text()}
+ return server_url, await resp.json()
```
The return value of `_load_on_server` is inconsistent with other control plane methods in this class (e.g., `_call_server` at line 857), which return a dictionary containing both the HTTP status and the response body. This inconsistency may cause issues for callers expecting the standard `{url: {"status": ..., "body": ...}}` format.
```diff
- return server_url, await resp.json()
+ return server_url, {"status": resp.status, "body": await resp.json()}
```
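For illustration, here is a hedged sketch of a caller that relies on the uniform shape; `server_urls`, `payload`, and the surrounding class are assumed context, not code from this PR:

```python
async def load_on_all_servers(self, payload: dict) -> dict:
    """Sketch: fan out to every server and collect the standard shape."""
    results = {}
    for url in self.server_urls:
        server_url, result = await self._load_on_server(url, payload)
        results[server_url] = result
    # With a consistent {"status": ..., "body": ...} shape, every control
    # plane response can be checked the same way, whichever method made it.
    for url, result in results.items():
        if result["status"] != 200:
            raise RuntimeError(f"load failed on {url}: {result['body']}")
    return results
```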
```bash
# matching the convention in gsm8k_colocate.sh.
REWARD_MIN_VALUE=0.0

BACKEND_CONFIG='{"trainer.placement.colocate_all": true, "trainer.placement.policy_num_gpus_per_node": 4, "trainer.micro_forward_batch_size_per_gpu": 8, "trainer.micro_train_batch_size_per_gpu": 8, "generator.inference_engine.num_engines": 4, "generator.inference_engine.tensor_parallel_size": 1, "generator.inference_engine.backend": "vllm", "generator.inference_engine.run_engines_locally": true, "generator.inference_engine.weight_sync_backend": "nccl", "generator.inference_engine.async_engine": true, "generator.inference_engine.gpu_memory_utilization": 0.8, "generator.batched": true}'
```
I would prefer if the default tinker test were non-colocated, with 2 GPUs for the trainer and 2 for inference.
Actually, can you just add a separate non-colocated E2E test?
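For reference, a non-colocated variant might look like the following. This is only a sketch: the dotted keys mirror the colocated `BACKEND_CONFIG` above, the 2-trainer/2-inference GPU split follows the review comment, and additional placement keys may be required in practice.

```bash
# Sketch only; untested, keys assumed to match the colocated config above.
BACKEND_CONFIG='{"trainer.placement.colocate_all": false, "trainer.placement.policy_num_gpus_per_node": 2, "trainer.micro_forward_batch_size_per_gpu": 8, "trainer.micro_train_batch_size_per_gpu": 8, "generator.inference_engine.num_engines": 2, "generator.inference_engine.tensor_parallel_size": 1, "generator.inference_engine.backend": "vllm", "generator.inference_engine.run_engines_locally": true, "generator.inference_engine.weight_sync_backend": "nccl", "generator.inference_engine.async_engine": true, "generator.inference_engine.gpu_memory_utilization": 0.8, "generator.batched": true}'
```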
This PR introduces a Tinker CI, closing #1553. It uses `tinker_cookbook.recipes.math_rl.train`. Currently `REWARD_MIN_VALUE` is set to 0, but it will be tightened once we have a couple of runs. Added both colocated and fully async tests.

https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_tinker_ci?nw=nwuserahao
https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_tinker_fully_async_ci?nw=nwuserahao
While testing the Tinker CI, I found a port collision bug: there is a window between when we free a port so vLLM can take it and when vLLM actually binds it, during which other actors can grab the same port. Fixed this by allocating a distinct, non-overlapping port range for each actor (see the sketch below).
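A minimal sketch of the striding idea, assuming `SERVER_PORT_STRIDE`, `find_free_port`, and the base port are illustrative names and values rather than the exact implementation in this PR:

```python
import socket

SERVER_PORT_STRIDE = 10  # ports reserved per actor (assumed value)


def find_free_port(actor_index: int, base_port: int = 30000) -> int:
    """Probe only this actor's disjoint [start, start + stride) window."""
    start = base_port + actor_index * SERVER_PORT_STRIDE
    for port in range(start, start + SERVER_PORT_STRIDE):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))
                return port  # no other actor ever probes this window
            except OSError:
                continue
    raise RuntimeError(f"no free port in actor {actor_index}'s range")
```

Because each actor only ever binds ports inside its own window, a port freed for vLLM can no longer be grabbed by a different actor in the meantime.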
Additionally, changed the disk-reloading workaround introduced in #1606 to more closely mirror the upstream change vllm-project/vllm#41482, adding a new custom endpoint with the `load_inplace` override.
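A hedged sketch of the shape of that endpoint, pieced together from the snippets in the review above: the route registration follows vLLM's FastAPI conventions, the guard mirrors the reviewer's suggestion, and the `load_inplace` semantics described in the comment are an assumption from the PR discussion, not the actual implementation.

```python
from fastapi import APIRouter, HTTPException, Request
from vllm.entrypoints.openai.protocol import LoadLoRAAdapterRequest

router = APIRouter()


@router.post("/skyrl/v1/load_lora_adapter")
async def _skyrl_load_lora_adapter(request: LoadLoRAAdapterRequest, raw_request: Request):
    models = raw_request.app.state.openai_serving_models
    # Fail fast if the server was launched without --enable-lora.
    if not hasattr(models, "lora_requests") or models.lora_requests is None:
        raise HTTPException(status_code=400, detail="LoRA is not enabled on this server.")
    async with models.lora_resolver_lock[request.lora_name]:
        # load_inplace (assumption): overwrite the weights of an already
        # registered adapter under the same name instead of raising.
        ...
```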