[tx] Tinker CI + bugfixes #1616
Conversation
Code Review
This pull request introduces a custom LoRA loading endpoint `/skyrl/v1/load_lora_adapter` to bypass upstream vLLM bugs and implements a port striding mechanism (`SERVER_PORT_STRIDE`) to prevent port conflicts between inference servers. It also adds CI configuration and E2E test scripts for the Tinker backend. Feedback includes adding a check for LoRA support in the new endpoint to avoid potential attribute errors, and ensuring consistent return formats in the inference client.
```python
models = request.app.state.openai_serving_models
async with models.lora_resolver_lock[lora_name]:
```
The `_skyrl_load_lora_adapter` endpoint assumes that LoRA is enabled on the server. If the server is started without LoRA support (e.g., `--enable-lora` is not passed to vLLM), `models.lora_requests` and `models.lora_resolver_lock` may not be initialized, leading to an `AttributeError` or `TypeError`. A check should be added to ensure LoRA is enabled before accessing these attributes:
```python
if not hasattr(models, "lora_requests") or models.lora_requests is None:
    raise HTTPException(
        status_code=400,
        detail="LoRA is not enabled on this server.",
    )
async with models.lora_resolver_lock[lora_name]:
```

```diff
  body = await resp.json()
  raise_for_status(resp, body)
- return server_url, {"status": resp.status, "body": await resp.text()}
+ return server_url, await resp.json()
```
The return value of `_load_on_server` is inconsistent with other control plane methods in this class (e.g., `_call_server` at line 857), which return a dictionary containing both the HTTP status and the response body. This inconsistency may cause issues for callers expecting the standard `{url: {"status": ..., "body": ...}}` format.
```diff
- return server_url, await resp.json()
+ return server_url, {"status": resp.status, "body": await resp.json()}
```
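For illustration, here is a hedged sketch of a caller that relies on the uniform shape; `server_urls`, `payload`, and the surrounding class are assumed context, not code from this PR:

```python
async def load_on_all_servers(self, payload: dict) -> dict:
    """Sketch: fan out to every server and collect the standard shape."""
    results = {}
    for url in self.server_urls:
        server_url, result = await self._load_on_server(url, payload)
        results[server_url] = result
    # With a consistent {"status": ..., "body": ...} shape, every control
    # plane response can be checked the same way, whichever method made it.
    for url, result in results.items():
        if result["status"] != 200:
            raise RuntimeError(f"load failed on {url}: {result['body']}")
    return results
```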
```bash
# matching the convention in gsm8k_colocate.sh.
REWARD_MIN_VALUE=0.0

BACKEND_CONFIG='{"trainer.placement.colocate_all": true, "trainer.placement.policy_num_gpus_per_node": 4, "trainer.micro_forward_batch_size_per_gpu": 8, "trainer.micro_train_batch_size_per_gpu": 8, "generator.inference_engine.num_engines": 4, "generator.inference_engine.tensor_parallel_size": 1, "generator.inference_engine.backend": "vllm", "generator.inference_engine.run_engines_locally": true, "generator.inference_engine.weight_sync_backend": "nccl", "generator.inference_engine.async_engine": true, "generator.inference_engine.gpu_memory_utilization": 0.8, "generator.batched": true}'
```
I would prefer if the default tinker test were non-colocated, with 2 GPUs for the trainer and 2 for inference.
Actually, can you just add a separate non-colocated E2E test?
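For reference, a non-colocated variant might look like the following. This is only a sketch: the dotted keys mirror the colocated `BACKEND_CONFIG` above, the 2-trainer/2-inference GPU split follows the review comment, and additional placement keys may be required in practice.

```bash
# Sketch only; untested, keys assumed to match the colocated config above.
BACKEND_CONFIG='{"trainer.placement.colocate_all": false, "trainer.placement.policy_num_gpus_per_node": 2, "trainer.micro_forward_batch_size_per_gpu": 8, "trainer.micro_train_batch_size_per_gpu": 8, "generator.inference_engine.num_engines": 2, "generator.inference_engine.tensor_parallel_size": 1, "generator.inference_engine.backend": "vllm", "generator.inference_engine.run_engines_locally": true, "generator.inference_engine.weight_sync_backend": "nccl", "generator.inference_engine.async_engine": true, "generator.inference_engine.gpu_memory_utilization": 0.8, "generator.batched": true}'
```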
This PR introduces a Tinker CI, closing #1553. It uses `tinker_cookbook.recipes.math_rl.train`. Currently `REWARD_MIN_VALUE` is set to 0, but it will be tightened once we have a couple of runs. Added both colocated and fully async tests.

https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_tinker_ci?nw=nwuserahao
https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_tinker_fully_async_ci?nw=nwuserahao
While testing the Tinker CI, I found a port collision bug: there is a window between when we free a port so vLLM can take it and when vLLM actually binds it, during which other actors can grab the same port. Fixed this by allocating a distinct, non-overlapping port range for each actor (see the sketch below).
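A minimal sketch of the striding idea, assuming `SERVER_PORT_STRIDE`, `find_free_port`, and the base port are illustrative names and values rather than the exact implementation in this PR:

```python
import socket

SERVER_PORT_STRIDE = 10  # ports reserved per actor (assumed value)


def find_free_port(actor_index: int, base_port: int = 30000) -> int:
    """Probe only this actor's disjoint [start, start + stride) window."""
    start = base_port + actor_index * SERVER_PORT_STRIDE
    for port in range(start, start + SERVER_PORT_STRIDE):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))
                return port  # no other actor ever probes this window
            except OSError:
                continue
    raise RuntimeError(f"no free port in actor {actor_index}'s range")
```

Because each actor only ever binds ports inside its own window, a port freed for vLLM can no longer be grabbed by a different actor in the meantime.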
Additionally, changed the disk-reloading workaround introduced in #1606 to more closely mirror the upstream change vllm-project/vllm#41482, adding a new custom endpoint with the `load_inplace` override.
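A hedged sketch of the shape of that endpoint, pieced together from the snippets in the review above: the route registration follows vLLM's FastAPI conventions, the guard mirrors the reviewer's suggestion, and the `load_inplace` semantics described in the comment are an assumption from the PR discussion, not the actual implementation.

```python
from fastapi import APIRouter, HTTPException, Request
from vllm.entrypoints.openai.protocol import LoadLoRAAdapterRequest

router = APIRouter()


@router.post("/skyrl/v1/load_lora_adapter")
async def _skyrl_load_lora_adapter(request: LoadLoRAAdapterRequest, raw_request: Request):
    models = raw_request.app.state.openai_serving_models
    # Fail fast if the server was launched without --enable-lora.
    if not hasattr(models, "lora_requests") or models.lora_requests is None:
        raise HTTPException(status_code=400, detail="LoRA is not enabled on this server.")
    async with models.lora_resolver_lock[request.lora_name]:
        # load_inplace (assumption): overwrite the weights of an already
        # registered adapter under the same name instead of raising.
        ...
```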