[bug] Lora bug fixes + port collision fix #1625
hao-aaron wants to merge 2 commits into NovaSky-AI:main
Conversation
Code Review
This pull request introduces a port striding mechanism to prevent port conflicts between inference servers and implements a custom vLLM endpoint for loading LoRA adapters to bypass specific upstream bugs. The review identifies a potential AttributeError when accessing the LoRA ID counter, suggests adding a boundary check for the maximum valid port number to avoid overflow errors, and recommends maintaining a consistent response format for control plane operations.
```python
lora_int_id = (
    models.lora_requests[lora_name].lora_int_id
    if lora_name in models.lora_requests
    else models.lora_id_counter.inc(1)
)
```
The lora_id_counter in vLLM's OpenAIServingModels is typically an iterator (like itertools.count or a custom Counter class) and does not provide an inc() method. Attempting to call models.lora_id_counter.inc(1) will likely result in an AttributeError. In standard vLLM implementations, the next ID should be obtained using next(models.lora_id_counter) or models.lora_id_counter.generate() depending on the specific vLLM version being used.
```python
async with models.lora_resolver_lock[lora_name]:
    lora_int_id = (
        models.lora_requests[lora_name].lora_int_id
        if lora_name in models.lora_requests
        else next(models.lora_id_counter)
    )
```
```diff
 end_port = start_port + SERVER_PORT_STRIDE
 sock: socket.socket | None = None
-while True:
+while port < end_port:
```
The loop condition port < end_port does not account for the maximum valid port number (65535). If start_port + SERVER_PORT_STRIDE exceeds 65535, socket.bind will eventually raise an OverflowError when port reaches 65536. Since the except OSError block does not catch OverflowError, the function will crash with a traceback instead of raising the descriptive RuntimeError defined at the end of the function. It is safer to include a check for the valid port range in the loop condition.
```diff
-while port < end_port:
+while port < end_port and port <= 65535:
```
```diff
 body = await resp.json()
 raise_for_status(resp, body)
-return server_url, {"status": resp.status, "body": await resp.text()}
+return server_url, await resp.json()
```
The return value of _load_on_server has been changed to return the JSON body directly, whereas other control plane methods in this class (e.g., pause, resume, sleep, reset_prefix_cache) return a standardized dictionary containing both the HTTP status and the body. This inconsistency in the response format across the RemoteInferenceClient API could lead to issues for callers or monitoring tools that expect a uniform structure for all control plane operations.
```diff
-return server_url, await resp.json()
+return server_url, {"status": resp.status, "body": await resp.json()}
```
Bug fixes found in #1616
While testing the tinker CI, I found a port collision bug.
```python
async def _run_server(self) -> None:
    """Internal method to run the HTTP server."""
    # Release the port reservation right before vLLM rebinds.
    if self._port_reservation is not None:
        self._port_reservation.close()  # ← step A
        self._port_reservation = None
    ...
    sock_addr = (self._cli_args.host, self._cli_args.port)
    sock = create_server_socket(sock_addr)
```
There is a window between releasing the port (step A) and vLLM rebinding it during which another actor can grab the same port. Fixed this by allocating a distinct, non-overlapping port range to each actor, as sketched below.
Additionally, changed the disk-reloading workaround created in #1606 to more closely mirror the upstream change vllm-project/vllm#41482, adding a new custom endpoint with the load_inplace override.