[bug] Lora bug fixes + port collision fix #1625
hao-aaron wants to merge 2 commits into NovaSky-AI:main
Conversation
Code Review
This pull request introduces a port striding mechanism to prevent port conflicts between inference servers and implements a custom vLLM endpoint for loading LoRA adapters to bypass specific upstream bugs. The review identifies a potential AttributeError when accessing the LoRA ID counter, suggests adding a boundary check for the maximum valid port number to avoid overflow errors, and recommends maintaining a consistent response format for control plane operations.
```python
lora_int_id = (
    models.lora_requests[lora_name].lora_int_id
    if lora_name in models.lora_requests
    else models.lora_id_counter.inc(1)
)
```
The lora_id_counter in vLLM's OpenAIServingModels is typically an iterator (like itertools.count or a custom Counter class) and does not provide an inc() method. Attempting to call models.lora_id_counter.inc(1) will likely result in an AttributeError. In standard vLLM implementations, the next ID should be obtained using next(models.lora_id_counter) or models.lora_id_counter.generate() depending on the specific vLLM version being used.
```python
async with models.lora_resolver_lock[lora_name]:
    lora_int_id = (
        models.lora_requests[lora_name].lora_int_id
        if lora_name in models.lora_requests
        else next(models.lora_id_counter)
    )
```
```diff
 end_port = start_port + SERVER_PORT_STRIDE
 sock: socket.socket | None = None
-while True:
+while port < end_port:
```
The loop condition port < end_port does not account for the maximum valid port number (65535). If start_port + SERVER_PORT_STRIDE exceeds 65535, socket.bind will eventually raise an OverflowError when port reaches 65536. Since the except OSError block does not catch OverflowError, the function will crash with a traceback instead of raising the descriptive RuntimeError defined at the end of the function. It is safer to include a check for the valid port range in the loop condition.
```diff
-while port < end_port:
+while port < end_port and port <= 65535:
```
```diff
 body = await resp.json()
 raise_for_status(resp, body)
-return server_url, {"status": resp.status, "body": await resp.text()}
+return server_url, await resp.json()
```
The return value of _load_on_server has been changed to return the JSON body directly, whereas other control plane methods in this class (e.g., pause, resume, sleep, reset_prefix_cache) return a standardized dictionary containing both the HTTP status and the body. This inconsistency in the response format across the RemoteInferenceClient API could lead to issues for callers or monitoring tools that expect a uniform structure for all control plane operations.
```diff
-return server_url, await resp.json()
+return server_url, {"status": resp.status, "body": await resp.json()}
```
Bug fixes found in #1616
While testing the tinker CI, I found a port collision bug.
```python
async def _run_server(self) -> None:
    """Internal method to run the HTTP server."""
    # Release the port reservation right before vLLM rebinds.
    if self._port_reservation is not None:
        self._port_reservation.close()  # ← step A
        self._port_reservation = None
    ...
    sock_addr = (self._cli_args.host, self._cli_args.port)
    sock = create_server_socket(sock_addr)
```
There is a window between releasing the port (step A) and vLLM rebinding it during which another actor can grab the same port. Fixed this by allocating a distinct, non-overlapping port range to each actor, as sketched below.
Additionally, changed the disk-reloading workaround created in #1606 to more closely mirror the upstream change vllm-project/vllm#41482, adding a new custom endpoint with the load_inplace override.