Skip to content

Fix/service availability race#3711

Open
AravindhDeivaG wants to merge 2 commits intomoveit:humblefrom
AravindhDeivaG:fix/service-availability-race
Open

Fix/service availability race#3711
AravindhDeivaG wants to merge 2 commits intomoveit:humblefrom
AravindhDeivaG:fix/service-availability-race

Conversation

@AravindhDeivaG
Copy link
Copy Markdown

Summary

Fix a race condition in MoveGroupInterfaceImpl initialization where service calls could be made before the corresponding service servers are available.

Problem

During initialization, MoveGroupInterfaceImpl queries services such as query_planner_interface. In distributed setups, service discovery can be delayed due to asynchronous DDS initialization.

This can lead to a race condition where the service client attempts to call the service before it is available, resulting in incomplete initialization (e.g. missing markers in RViz) or blocking behavior.

This issue is more likely to appear in multi-machine setups and when additional discovery latency is introduced (e.g. DDS security).

For full reproduction details, see the related issue.

Solution

Add a guard to ensure that required services are available before issuing service requests during initialization.

This is implemented by waiting for the service using wait_for_service() before making the call.

Testing

  • Tested in a multi-machine setup
  • Confirmed that initialization is now consistent and no longer exhibits intermittent failures

Related Issue

#3698

…rvice availability

Service calls could occur before the corresponding server was available, leading to intermittent failures in distributed setups. This change ensures the service is available before making the call.
@AravindhDeivaG AravindhDeivaG changed the base branch from main to humble March 21, 2026 20:57
@mergify
Copy link
Copy Markdown

mergify bot commented Mar 21, 2026

Please target the main branch for development, we will backport the changes to humble for you if approved and if they don't break API.


// Function to check if the service server is available
auto wait_for_service = [this](auto& client, const std::string& service_name) {
if (!client->wait_for_service(std::chrono::seconds(10)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are already seeing problems where MoveIt is not tolerant enough to startup delays on large systems. I think this will cause more startup crashes on our system.

If there is going to be a hard timeout, could it at least be configurable?

Copy link
Copy Markdown
Author

@AravindhDeivaG AravindhDeivaG Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel there are two three possible fixes. Either we can make the block indefinite because there is no point in proceeding if rviz plugin can't find the list of planners available.

Secondly the action above the service has a standard 30 seconds delay which cannot be configured. We can add a similarly large delay.

Third we can declare a parameter which can be passed using yaml file as the plugin uses rviz node for all the communication and user should be able to configure it during launch

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal preference would be for this to use a configurable timeout, like the wait_for_servers timeout used in the action client above.

I'm not sure what the MoveIt maintainers would prefer though.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait for servers uses a constant 30 seconds delay. It is not configurable

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought wait_for_servers was a constructor argument of MoveGroupInterface, and so user-configurable?

Maybe we're talking about different waits? Where is the 30 second wait you're talking about defined?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, I understand now, thanks for the clarification.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful for the plugin to have configurable waits, but I'm guessing that's out of scope for this PR.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'll add another argument to the constructor call for service wait time, with default value of 10 seconds. Is that okay?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds sensible to me

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for the inputs.

…to include a rclcpp::Duration for wait_for_service delay time, which can be configured at the place of construction of MoveGroupInterfaceImpl object
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.22%. Comparing base (e7004f4) to head (9ae86c5).

Additional details and impacted files
@@            Coverage Diff             @@
##           humble    #3711      +/-   ##
==========================================
+ Coverage   51.21%   51.22%   +0.01%     
==========================================
  Files         407      407              
  Lines       30667    30667              
==========================================
+ Hits        15704    15705       +1     
+ Misses      14963    14962       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants