Skip to content

Add support for multi-threaded xpath evaluation#191

Open
maartendeprez wants to merge 2 commits intoKWARC:masterfrom
maartendeprez:add-readonly-xpath-support
Open

Add support for multi-threaded xpath evaluation#191
maartendeprez wants to merge 2 commits intoKWARC:masterfrom
maartendeprez:add-readonly-xpath-support

Conversation

@maartendeprez
Copy link
Copy Markdown

The readonly module provides support for multi-threaded read-only operations over a parsed document. However, the findnodes method takes a shared reference to a Document, which is not readonly and therefore not Send + Sync. This makes it impossible, currently, to evaluate xpaths over the document from multiple threads.

This PR changes that by introducing a read-only version of Document, Context and Object, as well as a findnodes_readonly method on RoNode.

@dginev
Copy link
Copy Markdown
Member

dginev commented Mar 20, 2026

Thank you, this appears to be a very useful extension.

Could I also request that you mirror our one findnodes test here:

#[test]
/// Test that the dual findnodes interfaces are operational
fn findnodes_interfaces() {
let parser = Parser::default_html();
let doc_result = parser.parse_file("tests/resources/file02.xml");
assert!(doc_result.is_ok());
let doc = doc_result.unwrap();
// Xpath interface
let mut context = Context::new(&doc).unwrap();
let body = context.evaluate("/html/body").unwrap().get_nodes_as_vec();
let p_result = context.findnodes("p", body.first());
assert!(p_result.is_ok());
let p = p_result.unwrap();
assert_eq!(p.len(), 1);
// Node interface
let body_node = body.first().unwrap();
let p2_result = body_node.findnodes("p");
assert!(p2_result.is_ok());
let p2 = p2_result.unwrap();
assert_eq!(p2.len(), 1);
}

With a similar entry inside readonly_tests.rs ?

@maartendeprez
Copy link
Copy Markdown
Author

maartendeprez commented Mar 22, 2026

Copied and adapted the tests and added the required read-only methods to implement this. I left two tests commented out in the read-only version:

  • cleanup_safely_unlinked_xpath_nodes
  • xpath_with_namespaces

Both require the register_namespace method, which seems to require exclusive access to the RoContext object. This could be ensured by taking &mut self, were it not that the context is cloneable without actually duplicating the underlying libxml object. The first one would also require RoDocument::as_node, which requires exclusive access to the document.

@maartendeprez
Copy link
Copy Markdown
Author

Hm, upon further consideration, my design is not correct at all, since the Ro* types wrap the read-write variants containing Rc to the libxml pointer, which lacks the necessary atomicity in the reference counter to share this between threads. I'll need to rethink this.

@dginev
Copy link
Copy Markdown
Member

dginev commented Mar 26, 2026

I'll need to rethink this.

Is there anything I can help with? The main background I can offer is that multi-threaded workflows on the Ro* nodes are completely OK as long as they only ever need to do read-only operations.

Mixing the read-only API of the create with the writable API is likely something that needs some significant redesign work, as it wasn't an original desire. My thinking at the time was, if you need even one writable operation, do not use the Ro* types.

@maartendeprez
Copy link
Copy Markdown
Author

My thinking at the moment is that for the read-only API, I'll need to do away with all the reference counting stuff (i.e. the Rcs) in the Ro* types and use the auto-generated pointer types directly, as you did for RoNode. Unfortunately that causes quite a bit of code duplication.

Excluding mixed use of the writable and read-only APIs at the type level, is what i was alluding to in issue 192. That would need to be done in a separate PR. In rust when we're "not supposed to" do something that would cause undefined behavior, the API is usually designed in such a way that the compiler will not allow us to do so without writing "unsafe" code, ensuring that "Safe Rust can’t cause Undefined Behavior".

On the other hand, if we want to allow mixing read-only and read-write APIs, it seems possible to leverage the rust type-system to enforce the contract on top of the Ro*-style API. Methods that require exclusive access would take &mut self, forcing the user to take care of exclusion rules, e.g. by wrapping a shared RoDocument in a Mutex or RwLock if it needs to be accessed from multiple threads. However I am not intimately familiar with libxml2; you might immediately see issues I am not aware of.

If that kind of API is feasible, it would provide a completely different (arguably lower-level) API. If we go through with this, I think we should separate the bindings in a libxml2-sys crate and provide two separate crates for the two styles of "safe" API wrappers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants