XPath nodeset always in document-order#320
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Aligns REXML XPath node-set results with XPath 1.0 “document order” requirements and fixes inconsistent ordering behavior across queries (notably for first/match and reverse axes like preceding*).
Changes:
- Update XPath parser step processing to always return document-ordered node-sets (remove reverse ordering and unify ordering logic).
- Adjust existing
preceding/preceding-siblingexpectations in tests to match document order semantics. - Add a new regression test to ensure ordering is consistent regardless of matched nodeset size.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/xpath/test_base.rb | Updates ordering-related assertions and adds a regression test for consistent document ordering. |
| test/xpath/test_axis_preceding_sibling.rb | Updates expected preceding-sibling ordering to document order. |
| lib/rexml/xpath_parser.rb | Removes reverse-order option and enforces document-order sorting for step results. |
Comments suppressed due to low confidence (1)
lib/rexml/xpath_parser.rb:1
- This change removes the previous
nodesets.size == 1fast-path, so a single large nodeset now always pays the de-dupe + index-sort cost. If this code is performance-sensitive, consider reintroducing a safe optimization: whennodesets.size == 1and the yielded nodeset is already known to be in document order (e.g., from a forward axis), skip the expensive index sort (while still sorting whenevernodesets.size > 1or the axis is reverse/known non-document-ordered).
# frozen_string_literal: false
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| new_arry = [] | ||
| array_of_nodes.each { |node| | ||
| node_idx = [] |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
lib/rexml/xpath_parser.rb:1
- With
stepnow always funneling throughsort(raw_nodes),sortwill be exercised more broadly. The ordering logic assumesnode.parent.index(node)is always a valid numeric index; that’s not true for node types that are not children in their parent’schildrenlist (notably attributes, and potentially namespace nodes, depending on REXML’s node model). In those casesindex(node)can benil, which can make the sort key containniland raise during comparisons. Consider explicitly handling non-child node types when computing the sort key (e.g., derive an order relative to the owning element, or fall back to a stable tie-breaker) sosortis safe for all node kinds that XPath can return.
# frozen_string_literal: false
|
|
||
| seen = {}.compare_by_identity | ||
| raw_nodes = [] | ||
| nodesets.each do |nodeset| | ||
| nodeset.each do |node| | ||
| raw_node = node.respond_to?(:raw_node) ? node.raw_node : node | ||
| next if seen.key?(raw_node) | ||
| seen[raw_node] = true | ||
| raw_nodes << raw_node | ||
| end | ||
| ordered_nodeset = sort(raw_nodes, order) | ||
| end | ||
| ordered_nodeset = sort(raw_nodes) | ||
|
|
| end | ||
| if nodesets.size == 1 | ||
| ordered_nodeset = nodesets[0] | ||
| else |
There was a problem hiding this comment.
@tompng
Thank you for this PR.
But, performance has decreased because the optimization for nodesets.size == 1 has been removed.
Is there any way to improve this?
$ git checkout master
$ rake build
rexml 3.4.5 built to pkg/rexml-3.4.5.gem.
$ gem install pkg/rexml-3.4.5.gem
$ git checkout no_reverse_order
$ benchmark-driver benchmark/xpath_no_reverse_order.yaml
Calculating -------------------------------------
master(a98066c) after master(a98066c,YJIT) after(YJIT)
child 280.227 19.016 544.496 31.084 i/s - 100.000 times in 0.356853s 5.258603s 0.183656s 3.217105s
descendant 18.445 9.682 28.523 15.258 i/s - 100.000 times in 5.421532s 10.327928s 3.505883s 6.554030s
Comparison:
child
master(a98066c,YJIT): 544.5 i/s
master(a98066c): 280.2 i/s - 1.94x slower
after(YJIT): 31.1 i/s - 17.52x slower
after: 19.0 i/s - 28.63x slower
descendant
master(a98066c,YJIT): 28.5 i/s
master(a98066c): 18.4 i/s - 1.55x slower
after(YJIT): 15.3 i/s - 1.87x slower
after: 9.7 i/s - 2.95x slower- benchmark/xpath_no_reverse_order.yaml
loop_count: 100
contexts:
- name: master(a98066c)
gems:
rexml: 3.4.5
require: false
prelude: require 'rexml'
- name: after
prelude: |
$LOAD_PATH.unshift(File.expand_path("lib"))
require 'rexml'
- name: master(a98066c,YJIT)
gems:
rexml: 3.4.5
require: false
prelude: |
require 'rexml'
RubyVM::YJIT.enable
- name: after(YJIT)
prelude: |
$LOAD_PATH.unshift(File.expand_path("lib"))
require 'rexml'
RubyVM::YJIT.enable
prelude: |
require "rexml"
xml = "<root>" + (1..2000).map { |i| "<item id='#{i}'/>" }.join + "</root>"
doc = REXML::Document.new(xml)
benchmark:
child: REXML::XPath.match(doc, "/root/item")
descendant: REXML::XPath.match(doc, "//item")There was a problem hiding this comment.
Is there any way to improve this?
Yes, there are several ways.
-
A. Improve performance of sort
- Should be done in a separate PR
-
B. Mark nodesets ordering
ordered_nodeset = marked_as_ordered? ? nodesets.first : nodesets.first.reverse- Ordering is currently
:ordered | :reversedbut in Optimize XPath step #315 we need to add:unordered - Conflicts with C
-
C. Delay nodeset ordering on demand
- Sort the final result and in some functions such
number(xpath) - Requires predicate position-independency analytics, requires sorted nodeset or not
- After doing this, optimization of B needs to be extended. We need to re-implement B in a different way again.
- Sort the final result and in some functions such
I implemented B with changing the meaning of order: keyword parameter
- Before: key to specify the output order:
:forward | :reverse - After: key to specify the state of nodesets ordering:
:ordered | :reversed | :unordered(future)
Perhaps this approach will be a blocker implementing C(delayed nodeset ordering), and optimizing XPathParser#step (#315) but we can consider it later.
In XPath 1.0 spec, nodeset are unordered set, but many operations are required to use document ordered. Functions such as local-name should use first node in document order. Filter expressions such as `(preceding::foo)[1]` need to use document-order. JavaScript's `XPathResult.ORDERED_NODE_ITERATOR_TYPE` `XPathResult.FIRST_ORDERED_NODE_TYPE` always orders in document order. This commit will make REXML match to this ordering, and also fixes inconsistent ordering bug: same xpath may return document-ordered nodes or reverse-document-ordered nodes depending on `nodesets.size`.
Fixes problems described in #319
Change nodeset ordering to match XPath 1.0 spec
Changes
XPath#firstXPath#matchorderingIn XPath 1.0 spec, nodeset are unordered set, but many operations are required to use document ordered.
Functions such as
local-nameshould use first node in document order.Filter expressions such as
(preceding::foo)[1]need to use document-order.JavaScript's
XPathResult.ORDERED_NODE_ITERATOR_TYPEXPathResult.FIRST_ORDERED_NODE_TYPEalways orders in document order.This commit will make REXML match to this ordering, and also fixes inconsistent ordering bug: same xpath may return document-ordered nodes or reverse-document-ordered nodes depending on
nodesets.size.