Skip to content

XPath nodeset always in document-order#320

Open
tompng wants to merge 1 commit into
ruby:masterfrom
tompng:no_reverse_order
Open

XPath nodeset always in document-order#320
tompng wants to merge 1 commit into
ruby:masterfrom
tompng:no_reverse_order

Conversation

@tompng
Copy link
Copy Markdown
Member

@tompng tompng commented May 26, 2026

Fixes problems described in #319
Change nodeset ordering to match XPath 1.0 spec
Changes XPath#first XPath#match ordering

In XPath 1.0 spec, nodeset are unordered set, but many operations are required to use document ordered.
Functions such as local-name should use first node in document order.
Filter expressions such as (preceding::foo)[1] need to use document-order.
JavaScript's XPathResult.ORDERED_NODE_ITERATOR_TYPE XPathResult.FIRST_ORDERED_NODE_TYPE always orders in document order.
This commit will make REXML match to this ordering, and also fixes inconsistent ordering bug: same xpath may return document-ordered nodes or reverse-document-ordered nodes depending on nodesets.size.

Copilot AI review requested due to automatic review settings May 26, 2026 12:49
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Aligns REXML XPath node-set results with XPath 1.0 “document order” requirements and fixes inconsistent ordering behavior across queries (notably for first/match and reverse axes like preceding*).

Changes:

  • Update XPath parser step processing to always return document-ordered node-sets (remove reverse ordering and unify ordering logic).
  • Adjust existing preceding / preceding-sibling expectations in tests to match document order semantics.
  • Add a new regression test to ensure ordering is consistent regardless of matched nodeset size.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
test/xpath/test_base.rb Updates ordering-related assertions and adds a regression test for consistent document ordering.
test/xpath/test_axis_preceding_sibling.rb Updates expected preceding-sibling ordering to document order.
lib/rexml/xpath_parser.rb Removes reverse-order option and enforces document-order sorting for step results.
Comments suppressed due to low confidence (1)

lib/rexml/xpath_parser.rb:1

  • This change removes the previous nodesets.size == 1 fast-path, so a single large nodeset now always pays the de-dupe + index-sort cost. If this code is performance-sensitive, consider reintroducing a safe optimization: when nodesets.size == 1 and the yielded nodeset is already known to be in document order (e.g., from a forward axis), skip the expensive index sort (while still sorting whenever nodesets.size > 1 or the axis is reverse/known non-document-ordered).
# frozen_string_literal: false

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/xpath/test_base.rb Outdated
Comment thread lib/rexml/xpath_parser.rb
Comment on lines 664 to 666
new_arry = []
array_of_nodes.each { |node|
node_idx = []
@tompng tompng mentioned this pull request May 26, 2026
Copilot AI review requested due to automatic review settings May 28, 2026 03:46
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

lib/rexml/xpath_parser.rb:1

  • With step now always funneling through sort(raw_nodes), sort will be exercised more broadly. The ordering logic assumes node.parent.index(node) is always a valid numeric index; that’s not true for node types that are not children in their parent’s children list (notably attributes, and potentially namespace nodes, depending on REXML’s node model). In those cases index(node) can be nil, which can make the sort key contain nil and raise during comparisons. Consider explicitly handling non-child node types when computing the sort key (e.g., derive an order relative to the owning element, or fall back to a stable tie-breaker) so sort is safe for all node kinds that XPath can return.
# frozen_string_literal: false

Comment thread lib/rexml/xpath_parser.rb Outdated
Comment on lines +462 to +474

seen = {}.compare_by_identity
raw_nodes = []
nodesets.each do |nodeset|
nodeset.each do |node|
raw_node = node.respond_to?(:raw_node) ? node.raw_node : node
next if seen.key?(raw_node)
seen[raw_node] = true
raw_nodes << raw_node
end
ordered_nodeset = sort(raw_nodes, order)
end
ordered_nodeset = sort(raw_nodes)

Comment thread lib/rexml/xpath_parser.rb
end
if nodesets.size == 1
ordered_nodeset = nodesets[0]
else
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tompng
Thank you for this PR.

But, performance has decreased because the optimization for nodesets.size == 1 has been removed.
Is there any way to improve this?

$ git checkout master
$ rake build
rexml 3.4.5 built to pkg/rexml-3.4.5.gem.
$ gem install pkg/rexml-3.4.5.gem
$ git checkout no_reverse_order
$ benchmark-driver benchmark/xpath_no_reverse_order.yaml
Calculating -------------------------------------
                     master(a98066c)       after  master(a98066c,YJIT)  after(YJIT) 
               child         280.227      19.016               544.496       31.084 i/s -     100.000 times in 0.356853s 5.258603s 0.183656s 3.217105s
          descendant          18.445       9.682                28.523       15.258 i/s -     100.000 times in 5.421532s 10.327928s 3.505883s 6.554030s

Comparison:
                            child
master(a98066c,YJIT):       544.5 i/s 
     master(a98066c):       280.2 i/s - 1.94x  slower
         after(YJIT):        31.1 i/s - 17.52x  slower
               after:        19.0 i/s - 28.63x  slower

                       descendant
master(a98066c,YJIT):        28.5 i/s 
     master(a98066c):        18.4 i/s - 1.55x  slower
         after(YJIT):        15.3 i/s - 1.87x  slower
               after:         9.7 i/s - 2.95x  slower
  • benchmark/xpath_no_reverse_order.yaml
loop_count: 100
contexts:
  - name: master(a98066c)
    gems:
      rexml: 3.4.5
    require: false
    prelude: require 'rexml'
  - name: after
    prelude: |
      $LOAD_PATH.unshift(File.expand_path("lib"))
      require 'rexml'
  - name: master(a98066c,YJIT)
    gems:
      rexml: 3.4.5
    require: false
    prelude: |
      require 'rexml'
      RubyVM::YJIT.enable
  - name: after(YJIT)
    prelude: |
      $LOAD_PATH.unshift(File.expand_path("lib"))
      require 'rexml'
      RubyVM::YJIT.enable

prelude: |
  require "rexml"
  xml = "<root>" + (1..2000).map { |i| "<item id='#{i}'/>" }.join + "</root>"
  doc = REXML::Document.new(xml)

benchmark:
  child: REXML::XPath.match(doc, "/root/item")
  descendant: REXML::XPath.match(doc, "//item")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to improve this?

Yes, there are several ways.

  • A. Improve performance of sort

    • Should be done in a separate PR
  • B. Mark nodesets ordering

    • ordered_nodeset = marked_as_ordered? ? nodesets.first : nodesets.first.reverse
    • Ordering is currently :ordered | :reversed but in Optimize XPath step #315 we need to add :unordered
    • Conflicts with C
  • C. Delay nodeset ordering on demand

    • Sort the final result and in some functions such number(xpath)
    • Requires predicate position-independency analytics, requires sorted nodeset or not
    • After doing this, optimization of B needs to be extended. We need to re-implement B in a different way again.

I implemented B with changing the meaning of order: keyword parameter

  • Before: key to specify the output order: :forward | :reverse
  • After: key to specify the state of nodesets ordering: :ordered | :reversed | :unordered(future)

Perhaps this approach will be a blocker implementing C(delayed nodeset ordering), and optimizing XPathParser#step (#315) but we can consider it later.

In XPath 1.0 spec, nodeset are unordered set, but many operations are required to use document ordered.
Functions such as local-name should use first node in document order.
Filter expressions such as `(preceding::foo)[1]` need to use document-order.
JavaScript's `XPathResult.ORDERED_NODE_ITERATOR_TYPE` `XPathResult.FIRST_ORDERED_NODE_TYPE` always orders in document order.
This commit will make REXML match to this ordering, and also fixes inconsistent ordering bug: same xpath may return document-ordered nodes or reverse-document-ordered nodes depending on `nodesets.size`.
@tompng tompng force-pushed the no_reverse_order branch from bdcf501 to 1f46aef Compare June 5, 2026 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants