Explore why the builtin XPath function for CSS class selector is so slow on JRuby
#2138 opened on Dec 18, 2020
Description
Please describe the bug
For background, see #2135 and #2137. The native Java implementation of nokogiri-builtin:css-class(@class,'foo') is slower than the corresponding XPath expression contains(concat(' ',normalize-space(@class),' '),' foo ') and I don't know why. (The identical C implementation is ~2x faster than libxml2's XPath evaluator.)
The bottleneck isn't even the implementation in NokogiriXpathFunction.java:builtinCssClass(): I can short-circuit that method to return false and the query is still considerably slower than the XPath expression.
I guess it's possible that Xerces has super-duper optimized these functions, or that some kind of really amazing caching is happening under the hood, but that wouldn't explain why simply calling through the function resolver to a native Java function would be so much slower.
I'm not experienced enough with Java to do the profiling necessary to understand what's happening here. I'd love someone's help.
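For anyone unfamiliar with the idiom, here is a minimal pure-Ruby sketch of the check that both the XPath expression and the builtin function are meant to perform. The helper names `normalize_space` and `css_class_match?` are hypothetical and are not part of Nokogiri; this only illustrates the semantics and says nothing about where the time goes on JRuby.

```ruby
# Hypothetical helpers (not part of Nokogiri) mirroring the XPath expression
#   contains(concat(' ', normalize-space(@class), ' '), ' foo ')

# XPath 1.0 treats space, tab, CR and LF as whitespace.
XPATH_WHITESPACE = /[ \t\r\n]+/

# Collapse whitespace runs to a single space and trim both ends,
# like XPath's normalize-space().
def normalize_space(str)
  str.gsub(XPATH_WHITESPACE, " ").sub(/\A /, "").sub(/ \z/, "")
end

# Pad the normalized attribute value with spaces so that a class name only
# matches as a whole token, never as a substring of another class name.
def css_class_match?(class_attr, name)
  " #{normalize_space(class_attr.to_s)} ".include?(" #{name} ")
end

puts css_class_match?("nav  foo\tbar", "foo") # => true
puts css_class_match?("foobar", "foo")        # => false
puts css_class_match?(nil, "foo")             # => false (no class attribute)
```

The padding is what keeps `foo` from matching `foobar`, which is why the XPath version needs the concat/normalize-space wrapping at all.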
Help us reproduce what you're seeing
Here's a benchmark script that attempts to bust any caches:
    #! /usr/bin/env ruby

    require "nokogiri"
    require "benchmark/ips"
    require "securerandom"

    root = File.expand_path(File.join(File.dirname(__FILE__), ".."))
    puts RUBY_DESCRIPTION

    Benchmark.ips do |x|
      x.time = 10
      doc = Nokogiri::HTML::Document.parse(File.read(File.join(root, "test/files/tlm.html")))
      [
        [:xpath, "//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]"],
        [:xpath, "//*[nokogiri-builtin:css-class(@class, 'xxxx')]"],
      ].each do |method, query|
        x.report("#{method}(\"#{query}\")") do
          # Use a fresh random class name on every iteration to defeat caching.
          cache_buster = query.gsub("xxxx", "x" + SecureRandom.alphanumeric(4))
          doc.public_send(method, cache_buster)
        end
      end
      x.compare!
    end
and here is the result:
jruby 9.2.9.0 (2.5.7) 2019-10-30 458ad3e OpenJDK 64-Bit Server VM 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 on 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 [linux-x86_64]
Warming up --------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
    74.000 i/100ms
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
    41.000 i/100ms
Calculating -------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
    814.536 (± 9.6%) i/s - 8.066k in 10.022432s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
    443.781 (± 6.8%) i/s - 4.428k in 10.029857s
Comparison:
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]"): 814.5 i/s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]"): 443.8 i/s - 1.84x (± 0.00) slower
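To put the comparison in per-query terms, here is a rough sketch of the arithmetic using the two iterations-per-second figures reported above. The variable names are illustrative, and the result ignores the ±9.6% and ±6.8% error margins, so treat it as an estimate only.

```ruby
# Throughput figures copied from the benchmark output above (iterations/second).
xpath_ips   = 814.536 # contains(concat(...)) expression
builtin_ips = 443.781 # nokogiri-builtin:css-class

slowdown   = xpath_ips / builtin_ips # ratio reported by x.compare!
xpath_ms   = 1000.0 / xpath_ips      # milliseconds per query
builtin_ms = 1000.0 / builtin_ips
extra_ms   = builtin_ms - xpath_ms   # added cost per query

puts format("slowdown: %.2fx", slowdown)
puts format("per query: %.2f ms vs %.2f ms (+%.2f ms)", xpath_ms, builtin_ms, extra_ms)
```

On these numbers the builtin function costs roughly one extra millisecond per whole-document query.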
Expected behavior
I expected the builtin function to be as fast as, or faster than, the equivalent XPath expression.
Environment
To access the nokogiri-builtin XPath functions, you'll need to be on the branch from #2137 until it's merged into master; after that, you'll need to be on master.