sparklemotion/nokogiri

JRuby XML::Reader memory performance is poor

Open

#2224 opened on Apr 23, 2021

Labels: help wanted, platform/jruby

Description

Hi,

In the context of a Rails application, I have to process huge XML documents that are "flat". I mean, they could just have been CSV documents instead of XML, but the source provides only XML.

While this appears to work well on MRI, on JRuby the memory consumption is very high, and at some point the process gets stuck (out of memory).

The following stupid script mimics the problem I face:

require 'nokogiri'
require 'pathname'

p = Pathname.new('big.xml')
n = 10_000_000
ping = ->(msg) { puts "#{Time.now}: #{msg}" }

# Generate a large, flat XML document.
p.open('w') { |f|
    f.puts "<foos>"
    n.times { f.puts "  <foo>Hello World</foo>" }
    f.puts "</foos>"
}

# Stream through the document, logging progress every million nodes.
ping['before']
c = 0
Nokogiri::XML.Reader(p.open).each do |node|
    ping[c] if c % 1_000_000 == 0
    c += 1
end
ping['after']

The documentation is somewhat ambiguous on how XML::Reader works. It is easy to understand "The Reader parser is good for when you need the speed of a SAX parser, but do not want to write a Document handler." as meaning "this is a SAX parser with a thin interface on top to make it easier than dealing with SAX yourself".

However, the first node returned by XML::Reader has the whole document as inner_xml, so I am wondering if XML::Reader is really SAX.

What we need in a document that looks like

<foos>
  <foo>...</foo>
  <foo>...</foo>
  <foo>...</foo>
  ...
  <foo>...</foo>
</foos>

is to iterate over just the entries. What is the recommended approach in such a case?

Thanks a lot for Nokogiri
