JRuby XML::Reader memory performance is poor
#2224 opened on Apr 23, 2021
Description
Hi,
In the context of a Rails application, I have to process huge XML documents that are "flat". I mean, they could just have been CSV documents instead of XML, but the source provides only XML.
While this appears to work well in MRI, under JRuby the memory consumption is very high, and eventually the process hangs because it runs out of memory.
The following simple script reproduces the problem I face:
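require 'nokogiri'
require 'pathname'

path = Pathname.new('big.xml')
n = 10_000_000
ping = ->(msg) { puts "#{Time.now}: #{msg}" }

# Write a large, flat document: one <foo> entry per line.
path.open('w') do |f|
  f.puts "<foos>"
  n.times { f.puts " <foo>Hello World</foo>" }
  f.puts "</foos>"
end

ping['before']
c = 0
# Stream through the file and log progress every million nodes.
path.open do |io|
  Nokogiri::XML::Reader(io).each do |node|
    ping[c] if c % 1_000_000 == 0
    c += 1
  end
end
ping['after']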
The documentation is somewhat ambiguous on how XML::Reader works. It is easy to understand "The Reader parser is good for when you need the speed of a SAX parser, but do not want to write a Document handler." as meaning "this is a SAX parser with a thin interface on top to make it easier than dealing with SAX yourself".
However, the first node returned by XML::Reader has the whole document as its inner_xml, so I am wondering whether XML::Reader is really a SAX-style parser.
What we need in a document that looks like
<foos>
<foo>...</foo>
<foo>...</foo>
<foo>...</foo>
...
<foo>...</foo>
</foos>
is to iterate over just the entries. What is the recommended approach in such a case?
Thanks a lot for Nokogiri