Description
Hi,
In the context of a Rails application, I have to process huge XML documents that are "flat". I mean, they could just have been CSV documents instead of XML, but the source provides only XML.
While this appears to work well on MRI, memory consumption is very high on JRuby, and at some point the process hangs (out of memory).
The following stupid script mimics the problem I face:
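require 'pathname'
require 'nokogiri'

p = Pathname.new('big.xml')
n = 10_000_000
# Print a timestamped progress message.
ping = ->(msg) { puts "#{Time.now}: #{msg}" }

# Generate a large, flat XML document.
p.open('w') { |f|
  f.puts "<foos>"
  n.times { f.puts " <foo>Hello World</foo>" }
  f.puts "</foos>"
}

ping['before']
c = 0
# Walk every node with the Reader, reporting progress every million nodes.
Nokogiri::XML.Reader(p.open).each do |node|
  ping[c] if c % 1_000_000 == 0
  c += 1
end
ping['after']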
The documentation is somewhat ambiguous on how XML::Reader works. It is easy to understand "The Reader parser is good for when you need the speed of a SAX parser, but do not want to write a Document handler." as meaning "this is a SAX parser with a thin interface on top to make it easier than dealing with SAX yourself".
However, the first node returned by XML::Reader has the whole document as its inner_xml, so I am wondering whether XML::Reader really is SAX-like.
What we need in a document that looks like
<foos>
<foo>...</foo>
<foo>...</foo>
<foo>...</foo>
...
<foo>...</foo>
</foos>
is to iterate over just the <foo> entries, one at a time. What is the recommended approach in such a case?
Thanks a lot for Nokogiri