Efficient XML parsing in Ruby
By Martijn Storck
TL;DR I wrote a super fast Excel file reader called Xsv, check it out! It’s fast and low on memory because it uses a SAX-based XML parser instead of a DOM-based one. Here are some benchmarks to prove it.
Simple API for XML, or SAX for short, is an event-based API whose development started in December 1997 [1]. It is an alternative to the well-known Document Object Model, which was developed around the same time [2].
While DOM parsers parse the document into a logical tree and allow querying that tree using convenient selectors such as CSS or XPath, SAX parsers process XML files as a stream, sending events such as ‘Element start’ and ‘Element end’ to the application calling the parser. For reference, Wikipedia has a nice SAX Example demonstrating how these parsers work.
DOM vs SAX
There is a clear tradeoff here. Take for example a query on an HTML document, where the application wants to collect all the links (a elements). DOM parsers have a very friendly API to accomplish this task, but they cannot operate without first parsing the entire document structure, and possibly the data, into a tree. This is memory intensive and potentially computationally intensive as well, which leads to problems when parsing larger documents.
SAX parsers, on the other hand, require very little memory but move all the logic to the calling application. In the case of the example query, the parser would run through the entire document, sending events to the calling application. That application can ignore all events until it gets an ‘Element start’ event for an a element. This requires a little more work from the programmer but can lead to faster parsing with less memory usage.
Here is the a element example in Ruby. Nokogiri provides both a DOM and a SAX parser, both based on libxml2.
DOM
require 'nokogiri'
# Parse the whole document into a tree, then query it with a CSS selector
x = Nokogiri::HTML.parse(File.read('ruby.html'))
x.css("a").each { |el| puts el.attributes['href']&.value }
SAX
class Links < Nokogiri::XML::SAX::Document
  # Called by the parser for every opening tag; attrs is a list of name/value pairs
  def start_element(name, attrs = [])
    if name == "a"
      puts attrs.to_h["href"]
    end
  end
end
s = Nokogiri::HTML::SAX::Parser.new(Links.new)
s.parse(File.open("ruby.html"))
The differences in performance and memory usage are negligible in this example, so it’s clear why most programmers will prefer (and default to) the DOM parser. Let’s look at a real-world example where a SAX parser provides a big advantage: dealing with large XML spreadsheet files.
Real world usage: Xsv, the fast Excel OOXML parsing rubygem
In a recent Ruby consultancy project I had to deal with parsing many large CSV and XLSX files to import data. The CSV files were fine with the built-in Ruby CSV methods, but my application was struggling with even modest Excel files (modest meaning a few thousand rows of data). All of the existing Excel parsing Ruby gems were slow, consumed a lot of memory, did not handle certain files well, or a combination of the three.
After ten minutes of careful deliberation and quickly skimming over the Office Open XML file format specification I decided to write my own Excel file parser. It didn’t seem that hard, and it wasn’t. My gem would have one purpose: allow the user to import the data in an Excel worksheet without caring about formatting, formulas, or modification of files. This should allow for a lightweight gem.
My initial implementation was built around the Nokogiri DOM parser, which allowed me to easily set up a proof of concept that could compete with existing gems. However, performance was lacking. Loading a 5MB Excel spreadsheet would easily allocate over 200MB of memory, and loading a 50MB spreadsheet would bring my MacBook Pro to its knees.
The DOM parser was clearly the culprit, so at that point I decided to reimplement the parser using the SAX API. It’s a perfect fit since I’m doing nothing more than importing the sheet top to bottom.
Instead of Nokogiri, I used Ox, an alternative with a very fast SAX parser that does not rely on extensive libraries. Ox is written and maintained by Peter Ohler, who also wrote the popular oj JSON parser for Ruby.
The OOXML Spreadsheet file format
The basic XML structure of an Excel worksheet, which consists of rows (row elements) of columns (c elements), is as follows:
<sheetData>
  <row>
    <c><v>Value of Cell A1</v></c>
    <c><v>Value of Cell B1</v></c>
  </row>
  <row>
    <c><v>Value of Cell A2</v></c>
  </row>
</sheetData>
The file format is very efficient, so the reality is a lot more complicated than the above, but you get the basic idea.
The old DOM parser would simply collect all the rows using css("sheetData row") and iterate over the c elements in those rows. You can find the implementation in older versions of my Xsv gem.
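As a rough illustration of that DOM approach, here is a minimal sketch run against the sample sheet above. It is not the code from the older Xsv versions: it uses REXML, which ships with standard Ruby installations, instead of Nokogiri so that it runs without extra gems, and rows_via_dom is a hypothetical helper name.

```ruby
require "rexml/document"

SHEET_XML = <<~XML
  <sheetData>
    <row>
      <c><v>Value of Cell A1</v></c>
      <c><v>Value of Cell B1</v></c>
    </row>
    <row>
      <c><v>Value of Cell A2</v></c>
    </row>
  </sheetData>
XML

# Hypothetical helper, not part of Xsv: builds the complete tree first and
# then walks it, so every row is in memory before the first one is returned.
def rows_via_dom(xml)
  doc = REXML::Document.new(xml)
  doc.root.get_elements("row").map do |row|
    # Each c element holds its value in a nested v element
    row.get_elements("c").map { |cell| cell.elements["v"]&.text }
  end
end

rows = rows_via_dom(SHEET_XML)
rows.each { |row| p row }
```

The code is short and readable, which is the appeal of the DOM style. The cost is that the whole document has to be parsed into a tree before any row becomes available.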
The new SAX parser basically does the same thing, but requires a bit more code and uses a state machine of sorts to keep track of what is happening in the document. The new parser comes in at under 130 lines, so it is still easy to grasp. For the sheet above, the following steps would be executed to parse the first row:
- start_element(row) clears the @current_row array/hash
- start_element(c) clears the @current_value
- start_element(v) sets the @state so the text handler knows it needs to store the text that’s coming
- text appends text to @current_value
- end_element(c) parses the content in @current_value and adds it to @current_row
- start_element(c) clears the @current_value
- start_element(v) sets the @state so the text handler knows it needs to store the text that’s coming
- text appends text to @current_value
- end_element(c) parses the content in @current_value and adds it to @current_row
- end_element(row) yields @current_row to the caller
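The steps above can be sketched as a small handler class. This is a simplified illustration under a few assumptions, not Xsv's actual parser: RowHandler is a hypothetical name, values are stored as plain strings without any type parsing, the state is reset on end_element(v) (a step the list leaves implicit), and the events are replayed by hand instead of coming from Ox so the snippet has no dependencies.

```ruby
# Hypothetical handler modelled on the steps above; not Xsv's real code.
class RowHandler
  def initialize(&on_row)
    @on_row = on_row       # block that receives each finished row
    @current_row = []
    @current_value = nil
    @state = nil
  end

  def start_element(name)
    case name
    when "row" then @current_row = []
    when "c"   then @current_value = +""   # fresh, mutable string
    when "v"   then @state = :value        # text that follows belongs to the cell
    end
  end

  def text(value)
    @current_value << value if @state == :value
  end

  def end_element(name)
    case name
    when "v"   then @state = nil           # assumption: stop collecting text here
    when "c"   then @current_row << @current_value
    when "row" then @on_row.call(@current_row)
    end
  end
end

rows = []
handler = RowHandler.new { |row| rows << row }

# Replay the callbacks a SAX parser would send for the first row of the sheet.
events = [
  [:start_element, "row"],
  [:start_element, "c"], [:start_element, "v"],
  [:text, "Value of Cell A1"],
  [:end_element, "v"], [:end_element, "c"],
  [:start_element, "c"], [:start_element, "v"],
  [:text, "Value of Cell B1"],
  [:end_element, "v"], [:end_element, "c"],
  [:end_element, "row"]
]
events.each { |callback, argument| handler.public_send(callback, argument) }

p rows
```

Because a row is handed to the block as soon as its closing tag arrives, the handler never holds more than one row at a time, which is where the memory savings come from.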
Benchmarks & Conclusion
The SAX parser streams the XML input and rows are yielded as they occur, without parsing the XML file upfront. This makes it possible for Xsv to parse xlsx files with very little memory use and very high performance. My friend shkm did a shootout benchmarking various Excel parsing gems which I encourage you to check out: Faster Excel Parsing in Ruby (schembri.me).
Next time you are facing trouble handling big XML or HTML files, definitely give a streaming SAX parser a shot!
The Xsv gem has been in production use for a few months and can be found on github.com/martijn/xsv.