Update January 2023: Since writing this post, Xsv has seen multiple releases with feature and performance improvements, and has gathered an active user base that has amassed over 300,000 downloads of this gem. Thanks to pull requests and issue reports from many contributors, Xsv is now not only the most performant but also the most compatible Ruby gem to parse Excel files with.
Original post below
Today marks the release of Xsv 1.0.0, a high performance, pure-Ruby gem to parse .xlsx (Excel) files.
About a year ago I released the first version of Xsv, a Ruby gem to parse .xlsx (Excel) files. From the README:
Xsv is a fast, lightweight, pure Ruby parser for Office Open XML spreadsheet files (commonly known as Excel or .xlsx files). It strives to be minimal in the sense that it provides nothing a CSV reader wouldn’t, meaning it only deals with minimal formatting and cannot create or modify documents.
Xsv is designed for worksheets with a single table of data, optionally with a header row. It only casts values to basic Ruby types (integer, float, date and time) and does not deal with most formatting or more advanced functionality. It strives for fast processing of large worksheets with minimal RAM and CPU consumption and has been in production use since the earliest versions.
Xsv stands for ‘Excel Separated Values’, because Excel just gets in the way.
XML parser in native Ruby
Over the past year the community submitted various issues and pull requests, making Xsv a mature and stable product. For the 1.0.0 release I wanted to take things a bit further. Up until now, Xsv relied on a third-party gem to parse the XML inside the .xlsx files. This gem used a native extension written in C, which made it fast but also led to some issues. The extension didn’t perform well on alternative Ruby implementations like TruffleRuby and JRuby, plus it’s currently incompatible with Ruby 3.0’s new parallelism paradigm, Ractor.
The parsing of Excel files only requires a very basic XML parser, so I decided to see if I could write that in pure Ruby without sacrificing too much performance. The result is a SAX-like XML parser in only 88 lines of Ruby code. There is a small performance hit compared to the old implementation, but when compared to other Excel parsing gems Xsv is still the fastest in most if not all benchmarks. What’s even better, on JRuby and TruffleRuby the native Ruby version actually outperforms the native extension by a big margin!
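To give a feel for how small such a parser can be, here is a minimal sketch of the SAX style in pure Ruby. It is not Xsv's actual parser: the names TinySaxParser and EventCollector are made up for illustration, it works on a complete string rather than a streamed buffer, and it ignores entities, CDATA and comments that contain a ">" character.

```ruby
# Illustrative sketch only; TinySaxParser and EventCollector are hypothetical
# names and this is not the parser that ships with Xsv.
class TinySaxParser
  ATTRIBUTE = /([\w:.-]+)\s*=\s*"([^"]*)"/

  # handler must respond to start_element, end_element and characters
  def initialize(handler)
    @handler = handler
  end

  def parse(xml)
    pos = 0
    # Walk from one "<" to the next, emitting events as we go.
    while (lt = xml.index("<", pos))
      text = xml[pos...lt]
      @handler.characters(text) unless text.strip.empty?

      gt = xml.index(">", lt)
      raise ArgumentError, "unterminated tag" if gt.nil?

      dispatch(xml[(lt + 1)...gt])
      pos = gt + 1
    end
  end

  private

  def dispatch(tag)
    # Skip <?xml ...?> declarations and <!-- ... --> comments (simplified).
    return if tag.start_with?("?", "!")

    if tag.start_with?("/")
      @handler.end_element(tag[1..].strip)
    else
      self_closing = tag.end_with?("/")
      name, rest = tag.chomp("/").split(/\s+/, 2)
      attrs = rest.to_s.scan(ATTRIBUTE).to_h
      @handler.start_element(name, attrs)
      @handler.end_element(name) if self_closing
    end
  end
end

# A handler that simply records every event it receives.
class EventCollector
  attr_reader :events

  def initialize
    @events = []
  end

  def start_element(name, attrs)
    @events << [:start, name, attrs]
  end

  def end_element(name)
    @events << [:end, name]
  end

  def characters(text)
    @events << [:text, text]
  end
end

collector = EventCollector.new
xml = '<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1"/></row>'
TinySaxParser.new(collector).parse(xml)
collector.events.each { |event| p event }
```

The handler receives a flat stream of start, text and end events and never sees a document tree, which is what keeps memory use low on large worksheets.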
Optimized Ruby code
To avoid allocations, the parser extensively uses in-place modifications on the string buffer like String#chop!. The stackprof gem was invaluable in finding calls that were doing unnecessary allocations or performing poorly. All in all it was a fun exercise to get native Ruby code performing well. Plus, as Ruby runtimes mature, the performance of this code should only improve.
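The difference between the bang and non-bang variants can be sketched as follows. This is an illustration rather than code from Xsv: the allocations helper is a made-up name, and it relies on GC.stat(:total_allocated_objects), whose counts are specific to CRuby and may differ or be unavailable on other runtimes.

```ruby
# Hypothetical helper: counts objects allocated while the block runs (CRuby).
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

buffer = "x" * 1_000

# String#chop returns a new String on every call and leaves buffer untouched.
copying = allocations { 1_000.times { buffer.chop } }

# String#chop! shortens the receiver itself and returns the same object.
same_object = buffer.chop!.equal?(buffer)
in_place = allocations { 999.times { buffer.chop! } }

puts "chop:  #{copying} allocations"
puts "chop!: #{in_place} allocations"
puts "chop! returned the receiver: #{same_object}"
```

In a parser that touches the buffer for every tag, avoiding one temporary string per call adds up quickly, which is where a profiler like stackprof helps to spot the remaining hot paths.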
As of now, Xsv does not yet run in Ractor due to a problem with the rubyzip dependency. Besides, from my testing with a quickly patched rubyzip, Ractor is far from stable. But once this situation improves, Xsv is ready to run in Ractor. Today you can already run it multi-threaded in any runtime that allows for parallel execution of Ruby code. I would say interesting times are ahead for Rubyists!
Xsv on GitHub
Xsv on RubyGems
Xsv on Ruby Toolbox