FastCSV

A fast Ragel-based CSV parser, compatible with Ruby's CSV.

Usage

FastCSV.raw_parse is implemented in C and is the fastest way to read CSVs with FastCSV.

require 'fastcsv'

# Read from file.
File.open(filename) do |f|
  FastCSV.raw_parse(f) do |row|
    # do stuff
  end
end

# Read from an IO object.
FastCSV.raw_parse(StringIO.new("foo,bar\n")) do |row|
  # do stuff
end

# Read from a string.
FastCSV.raw_parse("foo,bar\n") do |row|
  # do stuff
end

# Transcode like with the CSV module.
FastCSV.raw_parse("\xF1\n", encoding: 'iso-8859-1:utf-8') do |row|
  # ["ñ"]
end

FastCSV can be used as a drop-in replacement for CSV (replace CSV with FastCSV) except:

The :row_sep option is ignored. Only the default, :auto, is implemented (#9).
The :col_sep option must be a single-byte string, like the default comma (#8). Python and PHP support single-byte delimiters only, as do the major libraries in JavaScript, Java, C, Objective-C and Perl. A major Node library supports multi-byte delimiters. The CSV Dialect Description Format allows only single-byte delimiters.
If FastCSV raises an error, you can't continue reading (#3). Its error messages don't perfectly match those of CSV.

A few minor caveats:

Use FastCSV.parse_line(string, options) instead of string.parse_csv(options).
If you were passing CSV an IO object on which you had wrapped #gets (for example, as described in this article), #gets will not be called.
The :field_size_limit option is ignored. If you need to prevent DoS attacks – the ostensible reason for this option – limit the size of the input, not the size of quoted fields.
FastCSV doesn't support UTF-16 or UTF-32. See UTF-8 Everywhere.
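
The advice above about limiting input size instead of relying on :field_size_limit could look roughly like the sketch below. read_limited and InputTooLarge are hypothetical names, not part of FastCSV or CSV, and the default UTF-8 encoding is an assumption.

```ruby
require 'stringio'

# Hypothetical error class, raised when the input is larger than allowed.
class InputTooLarge < StandardError; end

# Hypothetical helper: read at most max_bytes from an IO and raise if the
# input is larger, so oversized input never reaches the parser.
def read_limited(io, max_bytes, encoding: Encoding::UTF_8)
  # Ask for one byte more than allowed so an oversized input is detectable.
  data = io.read(max_bytes + 1) || ''
  raise InputTooLarge, "input exceeds #{max_bytes} bytes" if data.bytesize > max_bytes
  # IO#read with a length returns a binary string, so tag the expected encoding.
  data.force_encoding(encoding)
end

csv_text = read_limited(StringIO.new("foo,bar\n"), 1024)
# csv_text can now be passed to the parser, for example:
#   FastCSV.raw_parse(csv_text) { |row| ... }
```

Reading the capped input into a string first trades streaming for a hard upper bound on memory, which is the property the size limit is meant to provide.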

Development

ragel -G2 ext/fastcsv/fastcsv.rl
ragel -Vp ext/fastcsv/fastcsv.rl | dot -Tpng -o machine.png
rake compile
gem uninstall fastcsv
rake install
rake
rspec test/runner.rb test/csv

Implementation

FastCSV implements its Ragel-based CSV parser in C at FastCSV::Parser.

FastCSV is a subclass of CSV. It overrides #shift, replacing the parsing code, in order to act as a drop-in replacement.

FastCSV's raw_parse requires a block to which it yields one row at a time. FastCSV uses Fibers to pass control back to #shift while parsing.

CSV delegates IO methods to the IO object it's reading. IO methods that move the pointer within the file, like rewind, change the behavior of CSV's #shift. However, FastCSV's C code won't take notice. We therefore null the Fiber whenever the pointer is moved, so that #shift uses a new Fiber.
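
The Fiber technique described above can be illustrated in plain Ruby. This is only a sketch of the idea, not FastCSV's actual code: push_parse is a hypothetical stand-in for a block-yielding parser like raw_parse (it naively splits lines on commas), and RowReader is a hypothetical class.

```ruby
require 'stringio'

# Hypothetical stand-in for a block-yielding parser such as raw_parse:
# it naively splits each line on commas and yields one row at a time.
def push_parse(io)
  io.each_line { |line| yield line.chomp.split(',') }
end

# Hypothetical pull-style reader built on top of the push-style parser.
class RowReader
  def initialize(io)
    @io = io
    @fiber = nil
  end

  # Return the next row, or nil at end of input.
  def shift
    @fiber ||= Fiber.new do
      # Each parsed row suspends the fiber and hands the row to #shift.
      push_parse(@io) { |row| Fiber.yield(row) }
      nil # the fiber's final value signals end of input
    end
    @fiber.alive? ? @fiber.resume : nil
  end

  # Moving the pointer invalidates the in-flight parse, so drop the fiber;
  # the next #shift starts a fresh one from the new position.
  def rewind
    @io.rewind
    @fiber = nil
  end
end

reader = RowReader.new(StringIO.new("a,b\nc,d\n"))
first = reader.shift  # ["a", "b"]
reader.rewind
again = reader.shift  # ["a", "b"] again, read by a new fiber
```

Without the reset in rewind, the suspended fiber would resume its old parse and return the second row, ignoring the pointer move.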

CSV's #shift runs the regular expression in the :skip_lines option against a row's raw text. FastCSV::Parser implements a row method, which returns the most recently parsed row's raw text.
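
To show why the raw text matters for :skip_lines, here is a small example using Ruby's standard CSV library. Substituting FastCSV for CSV should behave the same if the drop-in claim holds, but that substitution is an assumption and is not exercised here.

```ruby
require 'csv'

data = "# generated file\n\"#not a comment\",1\n"

# The pattern is matched against each line's raw text, not its parsed fields.
# The second line's raw text starts with a quote character, so it is kept
# even though its first field starts with "#".
rows = CSV.parse(data, skip_lines: /\A#/)
# rows == [["#not a comment", "1"]]
```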

FastCSV is tested against the same tests as CSV. See TESTS.md for details.

Why?

I evaluated many Ruby CSV gems, and they were either too slow or had implementation errors. rcsv is fast and libcsv-based, but it skips blank rows (Ruby's CSV module returns an empty array) and silently fails on input with an unclosed quote. bamfcsv is well implemented, but it's considerably slower on large files. I looked for Ragel-based CSV parsers to copy, but they either had implementation errors or could not handle large files. commas looks good, but it performs a memory check on each character, which is overkill.

Acknowledgements

Started as a Ruby 2.1 fork of MoonWolf's CSVScan, found in this commit. CSVScan uses Ragel code from HPricot from this commit. Most of the Ruby (i.e. non-C, non-Ragel) methods are copied from CSV.