FastCSV
A fast Ragel-based CSV parser, compatible with Ruby's CSV.
Usage
FastCSV.raw_parse is implemented in C and is the fastest way to read CSVs with FastCSV.
require 'fastcsv'

# Read from file.
File.open(filename) do |f|
  FastCSV.raw_parse(f) do |row|
    # do stuff
  end
end

# Read from an IO object.
FastCSV.raw_parse(StringIO.new("foo,bar\n")) do |row|
  # do stuff
end

# Read from a string.
FastCSV.raw_parse("foo,bar\n") do |row|
  # do stuff
end

# Transcode like with the CSV module.
FastCSV.raw_parse("\xF1\n", encoding: 'iso-8859-1:utf-8') do |row|
  # ["ñ"]
end
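The encoding option in the last example follows Ruby's "external:internal" convention: the input is read as ISO-8859-1 and rows are handed back as UTF-8. As a rough illustration of that conversion on its own, here is a minimal sketch using String#encode from Ruby core. The helper name latin1_to_utf8 is hypothetical and not part of FastCSV.

```ruby
# Hypothetical helper (not part of FastCSV): reinterpret the given bytes as
# ISO-8859-1 and transcode them to UTF-8, which is what an
# 'iso-8859-1:utf-8' encoding option asks for.
def latin1_to_utf8(bytes)
  # String#encode(destination, source) treats the receiver as `source`,
  # regardless of the encoding it is currently tagged with.
  bytes.encode("UTF-8", "ISO-8859-1")
end

latin1_to_utf8("\xF1") # "ñ" (U+00F1), tagged as UTF-8
```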
FastCSV can be used as a drop-in replacement for CSV (replace CSV with FastCSV) except:
- The :row_sep option is ignored. The default :auto is implemented (#9).
- The :col_sep option must be a single-byte string, like the default "," (#8). Python and PHP support single-byte delimiters only, as do the major libraries in JavaScript, Java, C, Objective-C and Perl. A major Node library supports multi-byte delimiters. The CSV Dialect Description Format allows only single-byte delimiters.
- If FastCSV raises an error, you can't continue reading (#3). Its error messages don't perfectly match those of CSV.
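If a delimiter comes from user input or configuration, it may be worth checking it against the single-byte restriction before handing it to the parser. The following is a minimal sketch of such a check. The helper single_byte_col_sep? is a hypothetical name, not a FastCSV method, and how FastCSV itself reacts to an invalid :col_sep is not shown here.

```ruby
# Hypothetical pre-flight check (not part of FastCSV): a delimiter is
# acceptable only if it is a String occupying exactly one byte.
def single_byte_col_sep?(sep)
  sep.is_a?(String) && sep.bytesize == 1
end

single_byte_col_sep?(",")  # true
single_byte_col_sep?("\t") # true
single_byte_col_sep?("::") # false: two bytes
single_byte_col_sep?("§")  # false: one character, but two bytes in UTF-8
```

Note that the check uses bytesize rather than length, since a single character can span several bytes.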
A few minor caveats:
- Use FastCSV.parse_line(string, options) instead of string.parse_csv(options).
- If you were passing CSV an IO object on which you had wrapped #gets (for example, as described in this article), #gets will not be called.
- The :field_size_limit option is ignored. If you need to prevent DoS attacks – the ostensible reason for this option – limit the size of the input, not the size of quoted fields.
- FastCSV doesn't support UTF-16 or UTF-32. See UTF-8 Everywhere.
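One way to limit the size of the input, as suggested for :field_size_limit above, is to cap the number of bytes read before parsing. This is a minimal sketch under stated assumptions: read_with_limit and InputTooLarge are hypothetical names, the whole input is read into memory, and the returned string would then be passed to the parser.

```ruby
require 'stringio'

# Hypothetical error class for oversized input.
class InputTooLarge < StandardError; end

# Hypothetical helper (not part of FastCSV): read at most max_bytes from io,
# raising if the input turns out to be larger.
def read_with_limit(io, max_bytes)
  # Ask for one byte more than allowed so that oversized input is detectable.
  # IO#read(n) returns nil at end of input, hence the fallback to "".
  data = io.read(max_bytes + 1) || ""
  raise InputTooLarge, "input exceeds #{max_bytes} bytes" if data.bytesize > max_bytes
  data
end

data = read_with_limit(StringIO.new("foo,bar\n"), 1024)
# data can now be handed to the parser
```

IO#read with a length returns a binary-encoded string, so the result may need force_encoding before parsing non-ASCII data.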
Development
ragel -G2 ext/fastcsv/fastcsv.rl
ragel -Vp ext/fastcsv/fastcsv.rl | dot -Tpng -o machine.png
rake compile
gem uninstall fastcsv
rake install
rake
rspec test/runner.rb test/csv
Implementation
FastCSV implements its Ragel-based CSV parser in C at FastCSV::Parser.
FastCSV is a subclass of CSV. It overrides #shift, replacing the parsing code, in order to act as a drop-in replacement.
FastCSV's raw_parse requires a block to which it yields one row at a time. FastCSV uses Fibers to pass control back to #shift while parsing.
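The hand-off between a block-yielding parser and a pull-style reader can be sketched with a plain Fiber. This is an illustration of the general technique, not FastCSV's code: toy_raw_parse and row_fiber are hypothetical names, and the toy parser just splits lines on commas.

```ruby
require 'stringio'

# Hypothetical stand-in for a block-based parser. It naively splits each
# line on commas and is NOT a real CSV parser.
def toy_raw_parse(io)
  io.each_line { |line| yield line.chomp.split(",") }
end

# Hypothetical helper: wrap the push-style parser in a Fiber so that each
# call to #resume returns the next row, then nil once the input is exhausted.
def row_fiber(io)
  Fiber.new do
    # Fiber.yield suspends the parser mid-run and hands the row to the caller.
    toy_raw_parse(io) { |row| Fiber.yield(row) }
    nil
  end
end

fiber = row_fiber(StringIO.new("foo,bar\nbaz,qux\n"))
fiber.resume # ["foo", "bar"]
fiber.resume # ["baz", "qux"]
fiber.resume # nil: the parser has finished
```

Resuming the Fiber again after it has finished raises FiberError, so a real reader needs to track completion.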
CSV delegates IO methods to the IO object it's reading. IO methods that move the pointer within the file, like rewind, change the behavior of CSV's #shift. However, FastCSV's C code won't take notice. We therefore null the Fiber whenever the pointer is moved, so that #shift uses a new Fiber.
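The idea of discarding the Fiber when the pointer moves can be sketched as follows. This is a simplified model under stated assumptions, not FastCSV's implementation: ToyReader and toy_raw_parse are hypothetical, the toy parser splits on commas, and only rewind is handled.

```ruby
require 'stringio'

# Hypothetical stand-in for a block-based parser (NOT a real CSV parser).
def toy_raw_parse(io)
  io.each_line { |line| yield line.chomp.split(",") }
end

# Hypothetical reader whose #shift pulls rows out of a Fiber and whose
# #rewind throws the Fiber away so that parsing restarts cleanly.
class ToyReader
  def initialize(io)
    @io = io
    @fiber = nil
  end

  def shift
    # Lazily create a Fiber that runs the parser against the IO object.
    @fiber ||= Fiber.new do
      toy_raw_parse(@io) { |row| Fiber.yield(row) }
      nil
    end
    row = @fiber.resume
    @fiber = nil if row.nil? # parser finished; never resume a dead Fiber
    row
  end

  def rewind
    @io.rewind
    @fiber = nil # the suspended parser is abandoned; #shift builds a new one
  end
end

reader = ToyReader.new(StringIO.new("foo,bar\nbaz,qux\n"))
reader.shift  # ["foo", "bar"]
reader.rewind
reader.shift  # ["foo", "bar"] again, from a fresh Fiber
```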
CSV's #shift runs the regular expression in the :skip_lines option against a row's raw text. FastCSV::Parser implements a row method, which returns the most recently parsed row's raw text.
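To see why the raw text is needed, consider a skip pattern that matches comment lines. The sketch below is a toy model, not FastCSV's or CSV's code: toy_rows is a hypothetical helper that splits on commas, and it only illustrates that the pattern is tested against the unparsed line rather than the parsed fields.

```ruby
require 'stringio'

# Hypothetical helper (NOT a real CSV parser): collect rows, skipping any
# line whose raw text matches the skip_lines pattern.
def toy_rows(io, skip_lines: nil)
  rows = []
  io.each_line do |raw|
    # The pattern is matched against the raw line, before any field splitting.
    next if skip_lines && skip_lines.match?(raw)
    rows << raw.chomp.split(",")
  end
  rows
end

toy_rows(StringIO.new("# a,comment\nfoo,bar\n"), skip_lines: /\A#/)
# only the "foo,bar" line is parsed into a row
```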
FastCSV is tested against the same tests as CSV. See TESTS.md for details.
Why?
I evaluated many CSV Ruby gems, and they were either too slow or had implementation errors. rcsv is fast and libcsv-based, but it skips blank rows (Ruby's CSV module returns an empty array) and silently fails on input with an unclosed quote. bamfcsv is well implemented, but it's considerably slower on large files. I looked for Ragel-based CSV parsers to copy, but they either had implementation errors or could not handle large files. commas looks good, but it performs a memory check on each character, which is overkill.
Acknowledgements
Started as a Ruby 2.1 fork of MoonWolf's CSVScan, found in this commit. CSVScan uses Ragel code from HPricot from this commit. Most of the Ruby (i.e. non-C, non-Ragel) methods are copied from CSV.
Copyright (c) 2014 James McKinney, released under the MIT license