parse_fasta

So you want to parse a fasta file...

Installation

Add this line to your application's Gemfile:

gem 'parse_fasta'

And then execute:

$ bundle

Or install it yourself as:

$ gem install parse_fasta

Overview

Provides nice, programmatic access to fasta and fastq files, as well as providing Sequence and Quality helper classes. It's more lightweight than BioRuby. And more fun! ;)

Documentation

Checkout parse_fasta docs for the full api documentation.

Usage

Some examples...

A little script to print header and length of each record.

require 'parse_fasta'

FastaFile.open(ARGV[0]).each_record do |header, sequence|
  puts [header, sequence.length].join("\t")
end

And here, a script to calculate GC content:

FastaFile.open(ARGV[0]).each_record do |header, sequence|
  puts [header, sequence.gc].join("\t")
end

Now we can parse fastq files as well!

FastqFile.open(ARGV[0]).each_record do |head, seq, desc, qual|
  puts [header, qual.qual_scores.join(',')].join("\t")
end

What if you don't care if the input is a fastA or a fastQ? No problem!

SeqFile.open(ARGV[0]).each_record do |head, seq|
  puts [header, seq].join "\t"
end

Read fasta file into a hash.

seqs = FastaFile.open(ARGV[0]).to_hash

Versions

1.7

Add SeqFile#to_hash, FastaFile#to_hash and FastqFile#to_hash.

1.6

Added SeqFile class, which accepts either fastA or fastQ files. It uses FastaFile and FastqFile internally. You can use this class if you want your scripts to accept either fastA or fastQ files.

If you need the description and quality string, you should use FastqFile instead.

1.6.1

Better internal handling of empty sequences -- instead of raising errors, pass empty sequences.

1.6.2

FastaFile::open now raises a ParseFasta::DataFormatError when passed files that don't begin with a >.

1.5

Now accepts gzipped files. Huzzah!

1.4

Added methods:

Sequence.base_counts
Sequence.base_frequencies

1.3

Add additional functionality to each_record method.

Info

I often like to use the fasta format for other things like so

>fruits
pineapple
pear
peach
>veggies
peppers
parsnip
peas

rather than having this in a two column file like this

fruit,pineapple
fruit,pear
fruit,peach
veggie,peppers
veggie,parsnip
veggie,peas

So I added functionality to each_record to keep each line a record separate in an array. Here's an example using the above file.

info = []
FastaFile.open(f, 'r').each_record(1) do |header, lines|
  info << [header, lines]
end

Then info will contain the following arrays

['fruits', ['pineapple', 'pear', 'peach']],
['veggies', ['peppers', 'parsnip', 'peas']]

1.2

Added mean_qual method to the Quality class.

1.1.2

Dropped Ruby requirement to 1.9.3

(Note, if you want to build the docs with yard and you're using Ruby 1.9.3, you may have to install the redcarpet gem.)

1.1

Added: Fastq and Quality classes

1.0

Added: Fasta and Sequence classes

Removed: File monkey patch

0.0.5

Last version with File monkey patch.

Benchmark

Perhaps this isn't exactly fair since BioRuby is a big module with lots of features and error checking, whereas parse_fasta is meant to be lightweight and easy to use for my own research. Oh well ;)

FastaFile#each_record

You're probably wondering...How does it compare to BioRuby in some super accurate benchmarking tests? Lucky for you, I calculated sequence length for each fasta record with both the each_record method from this gem and using the FastaFormat class from BioRuby. You can see the test script in benchmark.rb.

The test file contained 2,009,897 illumina reads and the file size was 1.1 gigabytes. Here are the results from Ruby's Benchmark class:

                  user     system      total        real
parse_fasta  64.530000   1.740000  66.270000 ( 67.081502)
bioruby     116.250000   2.260000 118.510000 (120.223710)

Hot dog! It's faster :)

FastqFile#each_record

The same sequence length test as above, but this time with a fastq file containing 4,000,000 illumina reads.

                    user     system      total        real
this_fastq     62.610000   1.660000  64.270000 ( 64.389408)
bioruby_fastq 165.500000   2.100000 167.600000 (167.969636)

Sequence#gc

The test is done on random strings matcing /[AaCcTtGgUu]/. this_gc is Sequence.new(str).gc, and bioruby_gc is Bio::Sequence::NA.new(str).gc_content.

To see how the methods scales, the test 1 string was 2,000,000 bases, test 2 was 4,000,000 and test 3 was 8,000,000 bases.

                   user     system      total        real
this_gc 1      0.030000   0.000000   0.030000 (  0.029145)
bioruby_gc 1   2.030000   0.010000   2.040000 (  2.157512)

this_gc 2      0.060000   0.000000   0.060000 (  0.059408)
bioruby_gc 2   4.060000   0.020000   4.080000 (  4.334159)

this_gc 3      0.120000   0.000000   0.120000 (  0.185434)
bioruby_gc 3   8.060000   0.020000   8.080000 (  8.659071)

Nice!

Troll: "But Ryan, when will you find the GC of an 8,000,000 base sequence?"

Me: "Step off, troll!"

Test suite & docs

For a good time, you could clone this repo and run the test suite with rspec! Or if you just don't trust that it works like it should. The specs probably need a little clean up...so fork it and clean it up ;)

Same with the docs. Clone the repo and build them yourself with yard if you are in need of some excitement.

Notes

Only the SeqFile class actually checks to make sure that you passed in a "proper" fastA or fastQ file, so watch out.