ParseFasta

So you want to parse a fasta file...

Installation

Add this line to your application's Gemfile:

gem 'parse_fasta'

And then execute:

$ bundle

Or install it yourself as:

$ gem install parse_fasta

JRuby

ParseFasta doesn't work with JRuby for now D:

Overview

Provides nice, programmatic access to fasta and fastq files. It's faster and more lightweight than BioRuby. And more fun!

It takes care of a lot of wacky edge cases, like parsing multi-blob gzipped files, and it is strict about formatting by default.
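To see why multi-blob gzipped files count as an edge case, here is a minimal sketch using only Ruby's standard library. It assumes "multi-blob" means a gzip file made of several concatenated members (for example, the result of `cat a.fa.gz b.fa.gz > both.fa.gz`). The helper name `read_all_members` is made up for this illustration and is not part of ParseFasta, and this is not necessarily how ParseFasta handles such files internally.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper (not ParseFasta API): read every member of a
# multi-member gzip stream from a seekable IO and return the joined text.
def read_all_members(io)
  out = +""
  loop do
    gz = Zlib::GzipReader.new(io)
    out << gz.read
    # Bytes the reader pulled from io that belong to the next member.
    unused = gz.unused
    # Finish the reader without closing the underlying IO.
    gz.finish
    # Rewind so the next reader starts at the next member's header.
    io.pos -= unused.length if unused
    break if io.eof?
  end
  out
end

# Two gzip members concatenated into one blob.
data = Zlib.gzip(">seq1\nACTG\n") + Zlib.gzip(">seq2\nGGCC\n")

# A single GzipReader only decompresses the first member.
naive = Zlib::GzipReader.new(StringIO.new(data)).read

# Looping over the members recovers both records.
all = read_all_members(StringIO.new(data))

puts naive.lines.grep(/^>/).length
puts all.lines.grep(/^>/).length
```

A plain `Zlib::GzipReader#read` silently stops at the end of the first member, so records in later members would be lost without a loop like this one.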

Documentation

Check out the parse_fasta docs for the full API documentation.

Usage

Here are some examples of using ParseFasta. Don't forget to require "parse_fasta" at the top of your program!

Print header and length of each record.

ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  puts [rec.header, rec.seq.length].join "\t"
end

You can parse fastQ files in exactly the same way.

ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  printf "Header: %s, Sequence: %s, Description: %s, Quality: %s\n",
         rec.header,
         rec.seq,
         rec.desc,
         rec.qual
end

Record#desc and Record#qual will be nil if the file you are parsing is a fastA file.

ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  if rec.qual
    # it's a fastQ record
  else
    # it's a fastA record
  end
end

You can also check this with Record#fastq?

ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  if rec.fastq?
    # it's a fastQ record
  else
    # it's a fastA record
  end
end
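If it helps to picture where those four fields come from, here is a rough, gem-free sketch of how the lines of each format could map onto header, seq, desc, and qual. The helpers `fasta_fields`, `fastq_fields`, and `fastq_like?` are hypothetical names for this illustration, plain hashes stand in for ParseFasta's Record class, and the mapping of desc to the text after the "+" is an assumption here. Real files need far more care than this.

```ruby
# Hypothetical helper (not ParseFasta API): a fastA record is a ">" header
# line followed by one or more sequence lines. There is no desc or qual.
def fasta_fields(lines)
  header, *seq_lines = lines.map(&:chomp)
  { header: header.delete_prefix(">"), seq: seq_lines.join, desc: nil, qual: nil }
end

# Hypothetical helper (not ParseFasta API): a simple four-line fastQ record
# is "@" header, sequence, "+" description, quality string.
def fastq_fields(lines)
  header, seq, desc, qual = lines.map(&:chomp)
  { header: header.delete_prefix("@"), seq: seq,
    desc: desc.delete_prefix("+"), qual: qual }
end

# Mirrors the idea behind checking rec.qual: only fastQ records carry quality.
def fastq_like?(fields)
  !fields[:qual].nil?
end

fa = fasta_fields([">seq1\n", "AC\n", "TG\n"])
fq = fastq_fields(["@read1\n", "ACTG\n", "+some description\n", "IIII\n"])

[fa, fq].each do |fields|
  kind = fastq_like?(fields) ? "fastQ" : "fastA"
  puts [fields[:header], fields[:seq], kind].join("\t")
end
```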

And there is a nice #to_s method that does what it should whether the record is fastA-like or fastQ-like. Check out the docs for info on the fancy #to_fasta and #to_fastq methods!

ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  puts rec.to_s
end

But of course, since it is a #to_s override...you don't even have to call it directly!

ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  puts rec
end

Sometimes your fasta file might have record separators (>) within the "sequence". For example, CD-HIT's .clstr files have headers within what would be the sequence part of the record. ParseFasta is really strict about formatting and will raise an error when trying to read these types of files. If you would like to parse them, use the check_fasta_seq: false flag like so:

ParseFasta::SeqFile.open(ARGV[0], check_fasta_seq: false).each_record do |rec|
  puts rec
end
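To make the .clstr case concrete, here is a small, gem-free sketch of why such a file looks suspicious to a strict parser. The sample text imitates the general layout of CD-HIT cluster output with made-up sequence names, and `split_records` is a hypothetical helper for this illustration, not how ParseFasta does its checking.

```ruby
# Hypothetical helper (not ParseFasta API): split fastA-like text into
# [header, sequence] pairs, treating only a ">" at the start of a line
# as a record separator.
def split_records(text)
  text.split(/^>/).reject(&:empty?).map do |chunk|
    header, *seq_lines = chunk.lines.map(&:chomp)
    [header, seq_lines.join]
  end
end

# Made-up sample in the style of a CD-HIT .clstr file.
clstr = <<~CLSTR
  >Cluster 0
  0\t2799aa, >seq_a... *
  1\t2214aa, >seq_b... at 80.00%
  >Cluster 1
  0\t1500aa, >seq_c... *
CLSTR

split_records(clstr).each do |header, seq|
  # A ">" inside the "sequence" is what a strict format check objects to.
  puts [header, seq.include?(">")].join("\t")
end
```

Every "sequence" here contains a >, so a strict check would reject each record, while skipping the check lets the cluster lines through as-is.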