bio-maf

Build Status

This is a plugin for BioRuby adding support for the Multiple Alignment Format (MAF), used in bioinformatics to store whole-genome sets of multiple sequence alignments.

Ultimately it will provide indexed and sequential access to MAF data, as well as performing various manipulations on it and writing modified MAF files. So far, it only supports simple sequential parsing.

For more information, see the project wiki.

Developer documentation generated with YARD is available at rubydoc.info.

This is being developed by Clayton Wheeler as part of the Google Summer of Code 2012, under the auspices of the Open Bioinformatics Foundation. The development blog may be of interest.

Dependencies

Kyoto Cabinet is a database library, required for building MAF indexes. Install the core library in the appropriate way for your platform, as documented here.

If you're using MRI, the kyotocabinet-ruby gem will be used to interact with Kyoto Cabinet. For best performance, however, you should really consider using JRuby. On JRuby, the kyotocabinet-java gem will be used instead; this builds a Java library using JNI to call into Kyoto Cabinet. Please file a bug report if you encounter problems building or using this gem, which is still fairly new.

Installation

bio-maf is now published as a Ruby gem.

$ gem install bio-maf

Performance

This parser performs best under JRuby, particularly with Java

  1. See the Performance wiki page for more information. For best results, use JRuby in 1.9 mode with the ObjectProxyCache disabled:

    $ export JRUBY_OPTS='--1.9 -Xji.objectProxyCache=false'

Many parsing modes are multithreaded. Under JRuby, it will default to using one parser thread per available core, but if desired this can be configured with the :threads parser option.

Ruby 1.9.3 is fully supported, but does not perform as well, especially since its concurrency features are not useful for this workload.

Usage

Create an index on a MAF file

Much of the functionality of this library relies on an index. You can create one with maf_index(1), like so:

$ maf_index test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct

Or programmatically:

require 'bio-maf'
parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf")
idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct")

Extract blocks from an indexed MAF file, by genomic interval

Refer to mm8_chr7_tiny.maf.

require 'bio-maf'
parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')

q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
idx.find(q, parser).each do |block|
  ref_seq = block.sequences[0]
  puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
end

# => Matched block at 80082592, 121 bases
# => Matched block at 80082713, 54 bases

Filter species returned in alignment blocks

require 'bio-maf'
parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')

parser.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) }
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
blocks = idx.find(q, parser)
block = blocks.first
puts "Block has #{block.sequences.size} sequences."

# => Block has 3 sequences.

Extract blocks matching certain conditions

See also the Cucumber feature and step definitions for this.

Match only blocks with all specified species

q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)]
filter = { :with_all_species => %w(panTro2 loxAfr1) }
n_blocks = idx.find(q, parser, filter).count
# => 1

Match only blocks with a certain number of sequences

q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)]
filter = { :at_least_n_sequences => 6 }
n_blocks = idx.find(q, parser, filter).count
# => 1

Match only blocks within a text size range

q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)]
filter = { :min_size => 72, :max_size => 160 }
n_blocks = idx.find(q, parser, filter).count
# => 3

Process each block in a MAF file

require 'bio-maf'
p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
puts "MAF version: #{p.header.version}"
# => MAF version: 1

p.parse_blocks.each do |block|
  block.sequences.each do |seq|
    do_something(seq)
  end
end

Parse empty ('e') lines

Refer to chr22_ieq.maf.

require 'bio-maf'
p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
                         :parse_empty => false)
block = p.parse_block
block.sequences.size
# => 3

p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
                         :parse_empty => true)
block = p.parse_block
block.sequences.size
# => 4
block.sequences.find { |s| s.empty? }
# => #<Bio::MAF::EmptySequence:0x007fe1f39882d0 
#      @source="turTru1.scaffold_109008", @start=25049,
#      @size=1601, @strand=:+, @src_size=50103, @text=nil,
#      @status="I"> 

Remove gaps from parsed blocks

After filtering out species with Parser#sequence_filter, gaps may be left where there was an insertion present only in sequences that were filtered out. Such gaps can be removed by setting the :remove_gaps parser option:

require 'bio-maf'
p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
                         :remove_gaps => true)

Tile blocks together over an interval

Extracts alignment blocks overlapping the given genomic interval and constructs a single alignment block covering the entire interval for the specified species. Optionally, any gaps in coverage of the MAF file's reference sequence can be filled in from a FASTA sequence file. See the Cucumber feature for examples of output, and also the maf_tile(1) man page.

require 'bio-maf'
tiler = Bio::MAF::Tiler.new
tiler.index = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
tiler.parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
# optional
tiler.reference = Bio::MAF::FASTARangeReader.new('reference.fa.gz')
tiler.species = %w(mm8 rn4 hg18)
tiler.species_map = {
  'mm8' => 'mouse',
  'rn4' => 'rat',
  'hg18' => 'human'
}
tiler.interval = Bio::GenomicInterval.zero_based('mm8.chr7',
                                                 80082334,
                                                 80082468)
tiler.write_fasta($stdout)

Command line tools

Man pages for command line tools:

With gem-man installed, these can be read with:

$ gem man bio-maf

Other documentation

Also see the API documentation. For more code examples see the RSpec and Cucumber test files in the source tree.

Also, the scripts in the bin directory provide good worked examples of how to use the existing parsing API.

Project home page

For information on the source tree, documentation, examples, issues and how to contribute, see

http://github.com/csw/bioruby-maf

The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.

Cite

If you use this software, please cite one of

Biogems.info

This Biogem is published at biogems.info.

Copyright (c) 2012 Clayton Wheeler. See LICENSE.txt for further details.