Module: Bio::Sequence::Common

Included in: AA, Generic, NA

Defined in: lib/bio/sequence/common.rb, lib/bio/sequence/compat.rb

Overview

DESCRIPTION

Bio::Sequence::Common is a Mixin implementing methods common to Bio::Sequence::AA and Bio::Sequence::NA. All of these methods are available to either Amino Acid or Nucleic Acid sequences, and by encapsulation are also available to Bio::Sequence objects.

USAGE

# Create a sequence
dna = Bio::Sequence.auto('atgcatgcatgc')

# Splice out a subsequence using a Genbank-style location string
puts dna.splice('complement(1..4)')

# What is the base composition?
puts dna.composition

# Create a random sequence with the composition of a current sequence
puts dna.randomize

Instance Method Summary

#+(*arg) ⇒ Object

Create a new sequence by adding to an existing sequence.
#<<(*arg) ⇒ Object
#composition ⇒ Object

Returns a hash of the occurrence counts for each residue or base.
#concat(*arg) ⇒ Object

Add new data to the end of the current sequence.
#normalize! ⇒ Object (also: #seq!)

Normalize the current sequence, removing all whitespace and transforming all positions to uppercase if the sequence is AA or transforming all positions to lowercase if the sequence is NA.
#randomize(hash = nil) ⇒ Object

Returns a randomized sequence.
#seq ⇒ Object

Create a new sequence based on the current sequence.
#splice(position) ⇒ Object (also: #splicing)

Return a new sequence extracted from the original using a GenBank style position string.
#subseq(s = 1, e = self.length) ⇒ Object

Returns a new sequence containing the subsequence identified by the start and end numbers given as parameters.
#to_fasta(header = '', width = nil) ⇒ Object

Bio::Sequence#to_fasta is DEPRECATED. Do not use Bio::Sequence#to_fasta! Use Bio::Sequence#output instead.
#to_s ⇒ Object (also: #to_str)

Return sequence as String.
#total(hash) ⇒ Object

Returns a float total value for the sequence given a hash of base or residue values.
#window_search(window_size, step_size = 1) ⇒ Object

This method steps through a sequence in steps of ‘step_size’ by subsequences of ‘window_size’.

Instance Method Details

#+(*arg) ⇒ `Object`

Create a new sequence by adding to an existing sequence. The existing sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
s2 = s + 'atgc'
puts s2                                 #=> "atgcatgc"
puts s                                  #=> "atgc"

The new sequence is of the same class as the existing sequence if the new data was added to an existing sequence,

puts s2.class == s.class                #=> true

but if an existing sequence is added to a String, the result is a String

s3 = 'atgc' + s
puts s3.class                           #=> String

Returns: new Bio::Sequence::NA/AA or String object




# File 'lib/bio/sequence/common.rb', line 121

def +(*arg)
  self.class.new(super(*arg))
end
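
The class-preserving behaviour described above can be tried without BioRuby installed. The following is a minimal sketch, assuming only that the sequence class is a String subclass; `ToySeq` is a hypothetical stand-in, not a BioRuby class.

```ruby
# Hypothetical String subclass used only for illustration.
class ToySeq < String
  # String#+ returns a plain String, so the result is wrapped
  # back into the receiver's own class.
  def +(*arg)
    self.class.new(super(*arg))
  end
end

s = ToySeq.new('atgc')
puts((s + 'atgc').class)   #=> ToySeq
puts(('atgc' + s).class)   #=> String
puts s                     #=> atgc
```

When the plain String is on the left, String#+ is called rather than the overridden method, which is why the result is a String.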

#<<(*arg) ⇒ `Object`




# File 'lib/bio/sequence/common.rb', line 98

def <<(*arg)
  concat(*arg)
end

#composition ⇒ `Object`

Returns a hash of the occurrence counts for each residue or base.

s = Bio::Sequence::NA.new('atgc')
puts s.composition              #=> {"a"=>1, "c"=>1, "g"=>1, "t"=>1}

Returns: Hash object

# File 'lib/bio/sequence/common.rb', line 215

def composition
  count = Hash.new(0)
  self.scan(/./) do |x|
    count[x] += 1
  end
  return count
end
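
The counting idiom in the listing above works on any String. The following is a standalone sketch that does not require BioRuby; `residue_counts` is a hypothetical helper name. It uses `each_char`, whereas the listing uses `scan(/./)`, which skips newline characters.

```ruby
# Hypothetical helper: count each character of a plain String.
def residue_counts(seq)
  counts = Hash.new(0)                      # missing keys start at 0
  seq.each_char { |ch| counts[ch] += 1 }
  counts
end

counts = residue_counts('atgcatgcaa')
puts counts['a']   #=> 4
puts counts['t']   #=> 2
puts counts['n']   #=> 0
```

Because the hash is created with a default of 0, looking up a residue that never occurs returns 0 rather than nil.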

#concat(*arg) ⇒ `Object`

Add new data to the end of the current sequence. The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
s << 'atgc'
puts s                                  #=> "atgcatgc"
s << s
puts s                                  #=> "atgcatgcatgcatgc"

Returns: current Bio::Sequence::NA/AA object (modified)




# File 'lib/bio/sequence/common.rb', line 94

def concat(*arg)
  super(self.class.new(*arg))
end

#normalize! ⇒ `Object` Also known as: seq!

Normalize the current sequence, removing all whitespace and transforming all positions to uppercase if the sequence is AA or transforming all positions to lowercase if the sequence is NA. The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
s.normalize!

Returns: current Bio::Sequence::NA/AA object (modified)

# File 'lib/bio/sequence/common.rb', line 78

def normalize!
  initialize(self)
  self
end
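
The normalization rule described above (strip whitespace, then lowercase for NA or uppercase for AA) can be sketched on plain Strings. This is an illustration of the documented behaviour only, not the library's implementation; `normalize_na` and `normalize_aa` are hypothetical helper names.

```ruby
# Hypothetical helpers mirroring the documented normalization rules.
def normalize_na(raw)
  raw.gsub(/\s+/, '').downcase   # nucleic acid: no whitespace, lowercase
end

def normalize_aa(raw)
  raw.gsub(/\s+/, '').upcase     # amino acid: no whitespace, uppercase
end

puts normalize_na("ATG CAT\nGC")   #=> atgcatgc
puts normalize_aa("m k v\tl")      #=> MKVL
```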

#randomize(hash = nil) ⇒ `Object`

Returns a randomized sequence. The default is to retain the same base/residue composition as the original. If a hash of base/residue counts is given, the new sequence will be based on that hash composition. If a block is given, each randomly selected position is passed to the block, and the method returns an empty sequence instead of the randomized one. In all cases, the original sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
puts s.randomize                        #=> "tcag"  (for example)

new_composition = {'a' => 2, 't' => 2}
puts s.randomize(new_composition)       #=> "ttaa"  (for example)

count = 0
s.randomize { |x| count += 1 }
puts count                              #=> 4

Arguments:

(optional) hash: Hash object

Returns: new Bio::Sequence::NA/AA object

# File 'lib/bio/sequence/common.rb', line 243

def randomize(hash = nil)
  if hash
    tmp = ''
    hash.each {|k, v|
      tmp += k * v.to_i
    }
  else
    tmp = self
  end
  seq = self.class.new(tmp)
  # Reference: http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
  seq.length.downto(2) do |n|
    k = rand(n)
    c = seq[n - 1]
    seq[n - 1] = seq[k]
    seq[k] = c
  end
  if block_given? then
    (0...seq.length).each do |i|
      yield seq[i, 1]
    end
    return self.class.new('')
  else
    return seq
  end
end
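
The shuffle in the listing above is a Fisher-Yates shuffle. The following is a standalone sketch on a plain String, without BioRuby; `shuffle_residues` and its `rng` parameter are hypothetical names, and the seeded `Random` is used here only to make a run repeatable.

```ruby
# Hypothetical helper: Fisher-Yates shuffle of a String's characters.
def shuffle_residues(seq, rng = Random.new)
  out = seq.dup                              # leave the input untouched
  out.length.downto(2) do |n|
    k = rng.rand(n)                          # random index with 0 <= k < n
    out[n - 1], out[k] = out[k], out[n - 1]  # swap positions n-1 and k
  end
  out
end

original = 'aattggcc'
shuffled = shuffle_residues(original, Random.new(42))
puts shuffled.chars.sort == original.chars.sort   #=> true
puts original                                     #=> aattggcc
```

Shuffling only reorders positions, so the composition of the result always matches the input.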

#seq ⇒ `Object`

Create a new sequence based on the current sequence. The original sequence is unchanged.

s = Bio::Sequence::NA.new('atgc')
s2 = s.seq
puts s2                                 #=> 'atgc'

Returns: new Bio::Sequence::NA/AA object




# File 'lib/bio/sequence/common.rb', line 65

def seq
  self.class.new(self)
end

#splice(position) ⇒ `Object` Also known as: splicing

Return a new sequence extracted from the original using a GenBank style position string. See also documentation for the Bio::Location class.

s = Bio::Sequence::NA.new('atgcatgcatgcatgc')
puts s.splice('1..3')                           #=> "atg"
puts s.splice('join(1..3,8..10)')               #=> "atgcat"
puts s.splice('complement(1..3)')               #=> "cat"
puts s.splice('complement(join(1..3,8..10))')   #=> "atgcat"

Note that ‘complement’ed GenBank position strings have no effect on Bio::Sequence::AA objects.

Arguments:

(required) position: String or Bio::Locations object

Returns: Bio::Sequence::NA/AA object

# File 'lib/bio/sequence/common.rb', line 285

def splice(position)
  unless position.is_a?(Locations) then
    position = Locations.new(position)
  end
  s = ''
  position.each do |location|
    if location.sequence
      s << location.sequence
    else
      exon = self.subseq(location.from, location.to)
      begin
        exon.complement! if location.strand < 0
      rescue NameError
      end
      s << exon
    end
  end
  return self.class.new(s)
end
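
The example results above can be reproduced by hand on a plain String. The following sketch does not parse GenBank location strings and does not use BioRuby; it only illustrates one-based joining of ranges and reverse-complementing. `join_ranges` and `reverse_complement` are hypothetical helper names.

```ruby
# Hypothetical helper: concatenate one-based, inclusive ranges of a String.
def join_ranges(seq, ranges)
  ranges.map { |r| seq[(r.begin - 1)..(r.end - 1)] }.join
end

# Hypothetical helper: reverse complement of a lowercase DNA String.
def reverse_complement(dna)
  dna.reverse.tr('atgc', 'tacg')
end

s = 'atgcatgcatgcatgc'
puts join_ranges(s, [1..3])                             #=> atg
puts join_ranges(s, [1..3, 8..10])                      #=> atgcat
puts reverse_complement(join_ranges(s, [1..3]))         #=> cat
puts reverse_complement(join_ranges(s, [1..3, 8..10]))  #=> atgcat
```

The last result equals the uncomplemented join because "atgcat" is its own reverse complement.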

#subseq(s = 1, e = self.length) ⇒ `Object`

Returns a new sequence containing the subsequence identified by the start and end numbers given as parameters. Important: Biological sequence numbering conventions (one-based) rather than Ruby’s (zero-based) numbering conventions are used.

s = Bio::Sequence::NA.new('atggaatga')
puts s.subseq(1,3)                      #=> "atg"

Start defaults to 1 and end defaults to the entire existing string, so subseq called without any parameters simply returns a new sequence identical to the existing sequence.

puts s.subseq                           #=> "atggaatga"

Arguments:

(optional) s(start): Integer (default 1)
(optional) e(end): Integer (default current sequence length)

Returns: new Bio::Sequence::NA/AA object

# File 'lib/bio/sequence/common.rb', line 143

def subseq(s = 1, e = self.length)
  raise "Error: start/end position must be a positive integer" unless s > 0 and e > 0
  s -= 1
  e -= 1
  self[s..e]
end
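
The one-based to zero-based conversion in the listing above can be sketched on a plain String, without BioRuby. `one_based_slice` is a hypothetical helper name; it raises ArgumentError where the listing raises a RuntimeError.

```ruby
# Hypothetical helper: inclusive, one-based slice of a String.
def one_based_slice(seq, first = 1, last = seq.length)
  unless first > 0 && last > 0
    raise ArgumentError, 'start/end position must be a positive integer'
  end
  seq[(first - 1)..(last - 1)]   # shift both ends to zero-based indices
end

s = 'atggaatga'
puts one_based_slice(s, 1, 3)   #=> atg
puts one_based_slice(s, 4, 6)   #=> gaa
puts one_based_slice(s)         #=> atggaatga
```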

#to_fasta(header = '', width = nil) ⇒ `Object`

Bio::Sequence#to_fasta is DEPRECATED. Do not use Bio::Sequence#to_fasta! Use Bio::Sequence#output instead. Note that Bio::Sequence::NA#to_fasta, Bio::Sequence::AA#to_fasta, and Bio::Sequence::Generic#to_fasta can still be used, because there are no alternative methods.

Output the FASTA format string of the sequence. The 1st argument is used as the comment string. If the 2nd option is given, the output sequence will be folded.

Arguments:

(optional) header: String object
(optional) width: Integer object (default nil)

Returns: String

# File 'lib/bio/sequence/compat.rb', line 54

def to_fasta(header = '', width = nil)
  warn "Bio::Sequence#to_fasta is obsolete. Use Bio::Sequence#output(:fasta) instead" if $DEBUG
  ">#{header}\n" +
  if width
    self.to_s.gsub(Regexp.new(".{1,#{width}}"), "\\0\n")
  else
    self.to_s + "\n"
  end
end
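
The folding step in the listing above relies on `gsub` with a bounded repetition. The following is a standalone sketch on a plain String, without BioRuby; `fold_fasta` is a hypothetical helper name.

```ruby
# Hypothetical helper: build a FASTA record, optionally folding the sequence.
def fold_fasta(seq, header = '', width = nil)
  body =
    if width
      # Each run of up to `width` characters is followed by a newline.
      seq.gsub(/.{1,#{width}}/, "\\0\n")
    else
      seq + "\n"
    end
  ">#{header}\n" + body
end

print fold_fasta('atgcatgcat', 'example', 4)
# >example
# atgc
# atgc
# at
```

The final, shorter line is still matched because the pattern accepts between 1 and `width` characters.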

#to_s ⇒ `Object` Also known as: to_str

Return sequence as String. The original sequence is unchanged.

s = Bio::Sequence::NA.new('atgc')
puts s.to_s                             #=> 'atgc'
puts s.to_s.class                       #=> String
puts s                                  #=> 'atgc'
puts s.class                            #=> Bio::Sequence::NA

Returns: String object




# File 'lib/bio/sequence/common.rb', line 52

def to_s
  String.new(self)
end

#total(hash) ⇒ `Object`

Returns a float total value for the sequence given a hash of base or residue values:

values = {'a' => 0.1, 't' => 0.2, 'g' => 0.3, 'c' => 0.4}
s = Bio::Sequence::NA.new('atgc')
puts s.total(values)                    #=> 1.0

Arguments:

(required) hash: Hash object

Returns: Float object

# File 'lib/bio/sequence/common.rb', line 198

def total(hash)
  hash.default = 0.0 unless hash.default
  sum = 0.0
  self.each_byte do |x|
    begin
      sum += hash[x.chr]
    end
  end
  return sum
end
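
Note that the listing above sets `hash.default = 0.0` on the caller's hash when it has no default, so the argument is modified. The following standalone sketch computes the same kind of total on a plain String without touching the hash; `weighted_total` is a hypothetical helper name, and BioRuby is not required.

```ruby
# Hypothetical helper: sum per-residue values, treating unknown residues as 0.0.
def weighted_total(seq, weights)
  seq.each_char.sum(0.0) { |ch| weights.fetch(ch, 0.0) }
end

weights = { 'a' => 0.5, 't' => 0.25 }
puts weighted_total('atgn', weights)   #=> 0.75
puts weights.default.inspect           #=> nil
```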

#window_search(window_size, step_size = 1) ⇒ `Object`

This method steps through a sequence in steps of ‘step_size’ by subsequences of ‘window_size’. Typically used with a block. Any sequence remaining after the last window is returned.

Prints average GC% on each 100bp

s.window_search(100) do |subseq|
  puts subseq.gc
end

Prints every translated peptide (length 5aa) in the same frame

s.window_search(15, 3) do |subseq|
  puts subseq.translate
end

Split genome sequence by 10000bp with 1000bp overlap in fasta format

i = 1
remainder = s.window_search(10000, 9000) do |subseq|
  puts subseq.to_fasta("segment #{i}", 60)
  i += 1
end
puts remainder.to_fasta("segment #{i}", 60)

Arguments:

(required) window_size: Integer
(optional) step_size: Integer (default 1)

Returns: new Bio::Sequence::NA/AA object

# File 'lib/bio/sequence/common.rb', line 179

def window_search(window_size, step_size = 1)
  last_step = 0
  0.step(self.length - window_size, step_size) do |i|
    yield self[i, window_size]
    last_step = i
  end
  return self[last_step + window_size .. -1]
end
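
The stepping and the returned remainder can be seen on a short plain String. The following standalone sketch follows the same logic as the listing above but takes the sequence as an argument; `each_window` is a hypothetical helper name, and BioRuby is not required.

```ruby
# Hypothetical helper: yield fixed-size windows, return what follows the last one.
def each_window(seq, window_size, step_size = 1)
  last_step = 0
  0.step(seq.length - window_size, step_size) do |i|
    yield seq[i, window_size]
    last_step = i
  end
  seq[(last_step + window_size)..-1]   # tail after the last full window
end

windows = []
rest = each_window('atgcatgcat', 4, 4) { |w| windows << w }
puts windows.join(',')   #=> atgc,atgc
puts rest                #=> at
```

With a step of 3 on the same input, the windows are "atgc", "catg", and "gcat", and the remainder is an empty string because the last window ends at the final position.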
