Module: Bio::Sequence::Common

Included in: AA, Generic, NA

Defined in: lib/bio/sequence/common.rb, lib/bio/sequence/compat.rb

Overview

DESCRIPTION

Bio::Sequence::Common is a mixin that implements the methods common to Bio::Sequence::AA and Bio::Sequence::NA. All of these methods are available to both amino acid and nucleic acid sequences and, by encapsulation, to Bio::Sequence objects as well.

USAGE

# Create a sequence
dna = Bio::Sequence.auto('atgcatgcatgc')

# Splice out a subsequence using a GenBank-style location string
puts dna.splice('complement(1..4)')

# What is the base composition?
puts dna.composition

# Create a random sequence with the composition of a current sequence
puts dna.randomize

Class Method Summary

.randomize(*arg, &block) ⇒ Object

Generate a new random sequence with the given frequency of bases.

Instance Method Summary

#+(*arg) ⇒ Object

Create a new sequence by adding to an existing sequence.
#<<(*arg) ⇒ Object
#composition ⇒ Object

Returns a hash of the occurrence counts for each residue or base.
#concat(*arg) ⇒ Object

Add new data to the end of the current sequence.
#normalize! ⇒ Object (also: #seq!)

Normalize the current sequence, removing all whitespace and transforming all positions to uppercase if the sequence is AA or transforming all positions to lowercase if the sequence is NA.
#randomize(hash = nil) ⇒ Object

Returns a randomized sequence.
#seq ⇒ Object

Create a new sequence based on the current sequence.
#splice(position) ⇒ Object (also: #splicing)

Return a new sequence extracted from the original using a GenBank style position string.
#subseq(s = 1, e = self.length) ⇒ Object

Returns a new sequence containing the subsequence identified by the start and end numbers given as parameters.
#to_fasta(header = '', width = nil) ⇒ Object

DEPRECATED. Do not use! Use Bio::Sequence#output instead.
#to_s ⇒ Object (also: #to_str)

Return sequence as String.
#total(hash) ⇒ Object

Returns a float total value for the sequence, given a hash of base or residue values.
#window_search(window_size, step_size = 1) ⇒ Object

This method steps through a sequence in steps of ‘step_size’ by subsequences of ‘window_size’.

Class Method Details

.randomize(*arg, &block) ⇒ `Object`

Generate a new random sequence with the given frequency of bases. The sequence length is determined by their cumulative sum. (See also Bio::Sequence::Common#randomize which creates a new randomized sequence object using the base composition of an existing sequence instance).

counts = {'R'=>1,'L'=>2,'E'=>3,'A'=>4}
puts Bio::Sequence::AA.randomize(counts)  #=> "AAEAELALRE" (for example)

You may also feed the output of randomize into a block

actual_counts = {'R'=>0,'L'=>0,'E'=>0,'A'=>0}
Bio::Sequence::AA.randomize(counts) {|x| actual_counts[x] += 1}
actual_counts                     #=> {"A"=>4, "L"=>2, "E"=>3, "R"=>1}

Arguments:

(optional) hash: Hash object

Returns: Bio::Sequence::NA/AA object




# File 'lib/bio/sequence/common.rb', line 289

def self.randomize(*arg, &block)
  self.new('').randomize(*arg, &block)
end

Instance Method Details

#+(*arg) ⇒ `Object`

Create a new sequence by adding to an existing sequence. The existing sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
s2 = s + 'atgc'
puts s2                                 #=> "atgcatgc"
puts s                                  #=> "atgc"

The new sequence is of the same class as the existing sequence if the new data was added to an existing sequence,

puts s2.class == s.class                #=> true

but if an existing sequence is added to a String, the result is a String

s3 = 'atgc' + s
puts s3.class                           #=> String

Returns: new Bio::Sequence::NA/AA or String object




# File 'lib/bio/sequence/common.rb', line 121

def +(*arg)
  self.class.new(super(*arg))
end
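
The asymmetry between `s + 'atgc'` and `'atgc' + s` follows from which object receives the `+` call. The sketch below reproduces it with a stand-in String subclass; `ToySeq` is a hypothetical name used only for illustration and is not part of BioRuby.

```ruby
# ToySeq is a hypothetical stand-in for a sequence class (not a BioRuby class).
class ToySeq < String
  # Wrap the plain concatenation result in the receiver's own class.
  def +(other)
    self.class.new(super(other))
  end
end

s = ToySeq.new('atgc')
left  = s + 'atgc'      # receiver is a ToySeq, so the override runs
right = 'atgc' + s      # receiver is a String, so String#+ runs

puts left.class         # ToySeq
puts right.class        # String
puts s                  # atgc (the original is not modified)
```

Because Ruby dispatches `+` on the left operand, only the first form passes through the overriding method.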

#<<(*arg) ⇒ `Object`




# File 'lib/bio/sequence/common.rb', line 98

def <<(*arg)
  concat(*arg)
end

#composition ⇒ `Object`

Returns a hash of the occurrence counts for each residue or base.

s = Bio::Sequence::NA.new('atgc')
puts s.composition              #=> {"a"=>1, "c"=>1, "g"=>1, "t"=>1}

Returns: Hash object

# File 'lib/bio/sequence/common.rb', line 215

def composition
  count = Hash.new(0)
  self.scan(/./) do |x|
    count[x] += 1
  end
  return count
end
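
The counting idiom above can be tried on a plain String without BioRuby installed. This is a minimal sketch; `count_symbols` is a hypothetical helper name, and it uses `each_char` rather than `scan(/./)`, so unlike the listing it would also count newline characters.

```ruby
# Hypothetical helper: count how often each character occurs in a string.
def count_symbols(seq)
  counts = Hash.new(0)                      # missing keys start at 0
  seq.each_char { |ch| counts[ch] += 1 }
  counts
end

counts = count_symbols('atgcatgg')
puts counts['a']   # 2
puts counts['g']   # 3
puts counts['n']   # 0 (absent symbols fall back to the default)
```

On Ruby 2.7 or later, `seq.chars.tally` produces the same counts, although the resulting hash has no default value of 0.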

#concat(*arg) ⇒ `Object`

Add new data to the end of the current sequence. The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
s << 'atgc'
puts s                                  #=> "atgcatgc"
s << s
puts s                                  #=> "atgcatgcatgcatgc"

Returns: current Bio::Sequence::NA/AA object (modified)




# File 'lib/bio/sequence/common.rb', line 94

def concat(*arg)
  super(self.class.new(*arg))
end
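
To illustrate the in-place behaviour, the following sketch uses a stand-in String subclass. `ToySeq` is a hypothetical class, not part of BioRuby; it takes a single argument for simplicity, whereas the listing above forwards `*arg`.

```ruby
# ToySeq is a hypothetical stand-in for a sequence class (not a BioRuby class).
class ToySeq < String
  # Convert the argument to the receiver's class, then append in place.
  def concat(other)
    super(self.class.new(other))
  end

  # << simply delegates to concat.
  def <<(other)
    concat(other)
  end
end

s = ToySeq.new('atgc')
before_id = s.object_id
s << 'atgc'
puts s                          # atgcatgc
s << s
puts s                          # atgcatgcatgcatgc
puts s.object_id == before_id   # true (same object, modified in place)
```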

#normalize! ⇒ `Object` Also known as: seq!

Normalize the current sequence, removing all whitespace and transforming all positions to uppercase if the sequence is AA or transforming all positions to lowercase if the sequence is NA. The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
s.normalize!

Returns: current Bio::Sequence::NA/AA object (modified)

# File 'lib/bio/sequence/common.rb', line 78

def normalize!
  initialize(self)
  self
end
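
The normalization rules described above can be sketched on plain Strings as follows. The helper names `normalize_na` and `normalize_aa` are hypothetical, and the sketch returns new strings instead of modifying the receiver the way `normalize!` does.

```ruby
# Hypothetical helpers that mirror the documented rules on plain strings.

# Nucleic acid: strip all whitespace, then lowercase.
def normalize_na(raw)
  raw.gsub(/\s+/, '').downcase
end

# Amino acid: strip all whitespace, then uppercase.
def normalize_aa(raw)
  raw.gsub(/\s+/, '').upcase
end

puts normalize_na("ATG C\nat gc")   # atgcatgc
puts normalize_aa("m k v\tl")       # MKVL
```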

#randomize(hash = nil) ⇒ `Object`

Returns a randomized sequence. The default is to retain the same base/residue composition as the original. If a hash of base/residue counts is given, the new sequence will be based on that hash composition. If a block is given, each new randomly selected position will be passed into the block. In all cases, the original sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
puts s.randomize                        #=> "tcag"  (for example)

new_composition = {'a' => 2, 't' => 2}
puts s.randomize(new_composition)       #=> "ttaa"  (for example)

count = 0
s.randomize { |x| count += 1 }
puts count                              #=> 4

Arguments:

(optional) hash: Hash object

Returns: new Bio::Sequence::NA/AA object

# File 'lib/bio/sequence/common.rb', line 243

def randomize(hash = nil)
  length = self.length
  if hash
    length = 0
    count = hash.clone
    count.each_value {|x| length += x}
  else
    count = self.composition
  end

  seq = ''
  tmp = {}
  length.times do
    count.each do |k, v|
      tmp[k] = v * rand
    end
    max = tmp.max {|a, b| a[1] <=> b[1]}
    count[max.first] -= 1

    if block_given?
      yield max.first
    else
      seq += max.first
    end
  end
  return self.class.new(seq)
end
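
The selection loop in the listing scores each symbol by its remaining count multiplied by a random number, takes the highest score, and decrements that count. The sketch below is a variation on that idea for plain Strings, not the library's code: `shuffle_by_counts` is a hypothetical helper, it skips symbols whose count has reached zero, and it accepts a seeded `Random` so runs can be repeated.

```ruby
# Hypothetical helper: build a random string that uses each symbol exactly
# as many times as the counts hash says.
def shuffle_by_counts(counts, rng = Random.new)
  remaining = counts.dup                 # leave the caller's hash untouched
  out = +''                              # unfrozen empty string
  remaining.values.sum.times do
    # Score only the symbols that still have copies left.
    scored = remaining.select { |_, n| n > 0 }
                      .map { |sym, n| [sym, n * rng.rand] }
    pick = scored.max_by { |_, score| score }.first
    remaining[pick] -= 1
    out << pick
  end
  out
end

result = shuffle_by_counts({ 'a' => 2, 't' => 2, 'g' => 1 }, Random.new(42))
puts result.length            # 5
puts result.chars.sort.join   # aagtt
```

Taking the maximum of count times a random number favours abundant symbols, but the permutations are not equally likely. If a uniform permutation is required, expanding the counts into an array and calling `Array#shuffle` is one alternative.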

#seq ⇒ `Object`

Create a new sequence based on the current sequence. The original sequence is unchanged.

s = Bio::Sequence::NA.new('atgc')
s2 = s.seq
puts s2                                 #=> 'atgc'

Returns: new Bio::Sequence::NA/AA object




# File 'lib/bio/sequence/common.rb', line 65

def seq
  self.class.new(self)
end

#splice(position) ⇒ `Object` Also known as: splicing

Return a new sequence extracted from the original using a GenBank style position string. See also documentation for the Bio::Location class.

s = Bio::Sequence::NA.new('atgcatgcatgcatgc')
puts s.splice('1..3')                           #=> "atg"
puts s.splice('join(1..3,8..10)')               #=> "atgcat"
puts s.splice('complement(1..3)')               #=> "cat"
puts s.splice('complement(join(1..3,8..10))')   #=> "atgcat"

Note that ‘complement’ in a GenBank position string has no effect on Bio::Sequence::AA objects.

Arguments:

(required) position: String or Bio::Locations object

Returns: Bio::Sequence::NA/AA object

# File 'lib/bio/sequence/common.rb', line 308

def splice(position)
  unless position.is_a?(Locations) then
    position = Locations.new(position)
  end
  s = ''
  position.each do |location|
    if location.sequence
      s << location.sequence
    else
      exon = self.subseq(location.from, location.to)
      begin
        exon.complement! if location.strand < 0
      rescue NameError
      end
      s << exon
    end
  end
  return self.class.new(s)
end
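
The two building blocks behind the examples above, joining one-based ranges and reverse-complementing a segment, can be sketched on a plain String. Both helper names are hypothetical, the sketch does not parse GenBank location strings, and it assumes lowercase, unambiguous DNA.

```ruby
# Hypothetical helper: reverse-complement a lowercase DNA string.
def reverse_complement(dna)
  dna.reverse.tr('atgc', 'tacg')
end

# Hypothetical helper: extract one-based inclusive ranges and join them.
def join_ranges(dna, ranges)
  ranges.map { |from, to| dna[(from - 1)..(to - 1)] }.join
end

dna = 'atgcatgcatgcatgc'
puts join_ranges(dna, [[1, 3]])                                # atg
puts join_ranges(dna, [[1, 3], [8, 10]])                       # atgcat
puts reverse_complement(join_ranges(dna, [[1, 3]]))            # cat
puts reverse_complement(join_ranges(dna, [[1, 3], [8, 10]]))   # atgcat
```

The last result equals the joined sequence because "atgcat" happens to be its own reverse complement.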

#subseq(s = 1, e = self.length) ⇒ `Object`

Returns a new sequence containing the subsequence identified by the start and end numbers given as parameters. Important: Biological sequence numbering conventions (one-based) rather than ruby’s (zero-based) numbering conventions are used.

s = Bio::Sequence::NA.new('atggaatga')
puts s.subseq(1,3)                      #=> "atg"

Start defaults to 1 and end defaults to the entire existing string, so subseq called without any parameters simply returns a new sequence identical to the existing sequence.

puts s.subseq                           #=> "atggaatga"

Arguments:

(optional) s(start): Integer (default 1)
(optional) e(end): Integer (default current sequence length)

Returns: new Bio::Sequence::NA/AA object

# File 'lib/bio/sequence/common.rb', line 143

def subseq(s = 1, e = self.length)
  raise "Error: start/end position must be a positive integer" unless s > 0 and e > 0
  s -= 1
  e -= 1
  self[s..e]
end
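
The one-based to zero-based conversion can be checked on a plain String. In this sketch `one_based_slice` is a hypothetical helper, and it raises `ArgumentError` rather than the generic error used in the listing.

```ruby
# Hypothetical helper: slice a string using one-based, inclusive positions.
def one_based_slice(str, first = 1, last = str.length)
  unless first > 0 && last > 0
    raise ArgumentError, 'start/end position must be a positive integer'
  end
  str[(first - 1)..(last - 1)]   # shift both ends to Ruby's zero-based indexing
end

seq = 'atggaatga'
puts one_based_slice(seq, 1, 3)   # atg
puts one_based_slice(seq, 4, 6)   # gaa
puts one_based_slice(seq)         # atggaatga
```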

#to_fasta(header = '', width = nil) ⇒ `Object`

DEPRECATED. Do not use! Use Bio::Sequence#output instead.

Output the FASTA format string of the sequence. The 1st argument is used as the comment string. If the 2nd option is given, the output sequence will be folded.

Arguments:

(optional) header: String object
(optional) width: Fixnum object (default nil)

Returns: String

# File 'lib/bio/sequence/compat.rb', line 50

def to_fasta(header = '', width = nil)
  warn "Bio::Sequence#to_fasta is obsolete. Use Bio::Sequence#output(:fasta) instead" if $DEBUG
  ">#{header}\n" +
  if width
    self.to_s.gsub(Regexp.new(".{1,#{width}}"), "\\0\n")
  else
    self.to_s + "\n"
  end
end
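
The folding step can be illustrated on a plain String. This is a sketch only: `fasta_record` is a hypothetical helper, and it splits the sequence with `scan` rather than the `gsub` call in the listing, so it is not a drop-in replacement for the deprecated method.

```ruby
# Hypothetical helper: format a header and sequence as a FASTA record,
# folding the sequence to at most `width` characters per line when given.
def fasta_record(header, seq, width = nil)
  body = width ? seq.scan(/.{1,#{width}}/).join("\n") : seq
  ">#{header}\n#{body}\n"
end

puts fasta_record('example', 'atgcatgcat', 4)
# >example
# atgc
# atgc
# at
```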

#to_s ⇒ `Object` Also known as: to_str

Return sequence as String. The original sequence is unchanged.

s = Bio::Sequence::NA.new('atgc')
puts s.to_s                             #=> 'atgc'
puts s.to_s.class                       #=> String
puts s                                  #=> 'atgc'
puts s.class                            #=> Bio::Sequence::NA

Returns: String object




# File 'lib/bio/sequence/common.rb', line 52

def to_s
  String.new(self)
end
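
The effect of `String.new(self)` can be seen with a stand-in subclass: the result is a separate, plain String with the same content. `ToySeq` is a hypothetical class used for illustration, not part of BioRuby.

```ruby
# ToySeq is a hypothetical stand-in for a sequence class (not a BioRuby class).
class ToySeq < String
  # Return a plain String copy of the receiver.
  def to_s
    String.new(self)
  end
end

s = ToySeq.new('atgc')
plain = s.to_s
puts plain.class       # String
puts s.class           # ToySeq
puts plain == s        # true  (same characters)
puts plain.equal?(s)   # false (a different object)
```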

#total(hash) ⇒ `Object`

Returns a float total value for the sequence, given a hash of base or residue values.

values = {'a' => 0.1, 't' => 0.2, 'g' => 0.3, 'c' => 0.4}
s = Bio::Sequence::NA.new('atgc')
puts s.total(values)                    #=> 1.0

Arguments:

(required) hash: Hash object

Returns: Float object

# File 'lib/bio/sequence/common.rb', line 198

def total(hash)
  hash.default = 0.0 unless hash.default
  sum = 0.0
  self.each_byte do |x|
    begin
      sum += hash[x.chr]
    end
  end
  return sum
end
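
A plain-Ruby sketch of the same summation is shown below. `weighted_total` is a hypothetical helper; unlike the listing, it looks values up with `fetch` and a fallback of 0.0, so the caller's hash is not given a new default.

```ruby
# Hypothetical helper: sum the per-symbol values over a string.
# Symbols missing from the hash contribute 0.0.
def weighted_total(seq, weights)
  seq.each_char.sum(0.0) { |ch| weights.fetch(ch, 0.0) }
end

values = { 'a' => 0.1, 't' => 0.2, 'g' => 0.3, 'c' => 0.4 }
puts format('%.2f', weighted_total('atgc', values))   # 1.00
puts format('%.2f', weighted_total('aann', values))   # 0.20
```

Formatting the result avoids relying on an exact floating-point representation of the sum.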

#window_search(window_size, step_size = 1) ⇒ `Object`

This method steps through a sequence in steps of ‘step_size’ by subsequences of ‘window_size’. Typically used with a block. Any remaining sequence at the terminal end will be returned.

Prints average GC% on each 100bp

s.window_search(100) do |subseq|
  puts subseq.gc
end

Prints every translated peptide (length 5aa) in the same frame

s.window_search(15, 3) do |subseq|
  puts subseq.translate
end

Split genome sequence by 10000bp with 1000bp overlap in fasta format

i = 1
remainder = s.window_search(10000, 9000) do |subseq|
  puts subseq.to_fasta("segment #{i}", 60)
  i += 1
end
puts remainder.to_fasta("segment #{i}", 60)

Arguments:

(required) window_size: Fixnum
(optional) step_size: Fixnum (default 1)

Returns: new Bio::Sequence::NA/AA object

# File 'lib/bio/sequence/common.rb', line 179

def window_search(window_size, step_size = 1)
  last_step = 0
  0.step(self.length - window_size, step_size) do |i|
    yield self[i, window_size]
    last_step = i
  end
  return self[last_step + window_size .. -1]
end
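
The windowing logic, including the returned remainder, can be followed on a short plain String. `each_window` is a hypothetical helper that mirrors the listing above but takes the sequence as an argument.

```ruby
# Hypothetical helper: yield each window of `window_size` characters,
# advancing by `step_size`, and return whatever follows the last window.
def each_window(seq, window_size, step_size = 1)
  last_start = 0
  0.step(seq.length - window_size, step_size) do |i|
    yield seq[i, window_size]
    last_start = i
  end
  seq[(last_start + window_size)..-1]
end

windows = []
leftover = each_window('atgcatgcatg', 4, 3) { |w| windows << w }
puts windows.join(',')   # atgc,catg,gcat
puts leftover            # g
```

With an 11-character sequence, windows start at offsets 0, 3 and 6, so the final window ends at offset 9 and the single character at offset 10 is returned as the remainder.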
