Module: Bio::Sequence::Common

Included in:
AA, Generic, NA
Defined in:
lib/bio/sequence/common.rb,
lib/bio/sequence/compat.rb

Overview

DESCRIPTION

Bio::Sequence::Common is a mixin implementing methods common to Bio::Sequence::AA and Bio::Sequence::NA. All of these methods are available to both amino acid and nucleic acid sequences and, by encapsulation, to Bio::Sequence objects as well.

USAGE

# Create a sequence
dna = Bio::Sequence.auto('atgcatgcatgc')

# Splice out a subsequence using a GenBank-style location string
puts dna.splice('complement(1..4)')

# What is the base composition?
puts dna.composition

# Create a random sequence with the composition of a current sequence
puts dna.randomize

Instance Method Details

#+(*arg) ⇒ Object

Create a new sequence by adding to an existing sequence. The existing sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
s2 = s + 'atgc'
puts s2                                 #=> "atgcatgc"
puts s                                  #=> "atgc"

If new data is added to an existing sequence, the new sequence has the same class as the existing sequence,

puts s2.class == s.class                #=> true

but if an existing sequence is added to a String, the result is a String

s3 = 'atgc' + s
puts s3.class                           #=> String

Returns

new Bio::Sequence::NA/AA or String object
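
The class-preserving behaviour described above can be sketched without BioRuby installed. `SketchSeq` below is a hypothetical String subclass, not part of the library; it only mirrors the `+` override shown in the source listing for this method.

```ruby
# Hypothetical stand-in for a sequence class; not part of BioRuby.
class SketchSeq < String
  # Re-wrap the plain String that String#+ returns in the receiver's class.
  def +(*arg)
    self.class.new(super(*arg))
  end
end

s  = SketchSeq.new('atgc')
s2 = s + 'atgc'        # receiver is a SketchSeq, so the override runs
s3 = 'atgc' + s        # receiver is a plain String, so String#+ runs

puts s2.class          # SketchSeq
puts s3.class          # String
puts s                 # atgc (unchanged)
```

The asymmetry follows from Ruby dispatching `+` on the left-hand operand: only when the sequence is the receiver does the override get a chance to re-wrap the result.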



# File 'lib/bio/sequence/common.rb', line 121

def +(*arg)
  self.class.new(super(*arg))
end

#<<(*arg) ⇒ Object

Appends new data to the end of the current sequence, modifying it in place. Equivalent to #concat.


# File 'lib/bio/sequence/common.rb', line 98

def <<(*arg)
  concat(*arg)
end

#composition ⇒ Object

Returns a hash of the occurrence counts for each residue or base.

s = Bio::Sequence::NA.new('atgc')
puts s.composition              #=> {"a"=>1, "c"=>1, "g"=>1, "t"=>1}

Returns

Hash object
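
Because the result is an ordinary Hash, derived statistics are easy to compute from it. The sketch below repeats the counting loop on a plain String so that it runs without BioRuby; `composition_of` and `gc_fraction` are hypothetical helper names, not library methods.

```ruby
# Count each character, as the listing for #composition does.
def composition_of(str)
  count = Hash.new(0)
  str.scan(/./) { |x| count[x] += 1 }
  count
end

# Fraction of g and c among all counted positions (0.0 for an empty string).
def gc_fraction(str)
  count = composition_of(str)
  total = count.values.sum
  return 0.0 if total.zero?
  (count['g'] + count['c']).to_f / total
end

comp = composition_of('atgcgc')
puts comp['g']                 # 2
puts comp['n']                 # 0 (missing keys default to 0)
puts gc_fraction('atgcgc')     # GC share of the six bases
```

Since the hash is created with a default of 0, looking up a base that never occurs is safe and does not add a key.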



# File 'lib/bio/sequence/common.rb', line 215

def composition
  count = Hash.new(0)
  self.scan(/./) do |x|
    count[x] += 1
  end
  return count
end

#concat(*arg) ⇒ Object

Add new data to the end of the current sequence. The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
s << 'atgc'
puts s                                  #=> "atgcatgc"
s << s
puts s                                  #=> "atgcatgcatgcatgc"

Returns

current Bio::Sequence::NA/AA object (modified)



# File 'lib/bio/sequence/common.rb', line 94

def concat(*arg)
  super(self.class.new(*arg))
end

#normalize! ⇒ Object Also known as: seq!

Normalize the current sequence, removing all whitespace and transforming all positions to uppercase if the sequence is AA or transforming all positions to lowercase if the sequence is NA. The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
s.normalize!

Returns

current Bio::Sequence::NA/AA object (modified)
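
As a rough illustration of what normalization amounts to for a nucleic acid sequence, the following sketch applies the two steps named above (whitespace removal and lowercasing) to a plain String in place. `normalize_na!` is a hypothetical helper; the real method delegates to the class initializer instead.

```ruby
# Hypothetical helper: strip whitespace and lowercase, modifying str in place.
def normalize_na!(str)
  str.gsub!(/\s+/, '')   # gsub! returns nil when nothing changed, so its result is ignored
  str.downcase!
  str
end

raw = +"ATG c\nAT gc"    # unary plus yields an unfrozen String
normalize_na!(raw)
puts raw                 # atgcatgc
```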



# File 'lib/bio/sequence/common.rb', line 78

def normalize!
  initialize(self)
  self
end

#randomize(hash = nil) ⇒ Object

Returns a randomized sequence. By default the new sequence keeps the same base/residue composition as the original. If a hash of base/residue counts is given, the new sequence is built from that composition instead. If a block is given, each position of the shuffled sequence is passed to the block in turn, and the method returns an empty sequence rather than the shuffled one. In all cases, the original sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
puts s.randomize                        #=> "tcag"  (for example)

new_composition = {'a' => 2, 't' => 2}
puts s.randomize(new_composition)       #=> "ttaa"  (for example)

count = 0
s.randomize { |x| count += 1 }
puts count                              #=> 4

Arguments:

  • (optional) hash: Hash object

Returns

new Bio::Sequence::NA/AA object
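
The shuffle behind this method is a Fisher-Yates shuffle, as the reference in the source listing indicates. Below is a minimal sketch of that shuffle on a plain String; `shuffle_string` and its `rng` parameter are assumptions added here so that a run can be made reproducible, and are not part of BioRuby.

```ruby
# Hypothetical helper: return a shuffled copy of str (Fisher-Yates).
def shuffle_string(str, rng = Random.new)
  seq = str.dup
  seq.length.downto(2) do |n|
    k = rng.rand(n)                          # random index in 0...n
    seq[n - 1], seq[k] = seq[k], seq[n - 1]  # swap it into the last unshuffled slot
  end
  seq
end

original = 'aattggcc'
shuffled = shuffle_string(original, Random.new(42))

puts shuffled.length                               # 8
puts shuffled.chars.sort == original.chars.sort    # true (composition is preserved)
puts original                                      # aattggcc (unchanged)
```

Shuffling only rearranges existing characters, which is why the composition of the result always matches the input.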



# File 'lib/bio/sequence/common.rb', line 243

def randomize(hash = nil)
  if hash
    tmp = ''
    hash.each {|k, v|
      tmp += k * v.to_i
    }
  else
    tmp = self
  end
  seq = self.class.new(tmp)
  # Reference: http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
  seq.length.downto(2) do |n|
    k = rand(n)
    c = seq[n - 1]
    seq[n - 1] = seq[k]
    seq[k] = c
  end
  if block_given? then
    (0...seq.length).each do |i|
      yield seq[i, 1]
    end
    return self.class.new('')
  else
    return seq
  end
end

#seq ⇒ Object

Create a new sequence based on the current sequence. The original sequence is unchanged.

s = Bio::Sequence::NA.new('atgc')
s2 = s.seq
puts s2                                 #=> 'atgc'

Returns

new Bio::Sequence::NA/AA object



# File 'lib/bio/sequence/common.rb', line 65

def seq
  self.class.new(self)
end

#splice(position) ⇒ Object Also known as: splicing

Returns a new sequence extracted from the original using a GenBank-style position string. See also the documentation for the Bio::Location and Bio::Locations classes.

s = Bio::Sequence::NA.new('atgcatgcatgcatgc')
puts s.splice('1..3')                           #=> "atg"
puts s.splice('join(1..3,8..10)')               #=> "atgcat"
puts s.splice('complement(1..3)')               #=> "cat"
puts s.splice('complement(join(1..3,8..10))')   #=> "atgcat"

Note that complement(...) in a GenBank position string has no effect on Bio::Sequence::AA objects; the subsequence is extracted but not complemented.


Arguments:

  • (required) position: String or Bio::Locations object

Returns

Bio::Sequence::NA/AA object
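
To make the semantics of the position strings in the examples above concrete, here is a deliberately simplified sketch on a plain String. It is not the Bio::Locations parser: `simple_splice` and `revcomp` are hypothetical helpers that understand only `a..b`, `join(...)` and `complement(...)`, assume a lowercase `atgc` alphabet, and do not support a `join` nested inside another `join`.

```ruby
# Reverse complement for a lowercase DNA string (a<->t, g<->c).
def revcomp(dna)
  dna.reverse.tr('atgc', 'tacg')
end

# Hypothetical, simplified evaluator for a small subset of location strings.
def simple_splice(dna, position)
  if (m = position.match(/\Acomplement\((.+)\)\z/))
    revcomp(simple_splice(dna, m[1]))
  elsif (m = position.match(/\Ajoin\((.+)\)\z/))
    m[1].split(',').map { |part| simple_splice(dna, part) }.join
  elsif (m = position.match(/\A(\d+)\.\.(\d+)\z/))
    dna[(m[1].to_i - 1)..(m[2].to_i - 1)]   # one-based, inclusive
  else
    raise ArgumentError, "unsupported location: #{position}"
  end
end

dna = 'atgcatgcatgcatgc'
puts simple_splice(dna, '1..3')                          # atg
puts simple_splice(dna, 'join(1..3,8..10)')              # atgcat
puts simple_splice(dna, 'complement(1..3)')              # cat
puts simple_splice(dna, 'complement(join(1..3,8..10))')  # atgcat
```

The last result equals the plain join only because "atgcat" happens to be its own reverse complement.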



# File 'lib/bio/sequence/common.rb', line 285

def splice(position)
  unless position.is_a?(Locations) then
    position = Locations.new(position)
  end
  s = ''
  position.each do |location|
    if location.sequence
      s << location.sequence
    else
      exon = self.subseq(location.from, location.to)
      begin
        exon.complement! if location.strand < 0
      rescue NameError
      end
      s << exon
    end
  end
  return self.class.new(s)
end

#subseq(s = 1, e = self.length) ⇒ Object

Returns a new sequence containing the subsequence identified by the start and end positions given as parameters. Important: positions follow the biological convention (one-based, inclusive), not Ruby’s zero-based indexing.

s = Bio::Sequence::NA.new('atggaatga')
puts s.subseq(1,3)                      #=> "atg"

Start defaults to 1 and end defaults to the entire existing string, so subseq called without any parameters simply returns a new sequence identical to the existing sequence.

puts s.subseq                           #=> "atggaatga"

Arguments:

  • (optional) s(start): Integer (default 1)

  • (optional) e(end): Integer (default current sequence length)

Returns

new Bio::Sequence::NA/AA object
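
The one-based convention is the detail most likely to cause off-by-one errors, so a small sketch on a plain String may help. `one_based_subseq` is a hypothetical helper that follows the index shift in the source listing for this method; it is not a BioRuby function.

```ruby
# Hypothetical helper: inclusive, one-based slice of a plain String.
def one_based_subseq(str, s = 1, e = str.length)
  raise ArgumentError, 'start/end position must be a positive integer' unless s > 0 && e > 0
  str[(s - 1)..(e - 1)]    # shift both ends down by one for Ruby's zero-based index
end

seq = 'atggaatga'
puts one_based_subseq(seq, 1, 3)   # atg
puts one_based_subseq(seq, 7, 9)   # tga
puts seq[1..3]                     # tgg (zero-based indexing reads a different window)
```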



# File 'lib/bio/sequence/common.rb', line 143

def subseq(s = 1, e = self.length)
  raise "Error: start/end position must be a positive integer" unless s > 0 and e > 0
  s -= 1
  e -= 1
  self[s..e]
end

#to_fasta(header = '', width = nil) ⇒ Object

Bio::Sequence#to_fasta is DEPRECATED. Do not use Bio::Sequence#to_fasta! Use Bio::Sequence#output instead. Note that Bio::Sequence::NA#to_fasta, Bio::Sequence::AA#to_fasta, and Bio::Sequence::Generic#to_fasta can still be used, because there are no alternative methods.

Returns the sequence as a FASTA-format string. The first argument is used as the header (comment) line. If the second argument is given, the sequence is folded into lines of at most that many characters.


Arguments:

  • (optional) header: String object

  • (optional) width: Integer object (default nil)

Returns

String
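
The folding behaviour can be tried on a plain String with the sketch below. `fasta_string` is a hypothetical helper that takes the sequence as its first argument; the regular-expression folding follows the approach in the source listing for this method.

```ruby
# Hypothetical helper: FASTA text with optional line folding.
def fasta_string(seq, header = '', width = nil)
  body =
    if width
      seq.gsub(/.{1,#{width}}/, "\\0\n")   # newline after every chunk of up to width characters
    else
      seq + "\n"
    end
  ">#{header}\n" + body
end

puts fasta_string('atgcatgcat', 'seq1', 4)
# >seq1
# atgc
# atgc
# at
```

The `{1,width}` quantifier lets the final chunk be shorter than the requested width, so no residues are dropped.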



# File 'lib/bio/sequence/compat.rb', line 49

def to_fasta(header = '', width = nil)
  warn "Bio::Sequence#to_fasta is obsolete. Use Bio::Sequence#output(:fasta) instead" if $DEBUG
  ">#{header}\n" +
  if width
    self.to_s.gsub(Regexp.new(".{1,#{width}}"), "\\0\n")
  else
    self.to_s + "\n"
  end
end

#to_s ⇒ Object Also known as: to_str

Return sequence as String. The original sequence is unchanged.

s = Bio::Sequence::NA.new('atgc')
puts s.to_s                             #=> 'atgc'
puts s.to_s.class                       #=> String
puts s                                  #=> 'atgc'
puts s.class                            #=> Bio::Sequence::NA

Returns

String object



# File 'lib/bio/sequence/common.rb', line 52

def to_s
  String.new(self)
end

#total(hash) ⇒ Object

Returns the sum, as a Float, of the per-base or per-residue values given in a hash. Characters missing from the hash count as 0.0 unless the hash defines its own default.

values = {'a' => 0.1, 't' => 0.2, 'g' => 0.3, 'c' => 0.4}
s = Bio::Sequence::NA.new('atgc')
puts s.total(values)                    #=> 1.0

Arguments:

  • (required) hash: Hash object

Returns

Float object
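
A minimal sketch of the same summation on a plain String follows. `weighted_total` is a hypothetical helper; unlike the library method, which sets a default on the hash it is given, this version uses `fetch` with a fallback so the caller's hash is not changed.

```ruby
# Hypothetical helper: sum per-character values, treating unknown characters as 0.0.
def weighted_total(str, values)
  str.each_char.sum(0.0) { |x| values.fetch(x, 0.0) }
end

values = { 'a' => 1.5, 't' => 0.5 }
puts weighted_total('atta', values)    # 4.0
puts weighted_total('atn', values)     # 2.0 ('n' has no entry and adds nothing)
puts values.default.inspect            # nil (the caller's hash is left untouched)
```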



# File 'lib/bio/sequence/common.rb', line 198

def total(hash)
  hash.default = 0.0 unless hash.default
  sum = 0.0
  self.each_byte do |x|
    begin
      sum += hash[x.chr]
    end
  end
  return sum
end

#window_search(window_size, step_size = 1) ⇒ Object

Steps through the sequence in increments of ‘step_size’, yielding each subsequence of length ‘window_size’. Typically used with a block. Any sequence remaining after the last window is returned.

Prints average GC% on each 100bp

s.window_search(100) do |subseq|
  puts subseq.gc_percent
end

Prints every translated peptide (length 5aa) in the same frame

s.window_search(15, 3) do |subseq|
  puts subseq.translate
end

Split genome sequence by 10000bp with 1000bp overlap in fasta format

i = 1
remainder = s.window_search(10000, 9000) do |subseq|
  puts subseq.to_fasta("segment #{i}", 60)
  i += 1
end
puts remainder.to_fasta("segment #{i}", 60)

Arguments:

  • (required) window_size: Integer

  • (optional) step_size: Integer (default 1)

Returns

new Bio::Sequence::NA/AA object
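
How the windows and the returned remainder relate is easiest to see on a short input. The sketch below applies the same stepping logic as the source listing for this method to a plain String; `each_window` is a hypothetical helper name, not a BioRuby method.

```ruby
# Hypothetical helper: yield each window, return whatever follows the last one.
def each_window(str, window_size, step_size = 1)
  last_step = 0
  0.step(str.length - window_size, step_size) do |i|
    yield str[i, window_size]
    last_step = i
  end
  str[(last_step + window_size)..-1]
end

windows = []
rest = each_window('atgcatgcat', 4, 4) { |w| windows << w }

puts windows.inspect    # ["atgc", "atgc"]
puts rest               # at
```

In this sketch, an input shorter than the window yields nothing and the final slice evaluates to `nil`, so callers may want to check the remainder before using it.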



# File 'lib/bio/sequence/common.rb', line 179

def window_search(window_size, step_size = 1)
  last_step = 0
  0.step(self.length - window_size, step_size) do |i| 
    yield self[i, window_size]                        
    last_step = i
  end                          
  return self[last_step + window_size .. -1] 
end