Class: Bio::Sequence

Inherits:
Object
Includes:
Format
Defined in:
lib/bio/sequence.rb,
lib/bio/sequence/aa.rb,
lib/bio/sequence/na.rb,
lib/bio/sequence/common.rb,
lib/bio/sequence/compat.rb,
lib/bio/sequence/format.rb,
lib/bio/sequence/generic.rb

Overview

DESCRIPTION

Bio::Sequence objects represent annotated sequences in BioRuby. A Bio::Sequence object wraps the actual sequence, which is held as either a Bio::Sequence::NA or a Bio::Sequence::AA object. For most users this encapsulation is transparent: Bio::Sequence responds to every method defined for Bio::Sequence::NA/AA objects, taking the same arguments and returning the same values, even though these methods are not documented specifically for Bio::Sequence.

USAGE

# Create a nucleic or amino acid sequence
dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')
rna = Bio::Sequence.auto('augcaugcaugcaugcaaaa')
aa = Bio::Sequence.auto('ACDEFGHIKLMNPQRSTVWYU')

# Print it out
puts dna.to_s
puts aa.to_s

# Get a subsequence, bioinformatics style (first nucleotide is '1')
puts dna.subseq(2,6)

# Get a subsequence, informatics style (first nucleotide is '0')
puts dna[2,6]

# Print in FASTA format
puts dna.output(:fasta)

# Print all codons
dna.window_search(3,3) do |codon|
  puts codon
end

# Splice or otherwise mangle your sequence
puts dna.splicing("complement(join(1..5,16..20))")
puts rna.splicing("complement(join(1..5,16..20))")

# Convert a sequence containing ambiguity codes into a 
# regular expression you can use for subsequent searching
puts aa.to_re

# These should speak for themselves
puts dna.complement
puts dna.composition
puts dna.molecular_weight
puts dna.translate
puts dna.gc_percent
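
The two indexing conventions in the listing above are easy to confuse. The plain-Ruby sketch below imitates them on an ordinary String. `subseq_1based` is a hypothetical helper written for this illustration (it is not BioRuby's implementation), and it assumes that `subseq(s, e)` means positions s through e, counted from 1 and inclusive, while `[start, length]` is Ruby's usual 0-based slice.

```ruby
# Hypothetical helper, not part of BioRuby: 1-based, inclusive subsequence.
def subseq_1based(str, first, last)
  str[(first - 1)..(last - 1)]
end

dna = 'atgcatgcatgcatgcaaaa'

one_based  = subseq_1based(dna, 2, 6) # positions 2..6, counting from 1
zero_based = dna[2, 6]                # start at offset 2, take 6 characters

puts one_based   # tgcat
puts zero_based  # gcatgc
```

The same pair of numbers therefore selects a different start position and a different length under each convention.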

Defined Under Namespace

Modules: Adapter, Common, Format
Classes: AA, DBLink, Generic, NA

Methods included from Format

#list_output_formats, #output

Constructor Details

#initialize(str) ⇒ Sequence

Create a new Bio::Sequence object

s = Bio::Sequence.new('atgc')
puts s                                  #=> 'atgc'

Note that this method does not initialize the contained sequence as any kind of BioRuby object, only as a simple String.

puts s.seq.class                        #=> String

See Bio::Sequence#na, Bio::Sequence#aa, and Bio::Sequence#auto for methods that transform the plain String of a newly created Bio::Sequence object into a proper BioRuby sequence object.


Arguments:

  • (required) str: String or Bio::Sequence::NA/AA object

Returns

Bio::Sequence object



# File 'lib/bio/sequence.rb', line 94

def initialize(str)
  @seq = str
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(sym, *args, &block) ⇒ Object

Pass any unknown method calls to the wrapped sequence object. See www.rubycentral.com/book/ref_c_object.html#Object.method_missing



# File 'lib/bio/sequence.rb', line 100

def method_missing(sym, *args, &block) #:nodoc:
  begin
    seq.__send__(sym, *args, &block)
  rescue NoMethodError => evar
    lineno = __LINE__ - 2
    file = __FILE__
    bt_here = [ "#{file}:#{lineno}:in \`__send__\'",
                "#{file}:#{lineno}:in \`method_missing\'"
              ]
    if bt_here == evar.backtrace[0, 2] then
      bt = evar.backtrace[2..-1]
      evar = evar.class.new("undefined method \`#{sym.to_s}\' for #{self.inspect}")
      evar.set_backtrace(bt)
    end
    #p lineno
    #p file
    #p bt_here
    #p evar.backtrace
    raise(evar)
  end
end
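
The delegation idea can be tried out in isolation. The following is a minimal sketch with a hypothetical `TinyWrapper` class, not the Bio::Sequence source. It checks `respond_to?` before forwarding instead of rescuing NoMethodError and rewriting the backtrace as the listing above does, and it adds `respond_to_missing?`, which the listing does not define.

```ruby
# Hypothetical wrapper, for illustration only (not the Bio::Sequence source).
class TinyWrapper
  attr_reader :seq

  def initialize(seq)
    @seq = seq
  end

  # Forward any method the wrapper lacks to the wrapped object.
  def method_missing(name, *args, &block)
    if seq.respond_to?(name)
      seq.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with the forwarding above.
  def respond_to_missing?(name, include_private = false)
    seq.respond_to?(name, include_private) || super
  end
end

w = TinyWrapper.new('atgc')
upcased = w.upcase   # String#upcase, reached through method_missing
length  = w.length   # String#length, reached the same way

begin
  w.no_such_method
rescue NoMethodError => e
  # The error is reported against the wrapper, not the wrapped String.
  error_receiver_is_wrapper = e.receiver.equal?(w)
end
```

Either approach has the same goal: an unknown method should be reported as missing on the wrapper, so the caller is not pointed at the internals of the delegation.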

Instance Attribute Details

#classification ⇒ Object Also known as: taxonomy

Organism classification, taxonomic classification of the source organism. (Array of String)

# File 'lib/bio/sequence.rb', line 214

def classification
  @classification
end

#comments ⇒ Object

Comments (String or an Array of String)

# File 'lib/bio/sequence.rb', line 137

def comments
  @comments
end

#data_class ⇒ Object

Data Class defined by EMBL (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_1

# File 'lib/bio/sequence.rb', line 176

def data_class
  @data_class
end

#date_created ⇒ Object

Created date of the sequence entry (Date, DateTime, Time, or String)

# File 'lib/bio/sequence.rb', line 189

def date_created
  @date_created
end

#date_modified ⇒ Object

Last modified date of the sequence entry (Date, DateTime, Time, or String)

# File 'lib/bio/sequence.rb', line 192

def date_modified
  @date_modified
end

#dblinks ⇒ Object

Links to other database entries. (An Array of Bio::Sequence::DBLink objects)

# File 'lib/bio/sequence.rb', line 144

def dblinks
  @dblinks
end

#definition ⇒ Object

A description of the sequence (String)

# File 'lib/bio/sequence.rb', line 128

def definition
  @definition
end

#division ⇒ Object

Taxonomic Division defined by EMBL/GenBank/DDBJ (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_2

# File 'lib/bio/sequence.rb', line 180

def division
  @division
end

#entry_id ⇒ Object

The sequence identifier (String). For example, for a sequence of GenBank origin, this is the locus name. For a sequence of EMBL origin, this is the primary accession number.

# File 'lib/bio/sequence.rb', line 125

def entry_id
  @entry_id
end

#entry_version ⇒ Object

Version of the entry (String or Integer). Unlike sequence_version, entry_version is a database maintainer’s internal version number. The version number changes when the database maintainer modifies the entry. The same entry in EMBL, GenBank, and DDBJ may have a different entry_version in each.

# File 'lib/bio/sequence.rb', line 207

def entry_version
  @entry_version
end

#features ⇒ Object

Features (An Array of Bio::Feature objects)

# File 'lib/bio/sequence.rb', line 131

def features
  @features
end

#id_namespace ⇒ Object

Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String). For example, ‘EMBL’, ‘GenBank’, ‘DDBJ’, ‘RefSeq’.

# File 'lib/bio/sequence.rb', line 223

def id_namespace
  @id_namespace
end

#keywords ⇒ Object

Keywords (An Array of String)

# File 'lib/bio/sequence.rb', line 140

def keywords
  @keywords
end

#molecule_type ⇒ Object

Molecule type (String). “DNA” or “RNA” for a nucleotide sequence.

# File 'lib/bio/sequence.rb', line 172

def molecule_type
  @molecule_type
end

#moltype ⇒ Object

Bio::Sequence::NA/AA

# File 'lib/bio/sequence.rb', line 147

def moltype
  @moltype
end

#organelle ⇒ Object

(not well supported) Organelle information (String).

# File 'lib/bio/sequence.rb', line 218

def organelle
  @organelle
end

#other_seqids ⇒ Object

Sequence identifiers that are not described in the entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects). For example, an NCBI GI number can be stored here. Note that only identifiers of the entry itself should be stored. For database cross references, use dblinks.

# File 'lib/bio/sequence.rb', line 231

def other_seqids
  @other_seqids
end

#primary_accession ⇒ Object

Primary accession number (String)

# File 'lib/bio/sequence.rb', line 183

def primary_accession
  @primary_accession
end

#references ⇒ Object

References (An Array of Bio::Reference objects)

# File 'lib/bio/sequence.rb', line 134

def references
  @references
end

#release_created ⇒ Object

Release information when created (String)

# File 'lib/bio/sequence.rb', line 195

def release_created
  @release_created
end

#release_modified ⇒ Object

Release information when last-modified (String)

# File 'lib/bio/sequence.rb', line 198

def release_modified
  @release_modified
end

#secondary_accessions ⇒ Object

Secondary accession numbers (Array of String)

# File 'lib/bio/sequence.rb', line 186

def secondary_accessions
  @secondary_accessions
end

#seq ⇒ Object

The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String

# File 'lib/bio/sequence.rb', line 151

def seq
  @seq
end

#sequence_version ⇒ Object

Version number of the sequence (String or Integer). Unlike entry_version, sequence_version will be changed when the submitter of the sequence updates the entry. Normally, the same entry taken from different databases (EMBL, GenBank, and DDBJ) may have the same sequence_version.

# File 'lib/bio/sequence.rb', line 162

def sequence_version
  @sequence_version
end

#species ⇒ Object

Organism species (String). For example, “Escherichia coli”.

# File 'lib/bio/sequence.rb', line 210

def species
  @species
end

#strandedness ⇒ Object

Strandedness (String). “single” (single-stranded), “double” (double-stranded), “mixed” (mixed-stranded), or nil.

# File 'lib/bio/sequence.rb', line 169

def strandedness
  @strandedness
end

#topology ⇒ Object

Topology (String). “circular”, “linear”, or nil.

# File 'lib/bio/sequence.rb', line 165

def topology
  @topology
end

Class Method Details

.adapter(source_data, adapter_module) ⇒ Object

Normally, users should not call this method directly. Use Bio::*#to_biosequence (e.g. Bio::GenBank#to_biosequence).

Creates a new Bio::Sequence object from database data with an adapter module.



# File 'lib/bio/sequence.rb', line 442

def self.adapter(source_data, adapter_module)
  biosequence = self.new(nil)
  biosequence.instance_eval {
    remove_instance_variable(:@seq)
    @source_data = source_data
  }
  biosequence.extend(adapter_module)
  biosequence
end
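
The adapter mechanism (drop the stored `@seq`, keep the raw record in `@source_data`, then `extend` the object with a module) can be illustrated without BioRuby. Everything in this sketch is hypothetical: `Record`, `RecordAdapter`, and `MiniSequence` are stand-ins, and the lazy `||=` readers are an assumption about how an adapter module might be written, since this page does not show one.

```ruby
# Hypothetical record and adapter, for illustration only.
Record = Struct.new(:id, :residues)

module RecordAdapter
  # Each reader pulls its value from @source_data on first use.
  def entry_id
    @entry_id ||= @source_data.id
  end

  def seq
    @seq ||= @source_data.residues.downcase
  end
end

class MiniSequence
  attr_reader :entry_id, :seq

  def initialize(seq)
    @seq = seq
  end

  def self.adapter(source_data, adapter_module)
    obj = new(nil)
    obj.instance_eval do
      remove_instance_variable(:@seq)  # so the adapter's lazy reader starts clean
      @source_data = source_data
    end
    obj.extend(adapter_module)         # module methods now shadow the attr_readers
    obj
  end
end

s = MiniSequence.adapter(Record.new('X00001', 'ATGC'), RecordAdapter)
puts s.entry_id
puts s.seq
```

Because `extend` places the module ahead of the class in the object's method lookup, the adapter's readers take precedence over the plain attribute readers for that one object only.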

.auto(str) ⇒ Object

Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA).

s = Bio::Sequence.auto('atgc')
puts s.seq.class                        #=> Bio::Sequence::NA

Arguments:

  • (required) str: String or Bio::Sequence::NA/AA object

Returns

Bio::Sequence object



# File 'lib/bio/sequence.rb', line 262

def self.auto(str)
  seq = self.new(str)
  seq.auto
  return seq
end

.guess(str, *args) ⇒ Object

Guess the class of a given sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.

puts Bio::Sequence.guess('atgc')        #=> Bio::Sequence::NA

There are three optional parameters: `threshold`, `length`, and `index`.

The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] required in the sequence for this method to produce a Bio::Sequence::NA “guess”. In the default case, unless more than 90% of the characters (after excluding [Nn]) are in the set [AGCTUagctu], the guess is Bio::Sequence::AA.

puts Bio::Sequence.guess('atgcatgcqq')      #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.8) #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.7) #=> Bio::Sequence::NA

The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.

# limit the guess to the first 1000 positions
puts Bio::Sequence.guess('A VERY LONG SEQUENCE', 0.9, 1000)

The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…

puts Bio::Sequence.guess('-----atgcc')             #=> Bio::Sequence::AA
puts Bio::Sequence.guess('-----atgcc',0.9,10000,5) #=> Bio::Sequence::NA

Arguments:

  • (required) str: String or Bio::Sequence::NA/AA object

  • (optional) threshold: Float in range 0,1 (default 0.9)

  • (optional) length: Fixnum (default 10000)

  • (optional) index: Fixnum (default 0)

Returns

Bio::Sequence::NA/AA



# File 'lib/bio/sequence.rb', line 360

def self.guess(str, *args)
  self.new(str).guess(*args)
end
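
The threshold rule described above is simple arithmetic, and it can be checked with a standalone sketch. The names `NUCLEOTIDES`, `na_fraction`, and `guess_kind` are hypothetical, and the sketch returns symbols rather than the Bio::Sequence::NA/AA classes, so it runs without BioRuby.

```ruby
# Standalone sketch of the threshold rule; all names are hypothetical.
NUCLEOTIDES = 'ACGTUacgtu'

def na_fraction(str)
  bases = str.count(NUCLEOTIDES)        # characters drawn from [AGCTUagctu]
  total = str.length - str.count('Nn')  # N/n are left out of the denominator
  bases.to_f / total
end

def guess_kind(str, threshold = 0.9)
  # Strictly greater than: a fraction equal to the threshold still means :aa.
  na_fraction(str) > threshold ? :na : :aa
end

fraction = na_fraction('atgcatgcqq')  # 8 of 10 characters are bases
puts fraction                          # 0.8
puts guess_kind('atgcatgcqq')          # aa
puts guess_kind('atgcatgcqq', 0.8)     # aa
puts guess_kind('atgcatgcqq', 0.7)     # na
```

The second call is the instructive one: 0.8 is not greater than 0.8, which is why a threshold of 0.8 still yields the amino acid guess in the examples above.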

.input(str, format = nil) ⇒ Object

Create a new Bio::Sequence object from a formatted string (GenBank, EMBL, fasta format, etc.)

s = Bio::Sequence.input(str)

Arguments:

  • (required) str: string

  • (optional) format: format specification (class or nil)

Returns

Bio::Sequence object



# File 'lib/bio/sequence.rb', line 415

def self.input(str, format = nil)
  if format then
    klass = format
  else
    klass = Bio::FlatFile::AutoDetect.default.autodetect(str)
  end
  obj = klass.new(str)
  obj.to_biosequence
end

.read(str, format = nil) ⇒ Object

Alias of Bio::Sequence.input.

# File 'lib/bio/sequence.rb', line 426

def self.read(str, format = nil)
  input(str, format)
end

Instance Method Details

#aa ⇒ Object

Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object. This method will change the current object. This method does not validate your choice, so be careful!

s = Bio::Sequence.new('atgc')
puts s.seq.class                        #=> String
s.aa
puts s.seq.class                        #=> Bio::Sequence::AA !!!

However, if you know your sequence type, this method may be constructively used after initialization,

s = Bio::Sequence.new('RRLE')
s.aa

Returns

Bio::Sequence::AA



# File 'lib/bio/sequence.rb', line 401

def aa
  @seq = AA.new(seq)
  @moltype = AA
end

#accessions ⇒ Object

Accession numbers of the sequence (primary accession first, followed by any secondary accessions).

Returns

Array of String



# File 'lib/bio/sequence.rb', line 433

def accessions
  [ primary_accession, secondary_accessions ].flatten.compact
end
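
The `flatten.compact` chain is what lets the method cope with missing values. A short plain-Ruby sketch of the same expression follows; the accession strings are made-up placeholders, not real database entries.

```ruby
# Placeholder values, for illustration only.
primary     = 'X00001'
secondaries = ['X00002', 'X00003']

all_ids      = [primary, secondaries].flatten.compact  # nested array flattened
only_primary = ['X00001', nil].flatten.compact         # nil secondaries dropped
none         = [nil, nil].flatten.compact              # nothing set at all

p all_ids       # ["X00001", "X00002", "X00003"]
p only_primary  # ["X00001"]
p none          # []
```

The result is therefore always an Array, even when neither attribute has been assigned.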

#auto ⇒ Object

Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess. This method will change the current Bio::Sequence object.

s = Bio::Sequence.new('atgc')
puts s.seq.class                        #=> String
s.auto
puts s.seq.class                        #=> Bio::Sequence::NA

Returns

Bio::Sequence::NA/AA object



# File 'lib/bio/sequence.rb', line 243

def auto
  @moltype = guess
  if @moltype == NA
    @seq = NA.new(seq)
  else
    @seq = AA.new(seq)
  end
end

#guess(threshold = 0.9, length = 10000, index = 0) ⇒ Object

Guess the class of the current sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.

s = Bio::Sequence.new('atgc')
puts s.guess                            #=> Bio::Sequence::NA

There are three parameters: `threshold`, `length`, and `index`.

The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] required in the sequence for this method to produce a Bio::Sequence::NA “guess”. In the default case, unless more than 90% of the characters (after excluding [Nn]) are in the set [AGCTUagctu], the guess is Bio::Sequence::AA.

s = Bio::Sequence.new('atgcatgcqq')
puts s.guess                            #=> Bio::Sequence::AA
puts s.guess(0.8)                       #=> Bio::Sequence::AA
puts s.guess(0.7)                       #=> Bio::Sequence::NA

The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.

s = Bio::Sequence.new('A VERY LONG SEQUENCE')
puts s.guess(0.9, 1000)  # limit the guess to the first 1000 positions

The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…

s = Bio::Sequence.new('-----atgcc')
puts s.guess                            #=> Bio::Sequence::AA
puts s.guess(0.9,10000,5)               #=> Bio::Sequence::NA

Arguments:

  • (optional) threshold: Float in range 0,1 (default 0.9)

  • (optional) length: Fixnum (default 10000)

  • (optional) index: Fixnum (default 0)

Returns

Bio::Sequence::NA/AA



# File 'lib/bio/sequence.rb', line 307

def guess(threshold = 0.9, length = 10000, index = 0)
  str = seq.to_s[index,length].to_s.extend Bio::Sequence::Common
  cmp = str.composition

  bases = cmp['A'] + cmp['T'] + cmp['G'] + cmp['C'] + cmp['U'] +
          cmp['a'] + cmp['t'] + cmp['g'] + cmp['c'] + cmp['u']

  total = str.length - cmp['N'] - cmp['n']

  if bases.to_f / total > threshold
    return NA
  else
    return AA
  end
end
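
The expression `seq.to_s[index,length].to_s` in the listing above relies on how Ruby's `String#[]` behaves at and beyond the end of a string. The plain-Ruby sketch below shows those cases. The closing comment about an empty window reaching the AA branch is a reading of the listing, under the assumption that `composition` reports zero for absent characters, and has not been checked against a BioRuby run.

```ruby
seq = 'atgc'

inside = seq[1, 2]        # "tg"
at_end = seq[4, 10]       # ""  (offset equal to the length)
past   = seq[5, 10]       # nil (offset beyond the length)
safe   = seq[5, 10].to_s  # ""  (nil.to_s), hence the trailing to_s in guess

# With nothing to count, the ratio becomes 0.0 / 0, which is NaN in Ruby.
# NaN > threshold is false, so such input would fall through to the AA branch.
ratio = 0.to_f / 0
exceeds_threshold = ratio > 0.9
```

An `index` that overshoots the sequence therefore does not raise an error; it quietly produces an empty window.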

#na ⇒ Object

Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object. This method will change the current object. This method does not validate your choice, so be careful!

s = Bio::Sequence.new('RRLE')
puts s.seq.class                        #=> String
s.na
puts s.seq.class                        #=> Bio::Sequence::NA !!!

However, if you know your sequence type, this method may be constructively used after initialization,

s = Bio::Sequence.new('atgc')
s.na

Returns

Bio::Sequence::NA



# File 'lib/bio/sequence.rb', line 380

def na
  @seq = NA.new(seq)
  @moltype = NA
end

#to_s ⇒ Object Also known as: to_str

Return sequence as String. The original sequence is unchanged.

s = Bio::Sequence.new('atgc')
puts s.to_s                             #=> 'atgc'
puts s.to_s.class                       #=> String
puts s                                  #=> 'atgc'
puts s.class                            #=> Bio::Sequence

Returns

String object



# File 'lib/bio/sequence/compat.rb', line 32

def to_s
  String.new(self.seq)
end
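
The claim that the original sequence is unchanged follows from `String.new`, which builds a separate String object. A small plain-Ruby check, using an ordinary String in place of the wrapped sequence:

```ruby
original = 'atgc'
copy     = String.new(original)  # same characters, different object

copy << 'nnn'                    # mutate the copy in place

puts original  # atgc
puts copy      # atgcnnn
```

Callers can therefore modify the returned String freely without touching the sequence held by the Bio::Sequence object.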