Class: Bio::Sequence
- Includes:
- Format
- Defined in:
- lib/bio/sequence.rb,
lib/bio/sequence/aa.rb,
lib/bio/sequence/na.rb,
lib/bio/sequence/common.rb,
lib/bio/sequence/compat.rb,
lib/bio/sequence/format.rb,
lib/bio/sequence/generic.rb
Overview
DESCRIPTION
Bio::Sequence objects represent annotated sequences in bioruby. A Bio::Sequence object is a wrapper around the actual sequence, represented as either a Bio::Sequence::NA or a Bio::Sequence::AA object. For most users, this encapsulation will be completely transparent. Bio::Sequence responds to all methods defined for Bio::Sequence::NA/AA objects using the same arguments and returning the same values (even though these methods are not documented specifically for Bio::Sequence).
USAGE
# Create a nucleic or amino acid sequence
dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')
rna = Bio::Sequence.auto('augcaugcaugcaugcaaaa')
aa = Bio::Sequence.auto('ACDEFGHIKLMNPQRSTVWYU')
# Print it out
puts dna.to_s
puts aa.to_s
# Get a subsequence, bioinformatics style (first nucleotide is '1')
puts dna.subseq(2,6)
# Get a subsequence, informatics style (first nucleotide is '0')
puts dna[2,6]
# Print in FASTA format
puts dna.output(:fasta)
# Print all codons
dna.window_search(3,3) do |codon|
puts codon
end
# Splice or otherwise mangle your sequence
puts dna.splicing("complement(join(1..5,16..20))")
puts rna.splicing("complement(join(1..5,16..20))")
# Convert a sequence containing ambiguity codes into a
# regular expression you can use for subsequent searching
puts aa.to_re
# These should speak for themselves
puts dna.complement
puts dna.composition
puts dna.molecular_weight
puts dna.translate
puts dna.gc_percent
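Two of the calls in the usage above are easy to misread, so here is a plain-Ruby sketch of what they compute. It does not use BioRuby itself: `subseq_1based` and `windows` are hypothetical helper names written for this illustration. The sketch assumes that `subseq(s, e)` takes 1-based inclusive coordinates and that `window_search(size, step)` visits every full window from left to right, as the comments in the usage suggest.

```ruby
# Hypothetical helper: 1-based, inclusive coordinates (as in subseq(2, 6)).
def subseq_1based(str, first, last)
  str[(first - 1)..(last - 1)]
end

# Hypothetical helper: collect each full window of `size` characters,
# advancing `step` characters each time (as in window_search(3, 3)).
# A trailing fragment shorter than `size` is not included.
def windows(str, size, step)
  0.step(str.length - size, step).map { |i| str[i, size] }
end

seq = 'atgcatgcatgcatgcaaaa'

puts subseq_1based(seq, 2, 6)      # positions 2..6, five characters
puts seq[2, 6]                     # offset 2, six characters
puts windows(seq, 3, 3).join(' ')  # non-overlapping triplets
```

The two subsequence calls return different strings for the same pair of numbers: the first reads the arguments as start and end positions, the second as a zero-based offset and a length.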
Defined Under Namespace
Modules: Adapter, Common, Format Classes: AA, DBLink, Generic, NA
Instance Attribute Summary
-
#classification ⇒ Object
(also: #taxonomy)
Organism classification, taxonomic classification of the source organism.
-
#comments ⇒ Object
Comments (String or an Array of String).
-
#data_class ⇒ Object
Data class defined by EMBL (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_1.
-
#date_created ⇒ Object
Created date of the sequence entry (Date, DateTime, Time, or String).
-
#date_modified ⇒ Object
Last modified date of the sequence entry (Date, DateTime, Time, or String).
-
#dblinks ⇒ Object
Links to other database entries.
-
#definition ⇒ Object
A String with a description of the sequence (String).
-
#division ⇒ Object
Taxonomic division defined by EMBL/GenBank/DDBJ (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_2.
-
#entry_id ⇒ Object
The sequence identifier (String).
-
#entry_version ⇒ Object
Version of the entry (String or Integer).
-
#features ⇒ Object
Features (An Array of Bio::Feature objects).
-
#id_namespace ⇒ Object
Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String).
-
#keywords ⇒ Object
Keywords (An Array of String).
-
#molecule_type ⇒ Object
Molecule type (String).
-
#moltype ⇒ Object
Bio::Sequence::NA/AA.
-
#organelle ⇒ Object
(not well supported) Organelle information (String).
-
#other_seqids ⇒ Object
Sequence identifiers which are not described in entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects).
-
#primary_accession ⇒ Object
Primary accession number (String).
-
#references ⇒ Object
References (An Array of Bio::Reference objects).
-
#release_created ⇒ Object
Release information when created (String).
-
#release_modified ⇒ Object
Release information when last-modified (String).
-
#secondary_accessions ⇒ Object
Secondary accession numbers (Array of String).
-
#seq ⇒ Object
The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String.
-
#sequence_version ⇒ Object
Version number of the sequence (String or Integer).
-
#species ⇒ Object
Organism species (String).
-
#strandedness ⇒ Object
Strandedness (String).
-
#topology ⇒ Object
Topology (String).
Class Method Summary
-
.adapter(source_data, adapter_module) ⇒ Object
Normally, users should not call this method directly.
-
.auto(str) ⇒ Object
Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA).
-
.guess(str, *args) ⇒ Object
Guess the class of a given sequence.
-
.input(str, format = nil) ⇒ Object
Create a new Bio::Sequence object from a formatted string (GenBank, EMBL, fasta format, etc.).
-
.read(str, format = nil) ⇒ Object
Alias of Bio::Sequence.input.
Instance Method Summary
-
#aa ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object.
-
#accessions ⇒ Object
Accession numbers of the sequence.
-
#auto ⇒ Object
Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess.
-
#guess(threshold = 0.9, length = 10000, index = 0) ⇒ Object
Guess the class of the current sequence.
-
#initialize(str) ⇒ Sequence
constructor
Create a new Bio::Sequence object.
-
#method_missing(sym, *args, &block) ⇒ Object
Pass any unknown method calls to the wrapped sequence object.
-
#na ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object.
-
#to_s ⇒ Object
(also: #to_str)
Return sequence as String.
Methods included from Format
Constructor Details
#initialize(str) ⇒ Sequence
Create a new Bio::Sequence object
s = Bio::Sequence.new('atgc')
puts s #=> 'atgc'
Note that this method does not initialize the contained sequence as any kind of bioruby object, only as a simple string:
puts s.seq.class #=> String
See Bio::Sequence#na, Bio::Sequence#aa, and Bio::Sequence#auto for methods to transform the basic String of a just created Bio::Sequence object to a proper bioruby object
Arguments:
-
(required) str: String or Bio::Sequence::NA/AA object
- Returns
-
Bio::Sequence object
# File 'lib/bio/sequence.rb', line 94
def initialize(str)
  @seq = str
end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(sym, *args, &block) ⇒ Object
Pass any unknown method calls to the wrapped sequence object. See www.rubycentral.com/book/ref_c_object.html#Object.method_missing
# File 'lib/bio/sequence.rb', line 100
def method_missing(sym, *args, &block) #:nodoc:
  begin
    seq.__send__(sym, *args, &block)
  rescue NoMethodError => evar
    lineno = __LINE__ - 2
    file = __FILE__
    bt_here = [ "#{file}:#{lineno}:in \`__send__\'",
                "#{file}:#{lineno}:in \`method_missing\'"
              ]
    if bt_here == evar.backtrace[0, 2] then
      bt = evar.backtrace[2..-1]
      evar = evar.class.new("undefined method \`#{sym.to_s}\' for #{self.inspect}")
      evar.set_backtrace(bt)
    end
    #p lineno
    #p file
    #p bt_here
    #p evar.backtrace
    raise(evar)
  end
end
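The delegation that makes Bio::Sequence transparent can be sketched in a few lines of plain Ruby. The `Wrapper` class below is a hypothetical stand-in, not the BioRuby implementation: it forwards unknown calls to the wrapped object and, as an addition that is not part of the listing above, also defines `respond_to_missing?` so that `respond_to?` agrees with what is forwarded.

```ruby
# Hypothetical stand-in for a wrapper class; not BioRuby code.
class Wrapper
  attr_reader :seq

  def initialize(seq)
    @seq = seq
  end

  # Forward any method this class does not define to the wrapped object.
  # If the wrapped object cannot handle it either, fall back to the
  # default behaviour, which raises NoMethodError.
  def method_missing(sym, *args, &block)
    if seq.respond_to?(sym)
      seq.__send__(sym, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(sym, include_private = false)
    seq.respond_to?(sym, include_private) || super
  end
end

w = Wrapper.new('atgc')
puts w.upcase                  # String#upcase, reached through method_missing
puts w.respond_to?(:reverse)   # answered by respond_to_missing?
```

Checking `respond_to?` before forwarding is a design choice of this sketch; the listing above instead forwards unconditionally and rewrites the backtrace of any resulting NoMethodError.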
Instance Attribute Details
#classification ⇒ Object Also known as: taxonomy
Organism classification, taxonomic classification of the source organism. (Array of String)
# File 'lib/bio/sequence.rb', line 214
def classification
  @classification
end
#comments ⇒ Object
Comments (String or an Array of String)
# File 'lib/bio/sequence.rb', line 137
def comments
  @comments
end
#data_class ⇒ Object
Data class defined by EMBL (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_1
# File 'lib/bio/sequence.rb', line 176
def data_class
  @data_class
end
#date_created ⇒ Object
Created date of the sequence entry (Date, DateTime, Time, or String)
# File 'lib/bio/sequence.rb', line 189
def date_created
  @date_created
end
#date_modified ⇒ Object
Last modified date of the sequence entry (Date, DateTime, Time, or String)
# File 'lib/bio/sequence.rb', line 192
def date_modified
  @date_modified
end
#dblinks ⇒ Object
Links to other database entries. (An Array of Bio::Sequence::DBLink objects)
# File 'lib/bio/sequence.rb', line 144
def dblinks
  @dblinks
end
#definition ⇒ Object
A String with a description of the sequence (String)
# File 'lib/bio/sequence.rb', line 128
def definition
  @definition
end
#division ⇒ Object
Taxonomic division defined by EMBL/GenBank/DDBJ (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_2
# File 'lib/bio/sequence.rb', line 180
def division
  @division
end
#entry_id ⇒ Object
The sequence identifier (String). For example, for a sequence of Genbank origin, this is the locus name. For a sequence of EMBL origin, this is the primary accession number.
# File 'lib/bio/sequence.rb', line 125
def entry_id
  @entry_id
end
#entry_version ⇒ Object
Version of the entry (String or Integer). Unlike sequence_version, entry_version is a database maintainer’s internal version number. The version number changes when the database maintainer modifies the entry. The same entry in EMBL, GenBank, and DDBJ may have different entry_version values.
# File 'lib/bio/sequence.rb', line 207
def entry_version
  @entry_version
end
#features ⇒ Object
Features (An Array of Bio::Feature objects)
# File 'lib/bio/sequence.rb', line 131
def features
  @features
end
#id_namespace ⇒ Object
Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String). For example, ‘EMBL’, ‘GenBank’, ‘DDBJ’, ‘RefSeq’.
# File 'lib/bio/sequence.rb', line 223
def id_namespace
  @id_namespace
end
#keywords ⇒ Object
Keywords (An Array of String)
# File 'lib/bio/sequence.rb', line 140
def keywords
  @keywords
end
#molecule_type ⇒ Object
Molecule type (String). “DNA” or “RNA” for nucleotide sequence.
# File 'lib/bio/sequence.rb', line 172
def molecule_type
  @molecule_type
end
#moltype ⇒ Object
Bio::Sequence::NA/AA
# File 'lib/bio/sequence.rb', line 147
def moltype
  @moltype
end
#organelle ⇒ Object
(not well supported) Organelle information (String).
# File 'lib/bio/sequence.rb', line 218
def organelle
  @organelle
end
#other_seqids ⇒ Object
Sequence identifiers which are not described in entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects). For example, an NCBI GI number can be stored here. Note that only identifiers of the entry itself should be stored. For database cross references, dblinks should be used.
# File 'lib/bio/sequence.rb', line 231
def other_seqids
  @other_seqids
end
#primary_accession ⇒ Object
Primary accession number (String)
# File 'lib/bio/sequence.rb', line 183
def primary_accession
  @primary_accession
end
#references ⇒ Object
References (An Array of Bio::Reference objects)
# File 'lib/bio/sequence.rb', line 134
def references
  @references
end
#release_created ⇒ Object
Release information when created (String)
# File 'lib/bio/sequence.rb', line 195
def release_created
  @release_created
end
#release_modified ⇒ Object
Release information when last-modified (String)
# File 'lib/bio/sequence.rb', line 198
def release_modified
  @release_modified
end
#secondary_accessions ⇒ Object
Secondary accession numbers (Array of String)
# File 'lib/bio/sequence.rb', line 186
def secondary_accessions
  @secondary_accessions
end
#seq ⇒ Object
The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String
# File 'lib/bio/sequence.rb', line 151
def seq
  @seq
end
#sequence_version ⇒ Object
Version number of the sequence (String or Integer). Unlike entry_version, sequence_version changes when the submitter of the sequence updates the entry. Normally, the same entry taken from different databases (EMBL, GenBank, and DDBJ) has the same sequence_version.
# File 'lib/bio/sequence.rb', line 162
def sequence_version
  @sequence_version
end
#species ⇒ Object
Organism species (String). For example, “Escherichia coli”.
# File 'lib/bio/sequence.rb', line 210
def species
  @species
end
#strandedness ⇒ Object
Strandedness (String). “single” (single-stranded), “double” (double-stranded), “mixed” (mixed-stranded), or nil.
# File 'lib/bio/sequence.rb', line 169
def strandedness
  @strandedness
end
#topology ⇒ Object
Topology (String). “circular”, “linear”, or nil.
# File 'lib/bio/sequence.rb', line 165
def topology
  @topology
end
Class Method Details
.adapter(source_data, adapter_module) ⇒ Object
Normally, users should not call this method directly. Use Bio::*#to_biosequence (e.g. Bio::GenBank#to_biosequence).
Creates a new Bio::Sequence object from database data with an adapter module.
# File 'lib/bio/sequence.rb', line 442
def self.adapter(source_data, adapter_module)
  biosequence = self.new(nil)
  biosequence.instance_eval {
    remove_instance_variable(:@seq)
    @source_data = source_data
  }
  biosequence.extend(adapter_module)
  biosequence
end
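The adapter mechanism relies on Ruby's per-object `extend`: the new object keeps the raw database record, and the adapter module supplies accessors that read from it. A minimal sketch of that pattern follows. `Record`, `HashAdapter`, and the hash keys are hypothetical names invented for this example and are not part of BioRuby.

```ruby
# Hypothetical container; stands in for the object being adapted.
class Record
  def self.adapter(source_data, adapter_module)
    obj = new
    obj.instance_variable_set(:@source_data, source_data)
    obj.extend(adapter_module)   # methods are added to this one object only
    obj
  end
end

# Hypothetical adapter: maps accessor calls onto the raw source data.
module HashAdapter
  def entry_id
    @source_data[:id]
  end

  def definition
    @source_data[:desc]
  end
end

rec = Record.adapter({ id: 'X00001', desc: 'example entry' }, HashAdapter)
puts rec.entry_id
puts rec.definition
puts Record.new.respond_to?(:entry_id)   # other instances are unaffected
```

Because `extend` changes only the receiving object, records built from different database formats could each carry a different adapter while sharing one class.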
.auto(str) ⇒ Object
Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA)
s = Bio::Sequence.auto('atgc')
puts s.seq.class #=> Bio::Sequence::NA
Arguments:
-
(required) str: String or Bio::Sequence::NA/AA object
- Returns
-
Bio::Sequence object
# File 'lib/bio/sequence.rb', line 262
def self.auto(str)
  seq = self.new(str)
  seq.auto
  return seq
end
.guess(str, *args) ⇒ Object
Guess the class of a given sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.
puts Bio::Sequence.guess('atgc') #=> Bio::Sequence::NA
There are three optional parameters: `threshold`, `length`, and `index`.
The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] required in the sequence for this method to produce a Bio::Sequence::NA “guess”. In the default case, the guess is Bio::Sequence::NA only if more than 90% of the residues (after excluding [Nn]) are in the set [AGCTUagctu]; otherwise the guess is Bio::Sequence::AA. The comparison is strict, so a frequency exactly equal to the threshold gives Bio::Sequence::AA.
puts Bio::Sequence.guess('atgcatgcqq') #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.8) #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.7) #=> Bio::Sequence::NA
The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.
# limit the guess to the first 1000 positions
puts Bio::Sequence.guess('A VERY LONG SEQUENCE', 0.9, 1000)
The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…
puts Bio::Sequence.guess('-----atgcc') #=> Bio::Sequence::AA
puts Bio::Sequence.guess('-----atgcc',0.9,10000,5) #=> Bio::Sequence::NA
Arguments:
-
(required) str: String or Bio::Sequence::NA/AA object
-
(optional) threshold: Float in range 0,1 (default 0.9)
-
(optional) length: Fixnum (default 10000)
-
(optional) index: Fixnum (default 0)
- Returns
-
Bio::Sequence::NA/AA
# File 'lib/bio/sequence.rb', line 360
def self.guess(str, *args)
  self.new(str).guess(*args)
end
.input(str, format = nil) ⇒ Object
# File 'lib/bio/sequence.rb', line 415
def self.input(str, format = nil)
  if format then
    klass = format
  else
    klass = Bio::FlatFile::AutoDetect.default.autodetect(str)
  end
  obj = klass.new(str)
  obj.to_biosequence
end
.read(str, format = nil) ⇒ Object
Alias of Bio::Sequence.input
# File 'lib/bio/sequence.rb', line 426
def self.read(str, format = nil)
  input(str, format)
end
Instance Method Details
#aa ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object. This method will change the current object. This method does not validate your choice, so be careful!
s = Bio::Sequence.new('atgc')
puts s.seq.class #=> String
s.aa
puts s.seq.class #=> Bio::Sequence::AA !!!
However, if you know your sequence type, this method may be constructively used after initialization,
s = Bio::Sequence.new('RRLE')
s.aa
- Returns
-
Bio::Sequence::AA
# File 'lib/bio/sequence.rb', line 401
def aa
  @seq = AA.new(seq)
  @moltype = AA
end
#accessions ⇒ Object
Accession numbers of the sequence
- Returns
-
Array of String
# File 'lib/bio/sequence.rb', line 433
def accessions
  [ primary_accession, secondary_accessions ].flatten.compact
end
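The `flatten.compact` idiom in this method means that a missing primary accession or a missing list of secondary accessions simply disappears from the result. A small plain-Ruby sketch of that behaviour, using a hypothetical `merge_accessions` helper and made-up accession strings:

```ruby
# Hypothetical helper mirroring the expression used by #accessions.
def merge_accessions(primary, secondary)
  [primary, secondary].flatten.compact
end

p merge_accessions('X00001', ['X00002', 'X00003'])  # primary first, then secondaries
p merge_accessions(nil, ['X00002'])                 # nil primary is dropped
p merge_accessions('X00001', nil)                   # nil secondary list is dropped
```

The result is always a flat Array, so callers can iterate over it without checking for nil first.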
#auto ⇒ Object
Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess. This method will change the current Bio::Sequence object.
s = Bio::Sequence.new('atgc')
puts s.seq.class #=> String
s.auto
puts s.seq.class #=> Bio::Sequence::NA
- Returns
-
Bio::Sequence::NA/AA object
# File 'lib/bio/sequence.rb', line 243
def auto
  @moltype = guess
  if @moltype == NA
    @seq = NA.new(seq)
  else
    @seq = AA.new(seq)
  end
end
#guess(threshold = 0.9, length = 10000, index = 0) ⇒ Object
Guess the class of the current sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.
s = Bio::Sequence.new('atgc')
puts s.guess #=> Bio::Sequence::NA
There are three optional parameters: `threshold`, `length`, and `index`.
The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] required in the sequence for this method to produce a Bio::Sequence::NA “guess”. In the default case, the guess is Bio::Sequence::NA only if more than 90% of the residues (after excluding [Nn]) are in the set [AGCTUagctu]; otherwise the guess is Bio::Sequence::AA. The comparison is strict, so a frequency exactly equal to the threshold gives Bio::Sequence::AA.
s = Bio::Sequence.new('atgcatgcqq')
puts s.guess #=> Bio::Sequence::AA
puts s.guess(0.8) #=> Bio::Sequence::AA
puts s.guess(0.7) #=> Bio::Sequence::NA
The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.
s = Bio::Sequence.new('A VERY LONG SEQUENCE')
puts s.guess(0.9, 1000) # limit the guess to the first 1000 positions
The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…
s = Bio::Sequence.new('-----atgcc')
puts s.guess #=> Bio::Sequence::AA
puts s.guess(0.9,10000,5) #=> Bio::Sequence::NA
Arguments:
-
(optional) threshold: Float in range 0,1 (default 0.9)
-
(optional) length: Fixnum (default 10000)
-
(optional) index: Fixnum (default 0)
- Returns
-
Bio::Sequence::NA/AA
# File 'lib/bio/sequence.rb', line 307
def guess(threshold = 0.9, length = 10000, index = 0)
  str = seq.to_s[index,length].to_s.extend Bio::Sequence::Common
  cmp = str.composition

  bases = cmp['A'] + cmp['T'] + cmp['G'] + cmp['C'] + cmp['U'] +
          cmp['a'] + cmp['t'] + cmp['g'] + cmp['c'] + cmp['u']

  total = str.length - cmp['N'] - cmp['n']

  if bases.to_f / total > threshold
    return NA
  else
    return AA
  end
end
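The heuristic can be exercised without BioRuby installed. The sketch below is a plain-Ruby re-implementation written for illustration: the name `guess_type` and the returned symbols `:na` and `:aa` are assumptions of this example, whereas the real method returns the classes Bio::Sequence::NA and Bio::Sequence::AA. It follows the listing above: take `length` characters starting at `index`, count [AGCTUagctu], exclude [Nn] from the total, and compare the ratio with `threshold` using a strict `>`.

```ruby
# Hypothetical re-implementation of the guessing heuristic on a plain String.
def guess_type(str, threshold = 0.9, length = 10000, index = 0)
  window = str[index, length].to_s          # portion of the sequence to inspect
  bases  = window.count('AGCTUagctu')       # characters that look like nucleotides
  total  = window.length - window.count('Nn')
  bases.to_f / total > threshold ? :na : :aa
end

p guess_type('atgc')                        # every character is a base
p guess_type('atgcatgcqq')                  # 8 of 10 are bases, default threshold
p guess_type('atgcatgcqq', 0.8)             # ratio equals the threshold
p guess_type('atgcatgcqq', 0.7)             # ratio exceeds the threshold
p guess_type('-----atgcc')                  # leading gaps count against the ratio
p guess_type('-----atgcc', 0.9, 10000, 5)   # start after the gaps
```

With an empty window the ratio is not a number and the comparison is false, so this sketch falls through to `:aa`.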
#na ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object. This method will change the current object. This method does not validate your choice, so be careful!
s = Bio::Sequence.new('RRLE')
puts s.seq.class #=> String
s.na
puts s.seq.class #=> Bio::Sequence::NA !!!
However, if you know your sequence type, this method may be constructively used after initialization,
s = Bio::Sequence.new('atgc')
s.na
- Returns
-
Bio::Sequence::NA
# File 'lib/bio/sequence.rb', line 380
def na
  @seq = NA.new(seq)
  @moltype = NA
end