Class: Bio::Sequence
- Includes:
  Format, SequenceMasker
- Defined in:
  lib/bio/sequence.rb,
  lib/bio/sequence/aa.rb,
  lib/bio/sequence/na.rb,
  lib/bio/sequence/common.rb,
  lib/bio/sequence/compat.rb,
  lib/bio/sequence/format.rb,
  lib/bio/sequence/generic.rb,
  lib/bio/sequence/quality_score.rb,
  lib/bio/sequence/sequence_masker.rb
Overview
DESCRIPTION
Bio::Sequence objects represent annotated sequences in BioRuby. A Bio::Sequence object is a wrapper around the actual sequence, which is represented as either a Bio::Sequence::NA or a Bio::Sequence::AA object. For most users, this encapsulation is completely transparent: Bio::Sequence responds to all methods defined for Bio::Sequence::NA/AA objects, taking the same arguments and returning the same values (even though these methods are not documented specifically for Bio::Sequence).
USAGE
# Create a nucleic or amino acid sequence
dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')
rna = Bio::Sequence.auto('augcaugcaugcaugcaaaa')
aa = Bio::Sequence.auto('ACDEFGHIKLMNPQRSTVWYU')
# Print it out
puts dna.to_s
puts aa.to_s
# Get a subsequence, bioinformatics style (first nucleotide is '1')
puts dna.subseq(2,6)
# Get a subsequence, informatics style (first nucleotide is '0')
puts dna[2,6]
# Print in FASTA format
puts dna.output(:fasta)
# Print all codons
dna.window_search(3,3) do |codon|
puts codon
end
# Splice or otherwise mangle your sequence
puts dna.splicing("complement(join(1..5,16..20))")
puts rna.splicing("complement(join(1..5,16..20))")
# Convert a sequence containing ambiguity codes into a
# regular expression you can use for subsequent searching
puts aa.to_re
# These should speak for themselves
puts dna.complement
puts dna.composition
puts dna.molecular_weight
puts dna.translate
puts dna.gc_percent
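Two calls in the usage listing are easy to confuse: subseq(2,6) takes a 1-based, inclusive start and end position, whereas dna[2,6] is Ruby's usual 0-based start index plus a length. The plain-Ruby sketch below mimics both conventions, together with the non-overlapping width-3 windows of window_search(3,3), on an ordinary String. The helper names subseq_one_based and codon_windows are made up for illustration and are not BioRuby methods; the sketch also assumes that a trailing partial window is not yielded.

```ruby
# Plain String stand-in for the lowercase nucleotide sequence used above.
seq = 'atgcatgcatgcatgcaaaa'

# Hypothetical helper: 1-based, inclusive positions (the subseq(2,6) convention).
def subseq_one_based(str, from, to)
  str[(from - 1)..(to - 1)]
end

# Hypothetical helper: non-overlapping windows of width 3.
# A trailing window shorter than 3 characters is dropped.
def codon_windows(str)
  str.scan(/.{3}/)
end

puts subseq_one_based(seq, 2, 6)  # tgcat  (positions 2 through 6)
puts seq[2, 6]                    # gcatgc (6 characters starting at index 2)
puts codon_windows(seq).join(' ') # atg cat gca tgc atg caa
```

The two slices differ in both starting point and length, so it is worth checking which convention a coordinate came from before passing it on.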
Defined Under Namespace
Modules: Adapter, Common, Format, QualityScore, SequenceMasker
Classes: AA, DBLink, Generic, NA
Instance Attribute Summary
-
#classification ⇒ Object
(also: #taxonomy)
Organism classification, taxonomic classification of the source organism.
-
#comments ⇒ Object
Comments (String or an Array of String).
-
#data_class ⇒ Object
Data Class defined by EMBL (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_1.
-
#date_created ⇒ Object
Created date of the sequence entry (Date, DateTime, Time, or String).
-
#date_modified ⇒ Object
Last modified date of the sequence entry (Date, DateTime, Time, or String).
-
#dblinks ⇒ Object
Links to other database entries.
-
#definition ⇒ Object
A String with a description of the sequence (String).
-
#division ⇒ Object
Taxonomic Division defined by EMBL/GenBank/DDBJ (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_2.
-
#entry_id ⇒ Object
The sequence identifier (String).
-
#entry_version ⇒ Object
Version of the entry (String or Integer).
-
#error_probabilities ⇒ Object
Error probabilities of the bases/residues in the sequence.
-
#features ⇒ Object
Features (An Array of Bio::Feature objects).
-
#id_namespace ⇒ Object
Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String).
-
#keywords ⇒ Object
Keywords (An Array of String).
-
#molecule_type ⇒ Object
molecular type (String).
-
#moltype ⇒ Object
Bio::Sequence::NA/AA.
-
#organelle ⇒ Object
(not well supported) Organelle information (String).
-
#other_seqids ⇒ Object
Sequence identifiers which are not described in entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects).
-
#primary_accession ⇒ Object
Primary accession number (String).
-
#quality_score_type ⇒ Object
The meaning (calculation method) of the quality scores stored in the quality_scores attribute.
-
#quality_scores ⇒ Object
Quality scores of the bases/residues in the sequence.
-
#references ⇒ Object
References (An Array of Bio::Reference objects).
-
#release_created ⇒ Object
Release information when created (String).
-
#release_modified ⇒ Object
Release information when last-modified (String).
-
#secondary_accessions ⇒ Object
Secondary accession numbers (Array of String).
-
#seq ⇒ Object
The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String.
-
#sequence_version ⇒ Object
Version number of the sequence (String or Integer).
-
#species ⇒ Object
Organism species (String).
-
#strandedness ⇒ Object
Strandedness (String).
-
#topology ⇒ Object
Topology (String).
Class Method Summary
-
.adapter(source_data, adapter_module) ⇒ Object
Normally, users should not call this method directly.
-
.auto(str) ⇒ Object
Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA).
-
.guess(str, *args) ⇒ Object
Guess the class of a given sequence.
-
.input(str, format = nil) ⇒ Object
Create a new Bio::Sequence object from a formatted string (GenBank, EMBL, fasta format, etc.).
-
.read(str, format = nil) ⇒ Object
alias of Bio::Sequence.input.
Instance Method Summary
-
#aa ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object.
-
#accessions ⇒ Object
accession numbers of the sequence.
-
#auto ⇒ Object
Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess.
-
#guess(threshold = 0.9, length = 10000, index = 0) ⇒ Object
Guess the class of the current sequence.
-
#initialize(str) ⇒ Sequence
constructor
Create a new Bio::Sequence object.
-
#method_missing(sym, *args, &block) ⇒ Object
Pass any unknown method calls to the wrapped sequence object.
-
#na ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object.
-
#to_s ⇒ Object
(also: #to_str)
Return sequence as String.
Methods included from SequenceMasker
#mask_with_enumerator, #mask_with_error_probability, #mask_with_quality_score
Methods included from Format
#list_output_formats, #output, #output_fasta
Constructor Details
#initialize(str) ⇒ Sequence
Create a new Bio::Sequence object
s = Bio::Sequence.new('atgc')
puts s #=> 'atgc'
Note that this method does not initialize the contained sequence as any kind of BioRuby object, only as a simple String:
puts s.seq.class #=> String
See Bio::Sequence#na, Bio::Sequence#aa, and Bio::Sequence#auto for methods that transform the basic String of a newly created Bio::Sequence object into a proper BioRuby object.
Arguments:
- (required) str: String or Bio::Sequence::NA/AA object
Returns:
- Bio::Sequence object
# File 'lib/bio/sequence.rb', line 97
def initialize(str)
  @seq = str
end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(sym, *args, &block) ⇒ Object
Pass any unknown method calls to the wrapped sequence object. See www.rubycentral.com/book/ref_c_object.html#Object.method_missing
# File 'lib/bio/sequence.rb', line 103
def method_missing(sym, *args, &block) #:nodoc:
  begin
    seq.__send__(sym, *args, &block)
  rescue NoMethodError => evar
    lineno = __LINE__ - 2
    file = __FILE__
    bt_here = [ "#{file}:#{lineno}:in \`__send__\'",
                "#{file}:#{lineno}:in \`method_missing\'"
              ]
    if bt_here == evar.backtrace[0, 2] then
      bt = evar.backtrace[2..-1]
      evar = evar.class.new("undefined method \`#{sym.to_s}\' for #{self.inspect}")
      evar.set_backtrace(bt)
    end
    #p lineno
    #p file
    #p bt_here
    #p evar.backtrace
    raise(evar)
  end
end
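The delegation idea behind method_missing can be tried out on its own. The following is a minimal sketch in plain Ruby, not the BioRuby implementation: SeqWrapper is a hypothetical class that forwards unknown calls to a wrapped String, and it uses respond_to_missing? plus super rather than the backtrace rewriting shown above.

```ruby
# Hypothetical stand-in for a wrapper class; not part of BioRuby.
class SeqWrapper
  attr_reader :seq

  def initialize(seq)
    @seq = seq
  end

  # Forward any call the wrapper does not define to the wrapped object.
  def method_missing(sym, *args, &block)
    if seq.respond_to?(sym)
      seq.__send__(sym, *args, &block)
    else
      super # raises NoMethodError for genuinely unknown methods
    end
  end

  # Keep respond_to? consistent with what method_missing accepts.
  def respond_to_missing?(sym, include_private = false)
    seq.respond_to?(sym, include_private) || super
  end
end

w = SeqWrapper.new('atgc')
puts w.upcase  # ATGC
puts w.length  # 4
```

Calling a method that neither the wrapper nor the wrapped String defines still raises NoMethodError, which is the behaviour callers generally expect from a transparent wrapper.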
Instance Attribute Details
#classification ⇒ Object Also known as: taxonomy
Organism classification, taxonomic classification of the source organism. (Array of String)
# File 'lib/bio/sequence.rb', line 233
def classification
  @classification
end
#comments ⇒ Object
Comments (String or an Array of String)
# File 'lib/bio/sequence.rb', line 140
def comments
  @comments
end
#data_class ⇒ Object
Data Class defined by EMBL (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_1
# File 'lib/bio/sequence.rb', line 195
def data_class
  @data_class
end
#date_created ⇒ Object
Created date of the sequence entry (Date, DateTime, Time, or String)
# File 'lib/bio/sequence.rb', line 208
def date_created
  @date_created
end
#date_modified ⇒ Object
Last modified date of the sequence entry (Date, DateTime, Time, or String)
# File 'lib/bio/sequence.rb', line 211
def date_modified
  @date_modified
end
#dblinks ⇒ Object
Links to other database entries. (An Array of Bio::Sequence::DBLink objects)
# File 'lib/bio/sequence.rb', line 147
def dblinks
  @dblinks
end
#definition ⇒ Object
A String with a description of the sequence (String)
# File 'lib/bio/sequence.rb', line 131
def definition
  @definition
end
#division ⇒ Object
Taxonomic Division defined by EMBL/GenBank/DDBJ (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_2
# File 'lib/bio/sequence.rb', line 199
def division
  @division
end
#entry_id ⇒ Object
The sequence identifier (String). For example, for a sequence of Genbank origin, this is the locus name. For a sequence of EMBL origin, this is the primary accession number.
# File 'lib/bio/sequence.rb', line 128
def entry_id
  @entry_id
end
#entry_version ⇒ Object
Version of the entry (String or Integer). Unlike sequence_version, entry_version is a database maintainer’s internal version number. The version number changes when the database maintainer modifies the entry. The same entry in EMBL, GenBank, and DDBJ may have a different entry_version.
# File 'lib/bio/sequence.rb', line 226
def entry_version
  @entry_version
end
#error_probabilities ⇒ Object
Error probabilities of the bases/residues in the sequence. (Array containing Float, or nil)
# File 'lib/bio/sequence.rb', line 170
def error_probabilities
  @error_probabilities
end
#features ⇒ Object
Features (An Array of Bio::Feature objects)
# File 'lib/bio/sequence.rb', line 134
def features
  @features
end
#id_namespace ⇒ Object
Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String). For example, ‘EMBL’, ‘GenBank’, ‘DDBJ’, ‘RefSeq’.
# File 'lib/bio/sequence.rb', line 242
def id_namespace
  @id_namespace
end
#keywords ⇒ Object
Keywords (An Array of String)
# File 'lib/bio/sequence.rb', line 143
def keywords
  @keywords
end
#molecule_type ⇒ Object
molecular type (String). “DNA” or “RNA” for nucleotide sequence.
# File 'lib/bio/sequence.rb', line 191
def molecule_type
  @molecule_type
end
#moltype ⇒ Object
Bio::Sequence::NA/AA
# File 'lib/bio/sequence.rb', line 150
def moltype
  @moltype
end
#organelle ⇒ Object
(not well supported) Organelle information (String).
# File 'lib/bio/sequence.rb', line 237
def organelle
  @organelle
end
#other_seqids ⇒ Object
Sequence identifiers which are not described in entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects). For example, an NCBI GI number can be stored here. Note that only identifiers of the entry itself should be stored. For database cross references, dblinks should be used.
# File 'lib/bio/sequence.rb', line 250
def other_seqids
  @other_seqids
end
#primary_accession ⇒ Object
Primary accession number (String)
# File 'lib/bio/sequence.rb', line 202
def primary_accession
  @primary_accession
end
#quality_score_type ⇒ Object
The meaning (calculation method) of the quality scores stored in the quality_scores attribute. May be one of :phred, :solexa, or nil.
Note that if it is nil and error_probabilities is empty, some methods implicitly assume that it is :phred (PHRED score).
# File 'lib/bio/sequence.rb', line 166
def quality_score_type
  @quality_score_type
end
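For orientation, the two score types differ in how a score Q relates to an error probability p. Under the standard definitions, a PHRED score is Q = -10 log10(p) and a Solexa score is Q = -10 log10(p / (1 - p)). The sketch below applies those textbook formulas in plain Ruby; the function names are made up for illustration, and this is not necessarily how Bio::Sequence::QualityScore performs its conversions.

```ruby
# Hypothetical helpers based on the standard score definitions.

# PHRED: Q = -10 * log10(p)  =>  p = 10^(-Q/10)
def phred_to_error_probability(q)
  10.0**(-q / 10.0)
end

def error_probability_to_phred(p)
  -10.0 * Math.log10(p)
end

# Solexa: Q = -10 * log10(p / (1 - p))  =>  p = odds / (1 + odds)
def solexa_to_error_probability(q)
  odds = 10.0**(-q / 10.0)
  odds / (1.0 + odds)
end

puts phred_to_error_probability(20).round(6)   # 0.01
puts error_probability_to_phred(0.001).round(6) # 30.0
puts solexa_to_error_probability(0).round(6)   # 0.5
```

The two scales nearly coincide for high-quality bases and diverge for low scores, which is why the score type needs to be recorded alongside the scores.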
#quality_scores ⇒ Object
Quality scores of the bases/residues in the sequence. (Array containing Integer, or nil)
# File 'lib/bio/sequence.rb', line 158
def quality_scores
  @quality_scores
end
#references ⇒ Object
References (An Array of Bio::Reference objects)
# File 'lib/bio/sequence.rb', line 137
def references
  @references
end
#release_created ⇒ Object
Release information when created (String)
# File 'lib/bio/sequence.rb', line 214
def release_created
  @release_created
end
#release_modified ⇒ Object
Release information when last-modified (String)
# File 'lib/bio/sequence.rb', line 217
def release_modified
  @release_modified
end
#secondary_accessions ⇒ Object
Secondary accession numbers (Array of String)
# File 'lib/bio/sequence.rb', line 205
def secondary_accessions
  @secondary_accessions
end
#seq ⇒ Object
The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String
# File 'lib/bio/sequence.rb', line 154
def seq
  @seq
end
#sequence_version ⇒ Object
Version number of the sequence (String or Integer). Unlike entry_version, sequence_version will be changed when the submitter of the sequence updates the entry. Normally, the same entry taken from different databases (EMBL, GenBank, and DDBJ) may have the same sequence_version.
# File 'lib/bio/sequence.rb', line 181
def sequence_version
  @sequence_version
end
#species ⇒ Object
Organism species (String). For example, “Escherichia coli”.
# File 'lib/bio/sequence.rb', line 229
def species
  @species
end
#strandedness ⇒ Object
Strandedness (String). “single” (single-stranded), “double” (double-stranded), “mixed” (mixed-stranded), or nil.
# File 'lib/bio/sequence.rb', line 188
def strandedness
  @strandedness
end
#topology ⇒ Object
Topology (String). “circular”, “linear”, or nil.
# File 'lib/bio/sequence.rb', line 184
def topology
  @topology
end
Class Method Details
.adapter(source_data, adapter_module) ⇒ Object
Normally, users should not call this method directly. Use Bio::*#to_biosequence (e.g. Bio::GenBank#to_biosequence).
Creates a new Bio::Sequence object from database data with an adapter module.
# File 'lib/bio/sequence.rb', line 461
def self.adapter(source_data, adapter_module)
  biosequence = self.new(nil)
  biosequence.instance_eval {
    remove_instance_variable(:@seq)
    @source_data = source_data
  }
  biosequence.extend(adapter_module)
  biosequence
end
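The adapter mechanism combines two plain-Ruby techniques: replacing an instance variable with the raw source data, and extending the single object with a module whose methods read from that data. The following is a rough sketch of that pattern under simplifying assumptions. MiniSequence and HashAdapter are hypothetical names, and the source data is a bare Hash rather than a parsed database entry; real BioRuby adapters are more elaborate.

```ruby
# Hypothetical adapter: derives values lazily from @source_data.
module HashAdapter
  def seq
    @seq ||= @source_data.fetch(:sequence)
  end
end

# Hypothetical stand-in for the wrapper class.
class MiniSequence
  attr_reader :seq

  def initialize(str)
    @seq = str
  end

  def self.adapter(source_data, adapter_module)
    obj = new(nil)
    obj.instance_eval do
      remove_instance_variable(:@seq) # drop the placeholder value
      @source_data = source_data      # keep the raw entry instead
    end
    obj.extend(adapter_module)        # module methods now shadow attr_reader
    obj
  end
end

entry = { sequence: 'atgc' }
puts MiniSequence.adapter(entry, HashAdapter).seq  # atgc
```

Because extend affects only the one object, instances created with MiniSequence.new keep the ordinary attribute reader.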
.auto(str) ⇒ Object
Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA)
s = Bio::Sequence.auto('atgc')
puts s.seq.class #=> Bio::Sequence::NA
Arguments:
- (required) str: String or Bio::Sequence::NA/AA object
Returns:
- Bio::Sequence object
# File 'lib/bio/sequence.rb', line 281
def self.auto(str)
  seq = self.new(str)
  seq.auto
  return seq
end
.guess(str, *args) ⇒ Object
Guess the class of a given sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.
puts Bio::Sequence.guess('atgc') #=> Bio::Sequence::NA
There are three optional parameters: threshold, length, and index.
The threshold value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] that the sequence must exceed for this method to produce a Bio::Sequence::NA “guess”. In the default case, the guess is Bio::Sequence::NA only if more than 90% of the bases (after excluding [Nn]) are in the set [AGCTUagctu]; otherwise the guess is Bio::Sequence::AA.
puts Bio::Sequence.guess('atgcatgcqq') #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.8) #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.7) #=> Bio::Sequence::NA
The length value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.
# limit the guess to the first 1000 positions
puts Bio::Sequence.guess('A VERY LONG SEQUENCE', 0.9, 1000)
The index value is where to start the guess. Perhaps you know there are a lot of gaps at the start…
puts Bio::Sequence.guess('-----atgcc') #=> Bio::Sequence::AA
puts Bio::Sequence.guess('-----atgcc',0.9,10000,5) #=> Bio::Sequence::NA
Arguments:
- (required) str: String or Bio::Sequence::NA/AA object
- (optional) threshold: Float in the range 0 to 1 (default 0.9)
- (optional) length: Fixnum (default 10000)
- (optional) index: Fixnum (default 0)
Returns:
- Bio::Sequence::NA/AA
# File 'lib/bio/sequence.rb', line 379
def self.guess(str, *args)
  self.new(str).guess(*args)
end
.input(str, format = nil) ⇒ Object
# File 'lib/bio/sequence.rb', line 434
def self.input(str, format = nil)
  if format then
    klass = format
  else
    klass = Bio::FlatFile::AutoDetect.default.autodetect(str)
  end
  obj = klass.new(str)
  obj.to_biosequence
end
.read(str, format = nil) ⇒ Object
alias of Bio::Sequence.input
# File 'lib/bio/sequence.rb', line 445
def self.read(str, format = nil)
  input(str, format)
end
Instance Method Details
#aa ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object. This method will change the current object. This method does not validate your choice, so be careful!
s = Bio::Sequence.new('atgc')
puts s.seq.class #=> String
s.aa
puts s.seq.class #=> Bio::Sequence::AA !!!
However, if you know your sequence type, this method may be constructively used after initialization,
s = Bio::Sequence.new('RRLE')
s.aa
Returns:
- Bio::Sequence::AA
# File 'lib/bio/sequence.rb', line 420
def aa
  @seq = AA.new(seq)
  @moltype = AA
end
#accessions ⇒ Object
accession numbers of the sequence
Returns:
- Array of String
# File 'lib/bio/sequence.rb', line 452
def accessions
  [ primary_accession, secondary_accessions ].flatten.compact
end
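The flatten.compact idiom lets the method return a flat Array whether the secondary accessions are an Array, nil, or the primary accession itself is missing. A small plain-Ruby sketch of the same idiom follows; merge_accessions is a hypothetical helper and the accession strings are invented.

```ruby
# Hypothetical helper mirroring the flatten.compact idiom.
def merge_accessions(primary, secondary)
  [primary, secondary].flatten.compact
end

p merge_accessions('X00001', %w[Y00002 Z00003]) # ["X00001", "Y00002", "Z00003"]
p merge_accessions('X00001', nil)               # ["X00001"]
p merge_accessions(nil, nil)                    # []
```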
#auto ⇒ Object
Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess. This method will change the current Bio::Sequence object.
s = Bio::Sequence.new('atgc')
puts s.seq.class #=> String
s.auto
puts s.seq.class #=> Bio::Sequence::NA
Returns:
- Bio::Sequence::NA/AA object
# File 'lib/bio/sequence.rb', line 262
def auto
  @moltype = guess
  if @moltype == NA
    @seq = NA.new(seq)
  else
    @seq = AA.new(seq)
  end
end
#guess(threshold = 0.9, length = 10000, index = 0) ⇒ Object
Guess the class of the current sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.
s = Bio::Sequence.new('atgc')
puts s.guess #=> Bio::Sequence::NA
There are three parameters: threshold, length, and index.
The threshold value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] that the sequence must exceed for this method to produce a Bio::Sequence::NA “guess”. In the default case, the guess is Bio::Sequence::NA only if more than 90% of the bases (after excluding [Nn]) are in the set [AGCTUagctu]; otherwise the guess is Bio::Sequence::AA.
s = Bio::Sequence.new('atgcatgcqq')
puts s.guess #=> Bio::Sequence::AA
puts s.guess(0.8) #=> Bio::Sequence::AA
puts s.guess(0.7) #=> Bio::Sequence::NA
The length value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.
s = Bio::Sequence.new('A VERY LONG SEQUENCE')
puts s.guess(0.9, 1000) # limit the guess to the first 1000 positions
The index value is where to start the guess. Perhaps you know there are a lot of gaps at the start…
s = Bio::Sequence.new('-----atgcc')
puts s.guess #=> Bio::Sequence::AA
puts s.guess(0.9,10000,5) #=> Bio::Sequence::NA
Arguments:
- (optional) threshold: Float in the range 0 to 1 (default 0.9)
- (optional) length: Fixnum (default 10000)
- (optional) index: Fixnum (default 0)
Returns:
- Bio::Sequence::NA/AA
# File 'lib/bio/sequence.rb', line 326
def guess(threshold = 0.9, length = 10000, index = 0)
  str = seq.to_s[index,length].to_s.extend Bio::Sequence::Common
  cmp = str.composition
  bases = cmp['A'] + cmp['T'] + cmp['G'] + cmp['C'] + cmp['U'] +
          cmp['a'] + cmp['t'] + cmp['g'] + cmp['c'] + cmp['u']
  total = str.length - cmp['N'] - cmp['n']
  if bases.to_f / total > threshold
    return NA
  else
    return AA
  end
end
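The threshold rule can be reproduced without BioRuby, which makes the boundary cases easier to see. The sketch below is a standalone approximation of the logic in the listing above: guess_kind is a hypothetical function, it uses String#count in place of composition, and it returns the symbols :na and :aa instead of the Bio::Sequence::NA and Bio::Sequence::AA classes.

```ruby
# Hypothetical standalone version of the guess rule.
def guess_kind(str, threshold = 0.9, length = 10_000, index = 0)
  window = str[index, length].to_s       # region to inspect
  bases  = window.count('ACGTUacgtu')    # unambiguous nucleotide letters
  total  = window.length - window.count('Nn') # N/n are excluded from the total
  bases.to_f / total > threshold ? :na : :aa  # strict comparison
end

p guess_kind('atgcatgcqq')                 # :aa (8/10 = 0.8 is not > 0.9)
p guess_kind('atgcatgcqq', 0.8)            # :aa (0.8 is not > 0.8)
p guess_kind('atgcatgcqq', 0.7)            # :na
p guess_kind('-----atgcc')                 # :aa (gaps count toward the total)
p guess_kind('-----atgcc', 0.9, 10_000, 5) # :na (gaps skipped via index)
```

Because the comparison is strict, a sequence whose base fraction exactly equals the threshold is classified as amino acid, which matches the 0.8 example in the text.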
#na ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object. This method will change the current object. This method does not validate your choice, so be careful!
s = Bio::Sequence.new('RRLE')
puts s.seq.class #=> String
s.na
puts s.seq.class #=> Bio::Sequence::NA !!!
However, if you know your sequence type, this method may be constructively used after initialization,
s = Bio::Sequence.new('atgc')
s.na
Returns:
- Bio::Sequence::NA
# File 'lib/bio/sequence.rb', line 399
def na
  @seq = NA.new(seq)
  @moltype = NA
end