Class: Bio::Sequence
- Includes:
- Format, SequenceMasker
- Defined in:
- lib/bio/sequence.rb,
lib/bio/sequence/aa.rb,
lib/bio/sequence/na.rb,
lib/bio/sequence/common.rb,
lib/bio/sequence/compat.rb,
lib/bio/sequence/format.rb,
lib/bio/sequence/generic.rb,
lib/bio/sequence/quality_score.rb,
lib/bio/sequence/sequence_masker.rb
Overview
DESCRIPTION
Bio::Sequence objects represent annotated sequences in bioruby. A Bio::Sequence object is a wrapper around the actual sequence, represented as either a Bio::Sequence::NA or a Bio::Sequence::AA object. For most users, this encapsulation will be completely transparent. Bio::Sequence responds to all methods defined for Bio::Sequence::NA/AA objects using the same arguments and returning the same values (even though these methods are not documented specifically for Bio::Sequence).
USAGE
# Create a nucleic or amino acid sequence
dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')
rna = Bio::Sequence.auto('augcaugcaugcaugcaaaa')
aa = Bio::Sequence.auto('ACDEFGHIKLMNPQRSTVWYU')
# Print it out
puts dna.to_s
puts aa.to_s
# Get a subsequence, bioinformatics style (first nucleotide is '1')
puts dna.subseq(2,6)
# Get a subsequence, informatics style (first nucleotide is '0')
puts dna[2,6]
# Print in FASTA format
puts dna.output(:fasta)
# Print all codons
dna.window_search(3,3) do |codon|
  puts codon
end
# Splice or otherwise mangle your sequence
puts dna.splicing("complement(join(1..5,16..20))")
puts rna.splicing("complement(join(1..5,16..20))")
# Convert a sequence containing ambiguity codes into a
# regular expression you can use for subsequent searching
puts aa.to_re
# These should speak for themselves
puts dna.complement
puts dna.composition
puts dna.molecular_weight
puts dna.translate
puts dna.gc_percent
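The two subsequence calls in the usage example follow different conventions. A minimal sketch with a plain String (not using the bio gem), assuming subseq takes a 1-based start and end, both inclusive, while dna[2,6] is Ruby's String#[] with a 0-based start and a length. The helper name subseq_1based is hypothetical and exists only for illustration:

```ruby
# Hypothetical helper that mirrors 1-based, inclusive coordinates.
# It is not part of BioRuby; it only illustrates the convention.
def subseq_1based(str, first, last)
  str[(first - 1)..(last - 1)]
end

dna = 'atgcatgcATGCATGCAAAA'

one_based  = subseq_1based(dna, 2, 6) # positions 2 through 6
zero_based = dna[2, 6]                # start at index 2, take 6 characters

puts one_based
puts zero_based
```

The first call yields five characters and the second yields six, which is an easy source of off-by-one errors when mixing the two styles.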
Defined Under Namespace
Modules: Adapter, Common, Format, QualityScore, SequenceMasker
Classes: AA, DBLink, Generic, NA
Instance Attribute Summary
-
#classification ⇒ Object
(also: #taxonomy)
Organism classification, taxonomic classification of the source organism.
-
#comments ⇒ Object
Comments (String or an Array of String).
-
#data_class ⇒ Object
Data class defined by EMBL (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_1.
-
#date_created ⇒ Object
Created date of the sequence entry (Date, DateTime, Time, or String).
-
#date_modified ⇒ Object
Last modified date of the sequence entry (Date, DateTime, Time, or String).
-
#dblinks ⇒ Object
Links to other database entries.
-
#definition ⇒ Object
A String with a description of the sequence (String).
-
#division ⇒ Object
Taxonomic division defined by EMBL/GenBank/DDBJ (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_2.
-
#entry_id ⇒ Object
The sequence identifier (String).
-
#entry_version ⇒ Object
Version of the entry (String or Integer).
-
#error_probabilities ⇒ Object
Error probabilities of the bases/residues in the sequence.
-
#features ⇒ Object
Features (An Array of Bio::Feature objects).
-
#id_namespace ⇒ Object
Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String).
-
#keywords ⇒ Object
Keywords (An Array of String).
-
#molecule_type ⇒ Object
Molecule type (String).
-
#moltype ⇒ Object
Bio::Sequence::NA/AA.
-
#organelle ⇒ Object
(not well supported) Organelle information (String).
-
#other_seqids ⇒ Object
Sequence identifiers which are not described in entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects).
-
#primary_accession ⇒ Object
Primary accession number (String).
-
#quality_score_type ⇒ Object
The meaning (calculation method) of the quality scores stored in the quality_scores attribute.
-
#quality_scores ⇒ Object
Quality scores of the bases/residues in the sequence.
-
#references ⇒ Object
References (An Array of Bio::Reference objects).
-
#release_created ⇒ Object
Release information when created (String).
-
#release_modified ⇒ Object
Release information when last-modified (String).
-
#secondary_accessions ⇒ Object
Secondary accession numbers (Array of String).
-
#seq ⇒ Object
The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String.
-
#sequence_version ⇒ Object
Version number of the sequence (String or Integer).
-
#species ⇒ Object
Organism species (String).
-
#strandedness ⇒ Object
Strandedness (String).
-
#topology ⇒ Object
Topology (String).
Class Method Summary
-
.adapter(source_data, adapter_module) ⇒ Object
Normally, users should not call this method directly.
-
.auto(str) ⇒ Object
Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA).
-
.guess(str, *args) ⇒ Object
Guess the class of a given sequence.
-
.input(str, format = nil) ⇒ Object
Create a new Bio::Sequence object from a formatted string (GenBank, EMBL, fasta format, etc.).
-
.read(str, format = nil) ⇒ Object
Alias of Bio::Sequence.input.
Instance Method Summary
-
#aa ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object.
-
#accessions ⇒ Object
Accession numbers of the sequence.
-
#auto ⇒ Object
Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess.
-
#guess(threshold = 0.9, length = 10000, index = 0) ⇒ Object
Guess the class of the current sequence.
-
#initialize(str) ⇒ Sequence
constructor
Create a new Bio::Sequence object.
-
#method_missing(sym, *args, &block) ⇒ Object
Pass any unknown method calls to the wrapped sequence object.
-
#na ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object.
-
#to_s ⇒ Object
(also: #to_str)
Return sequence as String.
Methods included from SequenceMasker
#mask_with_enumerator, #mask_with_error_probability, #mask_with_quality_score
Methods included from Format
#list_output_formats, #output, #output_fasta
Constructor Details
#initialize(str) ⇒ Sequence
Create a new Bio::Sequence object
s = Bio::Sequence.new('atgc')
puts s #=> 'atgc'
Note that this method does not initialize the contained sequence as any kind of BioRuby object, only as a simple String:
puts s.seq.class #=> String
See Bio::Sequence#na, Bio::Sequence#aa, and Bio::Sequence#auto for methods that transform the basic String of a newly created Bio::Sequence object into a proper BioRuby object.
Arguments:
- (required) str: String or Bio::Sequence::NA/AA object
Returns:
- Bio::Sequence object
# File 'lib/bio/sequence.rb', line 99
def initialize(str)
  @seq = str
end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(sym, *args, &block) ⇒ Object
Pass any unknown method calls to the wrapped sequence object. See www.rubycentral.com/book/ref_c_object.html#Object.method_missing
# File 'lib/bio/sequence.rb', line 105
def method_missing(sym, *args, &block) #:nodoc:
  begin
    seq.__send__(sym, *args, &block)
  rescue NoMethodError => evar
    lineno = __LINE__ - 2
    file = __FILE__
    bt_here = [ "#{file}:#{lineno}:in \`__send__\'",
                "#{file}:#{lineno}:in \`method_missing\'"
              ]
    if bt_here == evar.backtrace[0, 2] then
      bt = evar.backtrace[2..-1]
      evar = evar.class.new("undefined method \`#{sym.to_s}\' for #{self.inspect}")
      evar.set_backtrace(bt)
    end
    #p lineno
    #p file
    #p bt_here
    #p evar.backtrace
    raise(evar)
  end
end
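The forwarding idea behind method_missing can be sketched without the bio gem. The Wrapper class below is hypothetical and is not the BioRuby implementation: instead of rescuing NoMethodError and rewriting the backtrace as the source above does, this simplified variant checks respond_to? before forwarding and falls back to super otherwise.

```ruby
# Hypothetical wrapper, for illustration only; not the BioRuby class.
class Wrapper
  attr_reader :seq

  def initialize(seq)
    @seq = seq
  end

  # Forward any method the wrapper itself does not define
  # to the wrapped object, if that object supports it.
  def method_missing(sym, *args, &block)
    if seq.respond_to?(sym)
      seq.__send__(sym, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with the forwarding above.
  def respond_to_missing?(sym, include_private = false)
    seq.respond_to?(sym, include_private) || super
  end
end

w = Wrapper.new('atgc')
puts w.upcase # forwarded to String#upcase
puts w.length # forwarded to String#length
```

Because unknown calls fall through to the wrapped object, a caller can treat the wrapper much like the underlying String, which is the transparency the class overview describes.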
Instance Attribute Details
#classification ⇒ Object Also known as: taxonomy
Organism classification, taxonomic classification of the source organism. (Array of String)
# File 'lib/bio/sequence.rb', line 235
def classification
  @classification
end
#comments ⇒ Object
Comments (String or an Array of String)
# File 'lib/bio/sequence.rb', line 142
def comments
  @comments
end
#data_class ⇒ Object
Data class defined by EMBL (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_1
# File 'lib/bio/sequence.rb', line 197
def data_class
  @data_class
end
#date_created ⇒ Object
Created date of the sequence entry (Date, DateTime, Time, or String)
# File 'lib/bio/sequence.rb', line 210
def date_created
  @date_created
end
#date_modified ⇒ Object
Last modified date of the sequence entry (Date, DateTime, Time, or String)
# File 'lib/bio/sequence.rb', line 213
def date_modified
  @date_modified
end
#dblinks ⇒ Object
Links to other database entries. (An Array of Bio::Sequence::DBLink objects)
# File 'lib/bio/sequence.rb', line 149
def dblinks
  @dblinks
end
#definition ⇒ Object
A String with a description of the sequence (String)
# File 'lib/bio/sequence.rb', line 133
def definition
  @definition
end
#division ⇒ Object
Taxonomic division defined by EMBL/GenBank/DDBJ (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_2
# File 'lib/bio/sequence.rb', line 201
def division
  @division
end
#entry_id ⇒ Object
The sequence identifier (String). For example, for a sequence of GenBank origin, this is the locus name. For a sequence of EMBL origin, this is the primary accession number.
# File 'lib/bio/sequence.rb', line 130
def entry_id
  @entry_id
end
#entry_version ⇒ Object
Version of the entry (String or Integer). Unlike sequence_version, entry_version is a database maintainer’s internal version number. The version number will be changed when the database maintainer modifies the entry. The same entry in EMBL, GenBank, and DDBJ may have different entry_version values.
# File 'lib/bio/sequence.rb', line 228
def entry_version
  @entry_version
end
#error_probabilities ⇒ Object
Error probabilities of the bases/residues in the sequence. (Array containing Float, or nil)
# File 'lib/bio/sequence.rb', line 172
def error_probabilities
  @error_probabilities
end
#features ⇒ Object
Features (An Array of Bio::Feature objects)
# File 'lib/bio/sequence.rb', line 136
def features
  @features
end
#id_namespace ⇒ Object
Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String). For example, ‘EMBL’, ‘GenBank’, ‘DDBJ’, ‘RefSeq’.
# File 'lib/bio/sequence.rb', line 244
def id_namespace
  @id_namespace
end
#keywords ⇒ Object
Keywords (An Array of String)
# File 'lib/bio/sequence.rb', line 145
def keywords
  @keywords
end
#molecule_type ⇒ Object
Molecule type (String). “DNA” or “RNA” for nucleotide sequence.
# File 'lib/bio/sequence.rb', line 193
def molecule_type
  @molecule_type
end
#moltype ⇒ Object
Bio::Sequence::NA/AA
# File 'lib/bio/sequence.rb', line 152
def moltype
  @moltype
end
#organelle ⇒ Object
(not well supported) Organelle information (String).
# File 'lib/bio/sequence.rb', line 239
def organelle
  @organelle
end
#other_seqids ⇒ Object
Sequence identifiers which are not described in entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects). For example, an NCBI GI number can be stored. Note that only identifiers of the entry itself should be stored. For database cross references, dblinks should be used.
# File 'lib/bio/sequence.rb', line 252
def other_seqids
  @other_seqids
end
#primary_accession ⇒ Object
Primary accession number (String)
# File 'lib/bio/sequence.rb', line 204
def primary_accession
  @primary_accession
end
#quality_score_type ⇒ Object
The meaning (calculation method) of the quality scores stored in the quality_scores attribute. May be one of :phred, :solexa, or nil.
Note that if it is nil and error_probabilities is empty, some methods implicitly assume that it is :phred (PHRED score).
# File 'lib/bio/sequence.rb', line 168
def quality_score_type
  @quality_score_type
end
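To make the difference between the two score types concrete, here is a minimal sketch that converts scores to error probabilities. It assumes the commonly used definitions Q = -10 log10(p) for PHRED and Q = -10 log10(p / (1 - p)) for Solexa. The helper names are hypothetical and are not part of the BioRuby API.

```ruby
# Hypothetical helpers; not part of the BioRuby API.

# PHRED: Q = -10 * log10(p), so p = 10^(-Q/10).
def phred_to_probability(q)
  10.0**(-q / 10.0)
end

# Solexa: Q = -10 * log10(p / (1 - p)), so p = 1 / (1 + 10^(Q/10)).
def solexa_to_probability(q)
  1.0 / (1.0 + 10.0**(q / 10.0))
end

scores = [10, 20, 30]
phred_probs  = scores.map { |q| phred_to_probability(q) }
solexa_probs = scores.map { |q| solexa_to_probability(q) }

scores.each_with_index do |q, i|
  puts "Q=#{q} phred=#{phred_probs[i]} solexa=#{solexa_probs[i]}"
end
```

The two scales nearly agree for high scores and diverge for low ones, which is why the score type needs to be recorded alongside the scores.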
#quality_scores ⇒ Object
Quality scores of the bases/residues in the sequence. (Array containing Integer, or nil)
# File 'lib/bio/sequence.rb', line 160
def quality_scores
  @quality_scores
end
#references ⇒ Object
References (An Array of Bio::Reference objects)
# File 'lib/bio/sequence.rb', line 139
def references
  @references
end
#release_created ⇒ Object
Release information when created (String)
# File 'lib/bio/sequence.rb', line 216
def release_created
  @release_created
end
#release_modified ⇒ Object
Release information when last-modified (String)
# File 'lib/bio/sequence.rb', line 219
def release_modified
  @release_modified
end
#secondary_accessions ⇒ Object
Secondary accession numbers (Array of String)
# File 'lib/bio/sequence.rb', line 207
def secondary_accessions
  @secondary_accessions
end
#seq ⇒ Object
The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String
# File 'lib/bio/sequence.rb', line 156
def seq
  @seq
end
#sequence_version ⇒ Object
Version number of the sequence (String or Integer). Unlike entry_version, sequence_version will be changed when the submitter of the sequence updates the entry. Normally, the same entry taken from different databases (EMBL, GenBank, and DDBJ) may have the same sequence_version.
# File 'lib/bio/sequence.rb', line 183
def sequence_version
  @sequence_version
end
#species ⇒ Object
Organism species (String). For example, “Escherichia coli”.
# File 'lib/bio/sequence.rb', line 231
def species
  @species
end
#strandedness ⇒ Object
Strandedness (String). “single” (single-stranded), “double” (double-stranded), “mixed” (mixed-stranded), or nil.
# File 'lib/bio/sequence.rb', line 190
def strandedness
  @strandedness
end
#topology ⇒ Object
Topology (String). “circular”, “linear”, or nil.
# File 'lib/bio/sequence.rb', line 186
def topology
  @topology
end
Class Method Details
.adapter(source_data, adapter_module) ⇒ Object
Normally, users should not call this method directly. Use Bio::*#to_biosequence (e.g. Bio::GenBank#to_biosequence).
Creates a new Bio::Sequence object from database data with an adapter module.
# File 'lib/bio/sequence.rb', line 463
def self.adapter(source_data, adapter_module)
  biosequence = self.new(nil)
  biosequence.instance_eval {
    remove_instance_variable(:@seq)
    @source_data = source_data
  }
  biosequence.extend(adapter_module)
  biosequence
end
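The adapter pattern used here (store the source data, then extend the single object with a module) can be sketched in plain Ruby. Everything below is hypothetical: Record, HashAdapter, and the Hash keys are illustrative stand-ins, and the sketch assumes that an adapter module defines reader methods that pull values out of @source_data on demand.

```ruby
# Hypothetical adapter module: readers look up fields in @source_data.
module HashAdapter
  def entry_id
    @source_data[:id]
  end

  def definition
    @source_data[:description]
  end
end

# Hypothetical stand-in for the wrapping class.
class Record
  def self.adapter(source_data, adapter_module)
    obj = new
    # Store the raw data on this one object only.
    obj.instance_eval { @source_data = source_data }
    # Add the adapter's readers to this one object, not to the class.
    obj.extend(adapter_module)
    obj
  end
end

rec = Record.adapter({ id: 'ABC123', description: 'example entry' }, HashAdapter)
puts rec.entry_id
puts rec.definition
```

Extending the individual object rather than the class means different source formats can each supply their own module without affecting other instances.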
.auto(str) ⇒ Object
Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA)
s = Bio::Sequence.auto('atgc')
puts s.seq.class #=> Bio::Sequence::NA
Arguments:
- (required) str: String or Bio::Sequence::NA/AA object
Returns:
- Bio::Sequence object
# File 'lib/bio/sequence.rb', line 283
def self.auto(str)
  seq = self.new(str)
  seq.auto
  return seq
end
.guess(str, *args) ⇒ Object
Guess the class of a given sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.
puts Bio::Sequence.guess('atgc') #=> Bio::Sequence::NA
There are three optional parameters: `threshold`, `length`, and `index`.
The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] required in the sequence for this method to produce a Bio::Sequence::NA “guess”. In the default case, unless more than 90% of the characters (after excluding [Nn]) are in the set [AGCTUagctu], the guess is Bio::Sequence::AA. The comparison is strict, so a sequence at exactly the threshold is guessed as Bio::Sequence::AA.
puts Bio::Sequence.guess('atgcatgcqq') #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.8) #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.7) #=> Bio::Sequence::NA
The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.
# limit the guess to the first 1000 positions
puts Bio::Sequence.guess('A VERY LONG SEQUENCE', 0.9, 1000)
The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…
puts Bio::Sequence.guess('-----atgcc') #=> Bio::Sequence::AA
puts Bio::Sequence.guess('-----atgcc',0.9,10000,5) #=> Bio::Sequence::NA
Arguments:
- (required) str: String or Bio::Sequence::NA/AA object
- (optional) threshold: Float in the range 0 to 1 (default 0.9)
- (optional) length: Fixnum (default 10000)
- (optional) index: Fixnum (default 0)
Returns:
- Bio::Sequence::NA/AA
# File 'lib/bio/sequence.rb', line 381
def self.guess(str, *args)
  self.new(str).guess(*args)
end
.input(str, format = nil) ⇒ Object
Create a new Bio::Sequence object from a formatted string (GenBank, EMBL, fasta format, etc.). If no format is given, the format is autodetected.
# File 'lib/bio/sequence.rb', line 436
def self.input(str, format = nil)
  if format then
    klass = format
  else
    klass = Bio::FlatFile::AutoDetect.default.autodetect(str)
  end
  obj = klass.new(str)
  obj.to_biosequence
end
.read(str, format = nil) ⇒ Object
Alias of Bio::Sequence.input
# File 'lib/bio/sequence.rb', line 447
def self.read(str, format = nil)
  input(str, format)
end
Instance Method Details
#aa ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object. This method will change the current object. This method does not validate your choice, so be careful!
s = Bio::Sequence.new('atgc')
puts s.seq.class #=> String
s.aa
puts s.seq.class #=> Bio::Sequence::AA !!!
However, if you know your sequence type, this method may be constructively used after initialization,
s = Bio::Sequence.new('RRLE')
s.aa
Returns:
- Bio::Sequence::AA
# File 'lib/bio/sequence.rb', line 422
def aa
  @seq = AA.new(seq)
  @moltype = AA
end
#accessions ⇒ Object
Accession numbers of the sequence: the primary accession followed by the secondary accessions, with nil values removed.
Returns:
- Array of String
# File 'lib/bio/sequence.rb', line 454
def accessions
  [ primary_accession, secondary_accessions ].flatten.compact
end
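The flatten.compact idiom in this method can be tried with plain Ruby values. The sketch below does not use the bio gem; merge_accessions is a hypothetical helper and the accession strings are made-up placeholders.

```ruby
# Hypothetical helper mirroring the flatten.compact idiom.
def merge_accessions(primary, secondary)
  # flatten merges the nested secondary Array into one level,
  # compact drops nil entries (for example, attributes never set).
  [primary, secondary].flatten.compact
end

both      = merge_accessions('X00001', ['Y00002', 'Z00003'])
primary   = merge_accessions('X00001', nil)
secondary = merge_accessions(nil, ['Y00002'])
neither   = merge_accessions(nil, nil)

p both
p primary
p secondary
p neither
```

Because of compact, the result is always an Array, even when neither attribute has been set.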
#auto ⇒ Object
Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess. This method will change the current Bio::Sequence object.
s = Bio::Sequence.new('atgc')
puts s.seq.class #=> String
s.auto
puts s.seq.class #=> Bio::Sequence::NA
Returns:
- Bio::Sequence::NA/AA object
# File 'lib/bio/sequence.rb', line 264
def auto
  @moltype = guess
  if @moltype == NA
    @seq = NA.new(seq)
  else
    @seq = AA.new(seq)
  end
end
#guess(threshold = 0.9, length = 10000, index = 0) ⇒ Object
Guess the class of the current sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.
s = Bio::Sequence.new('atgc')
puts s.guess #=> Bio::Sequence::NA
There are three parameters: `threshold`, `length`, and `index`.
The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] required in the sequence for this method to produce a Bio::Sequence::NA “guess”. In the default case, unless more than 90% of the characters (after excluding [Nn]) are in the set [AGCTUagctu], the guess is Bio::Sequence::AA. The comparison is strict, so a sequence at exactly the threshold is guessed as Bio::Sequence::AA.
s = Bio::Sequence.new('atgcatgcqq')
puts s.guess #=> Bio::Sequence::AA
puts s.guess(0.8) #=> Bio::Sequence::AA
puts s.guess(0.7) #=> Bio::Sequence::NA
The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.
s = Bio::Sequence.new('A VERY LONG SEQUENCE')
puts s.guess(0.9, 1000) # limit the guess to the first 1000 positions
The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…
s = Bio::Sequence.new('-----atgcc')
puts s.guess #=> Bio::Sequence::AA
puts s.guess(0.9,10000,5) #=> Bio::Sequence::NA
Arguments:
- (optional) threshold: Float in the range 0 to 1 (default 0.9)
- (optional) length: Fixnum (default 10000)
- (optional) index: Fixnum (default 0)
Returns:
- Bio::Sequence::NA/AA
# File 'lib/bio/sequence.rb', line 328
def guess(threshold = 0.9, length = 10000, index = 0)
  str = seq.to_s[index,length].to_s.extend Bio::Sequence::Common
  cmp = str.composition

  bases = cmp['A'] + cmp['T'] + cmp['G'] + cmp['C'] + cmp['U'] +
          cmp['a'] + cmp['t'] + cmp['g'] + cmp['c'] + cmp['u']

  total = str.length - cmp['N'] - cmp['n']

  if bases.to_f / total > threshold
    return NA
  else
    return AA
  end
end
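The heuristic can be reproduced without the bio gem to see how the three parameters interact. This is a standalone sketch, not the library code: guess_type is a hypothetical name, it returns the symbols :na and :aa instead of the classes, and it counts characters with String#count rather than the composition method.

```ruby
# Hypothetical standalone version of the heuristic, using only String methods.
def guess_type(str, threshold = 0.9, length = 10000, index = 0)
  # Take the window starting at index; to_s guards against a nil slice.
  window = str[index, length].to_s
  bases  = window.count('ACGTUacgtu')
  # N and n are excluded from the denominator.
  total  = window.length - window.count('Nn')
  # Strict comparison: a ratio equal to the threshold yields :aa.
  bases.to_f / total > threshold ? :na : :aa
end

puts guess_type('atgc')
puts guess_type('atgcatgcqq')
puts guess_type('atgcatgcqq', 0.8)
puts guess_type('atgcatgcqq', 0.7)
puts guess_type('-----atgcc')
puts guess_type('-----atgcc', 0.9, 10000, 5)
```

The 'atgcatgcqq' case has a base ratio of exactly 0.8, so a threshold of 0.8 still yields the amino acid guess; only a lower threshold flips it.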
#na ⇒ Object
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object. This method will change the current object. This method does not validate your choice, so be careful!
s = Bio::Sequence.new('RRLE')
puts s.seq.class #=> String
s.na
puts s.seq.class #=> Bio::Sequence::NA !!!
However, if you know your sequence type, this method may be constructively used after initialization,
s = Bio::Sequence.new('atgc')
s.na
Returns:
- Bio::Sequence::NA
# File 'lib/bio/sequence.rb', line 401
def na
  @seq = NA.new(seq)
  @moltype = NA
end