Class: Bio::FastaFormat
Overview
Treats a FASTA formatted entry, such as:
>id and/or some comments <== definition line
ATGCATGCATGCATGCATGCATGCATGCATGCATGC <== sequence lines
ATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGC
The leading ‘>’ can be omitted, and a trailing ‘>’ left over from the next entry is removed automatically.
Examples
fasta_string = <<END_OF_STRING
>gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]
MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI
VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ
NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP
IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP
INRISARRAAIHPYFQES
END_OF_STRING
f = Bio::FastaFormat.new(fasta_string)
f.entry #=> ">gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]\n"+
# "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\n"+
# "VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\n"+
# "NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\n"+
# "IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\n"+
# "INRISARRAAIHPYFQES\n"
Methods related to the name of the sequence
A larger range of methods for dealing with Fasta definition lines can be found in FastaDefline, accessed through the FastaFormat#identifiers method.
f.entry_id #=> "gi|398365175"
f.definition #=> "gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]"
f.identifiers #=> Bio::FastaDefline instance
f.accession #=> "NP_009718"
f.accessions #=> ["NP_009718"]
f.acc_version #=> "NP_009718.3"
f.comment #=> nil
Methods related to the actual sequence
f.seq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
f.data #=> "\nMSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\nVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\nNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\nIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\nINRISARRAAIHPYFQES\n"
f.length #=> 298
f.aaseq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
f.aaseq.composition #=> {"M"=>5, "S"=>15, "G"=>21, "E"=>16, "L"=>36, "A"=>17, "N"=>8, "Y"=>13, "K"=>22, "R"=>20, "V"=>18, "T"=>7, "D"=>23, "P"=>17, "Q"=>10, "I"=>23, "H"=>7, "F"=>12, "C"=>4, "W"=>4}
f.aalen #=> 298
A less structured fasta entry
f.entry #=> ">abc 123 456\nASDF\n"
f.entry_id #=> "abc"
f.definition #=> "abc 123 456"
f.comment #=> nil
f.accession #=> nil
f.accessions #=> []
f.acc_version #=> nil
f.seq #=> "ASDF"
f.data #=> "\nASDF\n"
f.length #=> 4
f.aaseq #=> "ASDF"
f.aaseq.composition #=> {"A"=>1, "S"=>1, "D"=>1, "F"=>1}
f.aalen #=> 4
References
-
FASTA format (Wikipedia) en.wikipedia.org/wiki/FASTA_format
Direct Known Subclasses
Constant Summary
- DELIMITER = RS = "\n>"
Entry delimiter in flatfile text.
- DELIMITER_OVERRUN = 1
(Integer) excess read size included in DELIMITER.
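The role of the two constants can be illustrated with a small standalone sketch. It assumes a reader that cuts the text at each delimiter and hands the overrun character back to the next entry; `split_entries` is a hypothetical helper written for this illustration, not BioRuby's flatfile reader, and the constants are redefined locally with the values listed above.

```ruby
# Values copied from the constant summary above.
DELIMITER = "\n>"
DELIMITER_OVERRUN = 1 # the final '>' of the delimiter belongs to the next entry

# Hypothetical helper: splits multi-entry FASTA text into single entries.
def split_entries(text)
  entries = []
  rest = text
  while (idx = rest.index(DELIMITER))
    # Keep the newline with the current entry, give the '>' to the next one.
    cut = idx + DELIMITER.length - DELIMITER_OVERRUN
    entries << rest[0, cut]
    rest = rest[cut..-1]
  end
  entries << rest unless rest.empty?
  entries
end

entries = split_entries(">a\nAC\nGT\n>b\nTT\n")
p entries
```

Each element still starts with its own ‘>’, so it could be passed to a FASTA entry parser unchanged.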
Instance Attribute Summary
-
#data ⇒ Object
The sequence lines in text.
-
#definition ⇒ Object
The comment line of the FASTA formatted data.
-
#entry_overrun ⇒ Object
readonly
Returns the value of attribute entry_overrun.
Instance Method Summary
-
#aalen ⇒ Object
Returns the length of Bio::Sequence::AA.
-
#aaseq ⇒ Object
Returns the Bio::Sequence::AA.
-
#acc_version ⇒ Object
Returns accession number with version.
-
#accession ⇒ Object
Returns an accession number.
-
#accessions ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the accession numbers.
-
#comment ⇒ Object
Returns comments.
-
#entry ⇒ Object
(also: #to_s)
Returns the stored entry as a FASTA-formatted string.
-
#entry_id ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns a possibly unique identifier.
-
#gi ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the GI/locus/accession/accession with version number.
-
#identifiers ⇒ Object
Parses the FASTA defline and extracts IDs.
-
#initialize(str) ⇒ FastaFormat
constructor
Stores the comment and sequence information from one entry of the FASTA format string.
-
#length ⇒ Object
Returns sequence length.
-
#locus ⇒ Object
Returns locus.
-
#nalen ⇒ Object
Returns the length of Bio::Sequence::NA.
-
#naseq ⇒ Object
Returns the Bio::Sequence::NA.
-
#query(factory) ⇒ Object
(also: #fasta, #blast)
Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
-
#seq ⇒ Object
Returns a joined sequence line as a String.
-
#to_biosequence ⇒ Object
(also: #to_seq)
Returns sequence as a Bio::Sequence object.
Methods inherited from DB
#exists?, #fetch, #get, open, #tags
Constructor Details
#initialize(str) ⇒ FastaFormat
Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.
# File 'lib/bio/db/fasta.rb', line 131
def initialize(str)
  @definition = str[/.*/].sub(/^>/, '').strip # 1st line
  @data = str.sub(/.*/, '') # rests
  @data.sub!(/^>.*/m, '') # remove trailing entries for sure
  @entry_overrun = $&
end
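The regular-expression steps in the constructor can be tried in isolation. The following is a minimal sketch that mirrors them with plain Strings; `parse_fasta_entry` and the hash keys are hypothetical names chosen for this illustration and are not part of BioRuby.

```ruby
# Hypothetical standalone helper mirroring the steps of FastaFormat#initialize.
def parse_fasta_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip # first line, leading '>' removed
  data = str.sub(/.*/, '')                   # everything after the first line
  overrun = data[/^>.*/m]                    # text of any following entries (or nil)
  data = data.sub(/^>.*/m, '')               # keep only this entry's sequence lines
  { definition: definition, data: data, entry_overrun: overrun }
end

# A single entry with the leading '>'.
one = parse_fasta_entry(">abc 123 456\nASDF\n")

# Leading '>' omitted, followed by a second entry that is cut off.
two = parse_fasta_entry("abc 123 456\nAS\nDF\n>next entry\nGGGG\n")

p one
p two
```

In the second call the definition is still "abc 123 456", and the text from ">next entry" onward ends up in the overrun value rather than in the sequence data.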
Instance Attribute Details
#data ⇒ Object
The sequence lines in text.
# File 'lib/bio/db/fasta.rb', line 124
def data
  @data
end
#definition ⇒ Object
The comment line of the FASTA formatted data.
# File 'lib/bio/db/fasta.rb', line 121
def definition
  @definition
end
#entry_overrun ⇒ Object (readonly)
Returns the value of attribute entry_overrun.
# File 'lib/bio/db/fasta.rb', line 126
def entry_overrun
  @entry_overrun
end
Instance Method Details
#aalen ⇒ Object
Returns the length of Bio::Sequence::AA.
# File 'lib/bio/db/fasta.rb', line 221
def aalen
  self.aaseq.length
end
#aaseq ⇒ Object
Returns the Bio::Sequence::AA.
# File 'lib/bio/db/fasta.rb', line 216
def aaseq
  Sequence::AA.new(seq)
end
#acc_version ⇒ Object
Returns accession number with version.
# File 'lib/bio/db/fasta.rb', line 277
def acc_version
  identifiers.acc_version
end
#accession ⇒ Object
Returns an accession number.
# File 'lib/bio/db/fasta.rb', line 265
def accession
  identifiers.accession
end
#accessions ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the accession numbers as an array of strings.
# File 'lib/bio/db/fasta.rb', line 272
def accessions
  identifiers.accessions
end
#comment ⇒ Object
Returns comments.
# File 'lib/bio/db/fasta.rb', line 195
def comment
  seq
  @comment
end
#entry ⇒ Object Also known as: to_s
Returns the stored entry as a FASTA-formatted string (same as to_s).
# File 'lib/bio/db/fasta.rb', line 139
def entry
  @entry = ">#{@definition}\n#{@data.strip}\n"
end
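As a quick illustration of the string that this method builds, here is a minimal sketch using plain Strings. `build_entry` is a hypothetical stand-in for the method body above, not a BioRuby call.

```ruby
# Hypothetical stand-in for the body of FastaFormat#entry.
def build_entry(definition, data)
  # Surrounding whitespace of the data is stripped; inner line breaks are kept.
  ">#{definition}\n#{data.strip}\n"
end

text = build_entry("abc 123 456", "\nAS\nDF\n")
p text
```

The result always starts with ‘>’ and ends with a single newline, whether or not the original input had a leading ‘>’.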
#entry_id ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns a possibly unique identifier as a string.
# File 'lib/bio/db/fasta.rb', line 251
def entry_id
  identifiers.entry_id
end
#gi ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the GI/locus/accession/accession with version number. If an entry has more than one such ID, only the first is returned. It returns a string or nil.
# File 'lib/bio/db/fasta.rb', line 260
def gi
  identifiers.gi
end
#identifiers ⇒ Object
Parses the FASTA defline and extracts IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or “:”-separated IDs. It returns a Bio::FastaDefline instance.
# File 'lib/bio/db/fasta.rb', line 241
def identifiers
  unless defined?(@ids) then
    @ids = FastaDefline.new(@definition)
  end
  @ids
end
#length ⇒ Object
Returns sequence length.
# File 'lib/bio/db/fasta.rb', line 201
def length
  seq.length
end
#locus ⇒ Object
Returns locus.
# File 'lib/bio/db/fasta.rb', line 282
def locus
  identifiers.locus
end
#nalen ⇒ Object
Returns the length of Bio::Sequence::NA.
# File 'lib/bio/db/fasta.rb', line 211
def nalen
  self.naseq.length
end
#naseq ⇒ Object
Returns the Bio::Sequence::NA.
# File 'lib/bio/db/fasta.rb', line 206
def naseq
  Sequence::NA.new(seq)
end
#query(factory) ⇒ Object Also known as: fasta, blast
Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
#!/usr/bin/env ruby
require 'bio'
factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
flatfile.each do |entry|
  p entry.definition
  result = entry.fasta(factory)
  result.each do |hit|
    print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
    p hit.lap_at
  end
end
# File 'lib/bio/db/fasta.rb', line 162
def query(factory)
  factory.query(entry)
end
#seq ⇒ Object
Returns a joined sequence line as a String.
# File 'lib/bio/db/fasta.rb', line 169
def seq
  unless defined?(@seq)
    unless /\A\s*^\#/ =~ @data then
      @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
    else
      a = @data.split(/(^\#.*$)/)
      i = 0
      cmnt = {}
      s = []
      a.each do |x|
        if /^# ?(.*)$/ =~ x then
          cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
        else
          x.tr!(" \t\r\n0-9", '') # lazy clean up
          i += x.length
          s << x
        end
      end
      @comment = cmnt
      @seq = Bio::Sequence::Generic.new(s.join(''))
    end
  end
  @seq
end
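The clean-up and comment handling above can be followed more easily on a small input. The sketch below re-implements the same steps as a standalone function, with plain Strings standing in for Bio::Sequence::Generic; `extract_seq_and_comments` is a hypothetical name used only for this illustration.

```ruby
# Hypothetical standalone helper mirroring the logic of FastaFormat#seq.
# Returns [sequence, comments]; comments is nil when the data does not
# start with a '#' line, otherwise a Hash keyed by sequence offset.
def extract_seq_and_comments(data)
  junk = " \t\r\n0-9" # whitespace and digits are dropped from the sequence
  return [data.tr(junk, ''), nil] unless data =~ /\A\s*^\#/

  comments = {}
  pieces = []
  offset = 0
  data.split(/(^\#.*$)/).each do |chunk|
    if chunk =~ /^# ?(.*)$/
      # Several comment lines at the same offset are joined with newlines.
      comments[offset] = comments[offset] ? "#{comments[offset]}\n#{$1}" : $1
    else
      cleaned = chunk.tr(junk, '')
      offset += cleaned.length
      pieces << cleaned
    end
  end
  [pieces.join, comments]
end

plain_seq, plain_comments = extract_seq_and_comments("\nASDF\n")
seq, comments = extract_seq_and_comments(
  "\n# signal peptide\nMKT LLA\n# mature chain\n10 GGHH\n"
)
p seq
p comments
```

Each comment is stored under the number of residues read before it, which is why the second comment in the example is keyed by 6.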
#to_biosequence ⇒ Object Also known as: to_seq
Returns sequence as a Bio::Sequence object.
Note: If you modify the returned Bio::Sequence object, the sequence or definition in this FastaFormat object may also change (though not always), because data may be shared for efficiency.
# File 'lib/bio/db/fasta.rb', line 232
def to_biosequence
  Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
end