Class: Bio::EuPathDB::FastaParser
Inherits: Object
- Object
- Bio::EuPathDB::FastaParser
Defined in: lib/eupathdb_fasta.rb
Overview
EuPathDB databases appear to have settled on FASTA definition lines of the form
>gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | product=myb-like DNA binding domain-containing protein | location=TGME49_chrVIII:6835359-6840923(-) | length=1528
where the species name differs between databases but the rest of the layout is mostly constant.
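To make the layout of such a definition line concrete, here is a minimal sketch in plain Ruby that splits the example header into its identifier and key=value fields. The helper name split_eupathdb_header is hypothetical and is not part of this library; it assumes every field after the identifier has the form key=value.

```ruby
# Hypothetical helper (not part of Bio::EuPathDB): split a EuPathDB-style
# definition line into its identifier and its key=value fields.
# Assumes every field after the identifier contains an "=".
def split_eupathdb_header(line)
  id_part, *pairs = line.sub(/\A>/, '').split(' | ')
  centre, gene_id = id_part.split('|', 2)
  fields = pairs.to_h { |pair| pair.split('=', 2) }
  { centre: centre, gene_id: gene_id, fields: fields }
end

header = '>gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | ' \
         'product=myb-like DNA binding domain-containing protein | ' \
         'location=TGME49_chrVIII:6835359-6840923(-) | length=1528'

parsed = split_eupathdb_header(header)
puts parsed[:gene_id]            # the gene identifier after "gb|"
puts parsed[:fields]['organism'] # the species name field
```

The organism field is the part that varies between databases, which is why the parser takes the species name as a constructor argument.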
Instance Attribute Summary
- #species_name ⇒ Object
Returns the value of attribute species_name.
Instance Method Summary
- #each ⇒ Object
Enumerates the entries in the FASTA file.
- #initialize(species_name, filename) ⇒ FastaParser
constructor
The species name is the value expected in the organism= field of each definition line, for instance ‘Toxoplasma_gondii_ME49’ for >gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | product=myb-like DNA binding domain-containing protein | location=TGME49_chrVIII:6835359-6840923(-) | length=1528.
- #next_entry ⇒ Object
Returns the next entry in the FASTA file, or nil if there are no more entries or the file has not been opened successfully.
- #parse_name(definition) ⇒ Object
Constructor Details
#initialize(species_name, filename) ⇒ FastaParser
The species name is the value expected in the organism= field of each definition line, for instance ‘Toxoplasma_gondii_ME49’ for >gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | product=myb-like DNA binding domain-containing protein | location=TGME49_chrVIII:6835359-6840923(-) | length=1528.
# File 'lib/eupathdb_fasta.rb', line 15
def initialize(species_name, filename)
  @species_name = species_name
  @filename = filename
end
Instance Attribute Details
#species_name ⇒ Object
Returns the value of attribute species_name.
# File 'lib/eupathdb_fasta.rb', line 9
def species_name
  @species_name
end
Instance Method Details
#each ⇒ Object
Enumerates the entries in the FASTA file.
# File 'lib/eupathdb_fasta.rb', line 21
def each
  @flat = Bio::FlatFile.open(Bio::FastaFormat, @filename)
  n = next_entry
  while !n.nil?
    yield n
    n = next_entry
  end
end
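The iteration pattern used by #each (call next_entry repeatedly and yield until it returns nil) can be sketched without BioRuby. The HeaderWalker class below is a stand-in written for illustration only: it reads ">"-prefixed lines from an IO rather than using Bio::FlatFile, and it includes Enumerable, which the documented class is not shown to do.

```ruby
require 'stringio'

# Illustrative stand-in, not the library class: it walks the ">" header
# lines of FASTA-formatted text held in any IO-like object.
class HeaderWalker
  include Enumerable # an addition for this sketch

  def initialize(io)
    @io = io
  end

  # Return the next header line without its ">" prefix,
  # or nil once the input is exhausted.
  def next_entry
    while (line = @io.gets)
      return line.chomp.sub(/\A>/, '') if line.start_with?('>')
    end
    nil
  end

  # Yield entries until next_entry returns nil.
  def each
    return enum_for(:each) unless block_given?

    while (entry = next_entry)
      yield entry
    end
  end
end

fasta = StringIO.new(">gene1 | length=3\nATG\n>gene2 | length=6\nATGAAA\n")
walker = HeaderWalker.new(fasta)
names = walker.map { |header| header.split(' | ').first }
puts names.inspect
```

Because iteration consumes the underlying input, calling next_entry again after the loop finishes returns nil.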
#next_entry ⇒ Object
Returns the next entry in the FASTA file, or nil if there are no more entries or the file has not been opened successfully.
# File 'lib/eupathdb_fasta.rb', line 32
def next_entry
  return nil if !@flat
  n = @flat.next_entry
  return nil if !n
  s = parse_name(n.definition)
  s.sequence = n.seq
  return s
end
#parse_name(definition) ⇒ Object
# File 'lib/eupathdb_fasta.rb', line 42
def parse_name(definition)
  s = FastaAnnotation.new
  regex = /^(\S+)\|(.*?) \| organism=#{@species_name} \| product=(.*?) \| location=(.*) \| length=\d+$/
  matches = definition.match(regex)
  if !matches
    raise Exception, "Definition line has unexpected format: `#{definition}'. Trying to match this line to the regular expression `#{regex.inspect}'"
  end
  matches2 = matches[4].match(/^(.+?)\:/)
  if !matches2
    raise ParseException, "Definition line has unexpected scaffold format: #{matches[4]}"
  end
  s.sequencing_centre = matches[1]
  s.scaffold = matches2[1]
  s.gene_id = matches[2]
  s.annotation = matches[3]
  return s
end
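To show what the pattern in parse_name captures, here is a standalone sketch that applies the same regular expression to the example definition line. The Annotation struct and the parse_definition function are hypothetical stand-ins for FastaAnnotation and the instance method. The sketch also differs from the listing in two ways: it passes the species name through Regexp.escape before interpolation, and it raises ArgumentError rather than the exception classes used in the library.

```ruby
# Illustrative stand-in for FastaAnnotation; the member names follow the
# attributes assigned in parse_name.
Annotation = Struct.new(:sequencing_centre, :gene_id, :annotation, :scaffold)

# Hypothetical standalone variant of parse_name. It escapes the species
# name so that regex metacharacters in it are matched literally.
def parse_definition(definition, species_name)
  regex = /^(\S+)\|(.*?) \| organism=#{Regexp.escape(species_name)} \| product=(.*?) \| location=(.*) \| length=\d+$/
  m = definition.match(regex)
  raise ArgumentError, "unexpected definition line: #{definition}" unless m

  # The scaffold is everything in the location field before the first colon.
  scaffold = m[4][/^(.+?):/, 1]
  raise ArgumentError, "unexpected location format: #{m[4]}" unless scaffold

  Annotation.new(m[1], m[2], m[3], scaffold)
end

# Definition lines are passed without the leading ">".
definition = 'gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | ' \
             'product=myb-like DNA binding domain-containing protein | ' \
             'location=TGME49_chrVIII:6835359-6840923(-) | length=1528'

ann = parse_definition(definition, 'Toxoplasma_gondii_ME49')
puts ann.gene_id
puts ann.scaffold
```

Passing a species name that does not match the organism field causes the sketch to raise, mirroring how parse_name rejects definition lines from a different organism.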