Class: Bio::EuPathDB::FastaParser

Inherits:
Object
Defined in:
lib/eupathdb_fasta.rb

Overview

EuPathDB databases appear to have settled on a FASTA definition line of the form

  >gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | product=myb-like DNA binding domain-containing protein | location=TGME49_chrVIII:6835359-6840923(-) | length=1528

where the species name differs between databases but the rest of the layout is mostly constant.
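To make the layout of such a definition line concrete, here is a minimal sketch that splits the sample header into its parts. It is not how FastaParser works internally (the class uses a single regular expression); `split_header` is a hypothetical helper introduced only for illustration, and it assumes the leading `>` has already been removed and that no field value contains ` | `.

```ruby
# Hypothetical helper, not part of Bio::EuPathDB::FastaParser.
# Splits a EuPathDB-style definition line (without the leading '>')
# into the identifier part and the key=value fields that follow it.
def split_header(header)
  id_part, *pairs = header.split(' | ')
  # The identifier part looks like "gb|TGME49_000380".
  centre, gene_id = id_part.split('|', 2)
  # Each remaining field looks like "key=value"; split on the first '=' only.
  fields = pairs.map { |pair| pair.split('=', 2) }.to_h
  { 'sequencing_centre' => centre, 'gene_id' => gene_id }.merge(fields)
end

# Sample definition line from the overview.
sample = 'gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | ' \
         'product=myb-like DNA binding domain-containing protein | ' \
         'location=TGME49_chrVIII:6835359-6840923(-) | length=1528'

split_header(sample).each { |key, value| puts "#{key}: #{value}" }
```

Only the `organism` value is expected to change from one EuPathDB database to another, which is why the constructor takes the species name as an argument.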

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(species_name, filename) ⇒ FastaParser

The species name is the value expected in the organism= field (the second pipe-delimited field) of each definition line. For instance, it is ‘Toxoplasma_gondii_ME49’ for

  >gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | product=myb-like DNA binding domain-containing protein | location=TGME49_chrVIII:6835359-6840923(-) | length=1528



# File 'lib/eupathdb_fasta.rb', line 15

def initialize(species_name, filename)
  @species_name = species_name
  @filename = filename
end

Instance Attribute Details

#species_name ⇒ Object

Returns the value of attribute species_name.



# File 'lib/eupathdb_fasta.rb', line 9

def species_name
  @species_name
end

Instance Method Details

#each ⇒ Object

Opens the FASTA file and enumerates its entries, yielding each parsed entry in turn.



# File 'lib/eupathdb_fasta.rb', line 21

def each
  # Open the FASTA file; next_entry reads from @flat.
  @flat = Bio::FlatFile.open(Bio::FastaFormat, @filename)
  # Yield parsed entries until next_entry returns nil.
  while (entry = next_entry)
    yield entry
  end
end
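This page lists no included modules, so the class may not mix in Enumerable; methods such as `select` or `map` would then be unavailable on the parser itself. The sketch below shows one way around that, using `to_enum` to wrap a block-only `each` in an Enumerator. `StubParser` and the letter entries are hypothetical stand-ins so that the example runs without BioRuby or a FASTA file; only the shape of `each` mirrors the method above.

```ruby
# Hypothetical stand-in that mimics the shape of FastaParser#each:
# it yields entries one at a time and expects a block.
class StubParser
  def initialize(entries)
    @entries = entries
  end

  def each
    @entries.each { |entry| yield entry }
  end
end

parser = StubParser.new(%w[a b c])

# Object#to_enum wraps each in an Enumerator, which provides the
# Enumerable methods without changing the class itself.
entries = parser.to_enum(:each)
puts entries.map(&:upcase).inspect
puts entries.first(2).inspect
```

With the real parser, the same pattern would presumably read `parser.to_enum(:each)`. Note that each call to `each` reopens the file, so every enumeration starts again from the first entry.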

#next_entry ⇒ Object

Returns the next entry in the FASTA file, or nil if there are no more entries or the FASTA file has not been opened correctly.



# File 'lib/eupathdb_fasta.rb', line 32

def next_entry
  # No open file to read from.
  return nil unless @flat
  n = @flat.next_entry
  # End of file.
  return nil unless n

  # Parse the definition line, then attach the sequence.
  s = parse_name(n.definition)
  s.sequence = n.seq
  s
end

#parse_name(definition) ⇒ Object



# File 'lib/eupathdb_fasta.rb', line 42

def parse_name(definition)
  s = FastaAnnotation.new

  # Escape the species name so that regex metacharacters in it match literally.
  regex = /^(\S+)\|(.*?) \| organism=#{Regexp.escape(@species_name)} \| product=(.*?) \| location=(.*) \| length=\d+$/
  matches = definition.match(regex)

  # Raise ParseException in both failure cases rather than the bare Exception class.
  unless matches
    raise ParseException, "Definition line has unexpected format: `#{definition}'. Trying to match this line to the regular expression `#{regex.inspect}'"
  end

  # The scaffold is the part of the location before the first colon.
  matches2 = matches[4].match(/^(.+?):/)
  unless matches2
    raise ParseException, "Definition line has unexpected scaffold format: #{matches[4]}"
  end
  s.sequencing_centre = matches[1]
  s.scaffold = matches2[1]
  s.gene_id = matches[2]
  s.annotation = matches[3]
  s
end
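To see what the regular expression in parse_name captures, the following standalone sketch applies the same pattern to the sample definition line from the overview. It is an illustration only: `definition_regex` and `capture_fields` are hypothetical helpers, a plain Hash stands in for FastaAnnotation, and nil is returned on a mismatch instead of raising. `Regexp.escape` is used so that regex metacharacters in a species name are matched literally.

```ruby
# Hypothetical helper: builds the definition-line pattern for one species.
def definition_regex(species_name)
  /^(\S+)\|(.*?) \| organism=#{Regexp.escape(species_name)} \| product=(.*?) \| location=(.*) \| length=\d+$/
end

# Hypothetical helper: returns the captured fields as a Hash,
# or nil when the definition line does not match.
def capture_fields(species_name, definition)
  m = definition.match(definition_regex(species_name))
  return nil unless m

  {
    sequencing_centre: m[1],
    gene_id: m[2],
    annotation: m[3],
    # The scaffold is the part of the location before the first colon.
    scaffold: m[4][/^(.+?):/, 1]
  }
end

sample = 'gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | ' \
         'product=myb-like DNA binding domain-containing protein | ' \
         'location=TGME49_chrVIII:6835359-6840923(-) | length=1528'

p capture_fields('Toxoplasma_gondii_ME49', sample)
# A different species name does not match, so this prints nil.
p capture_fields('Neospora_caninum', sample)
```

The length field is matched but not captured, mirroring the pattern in parse_name.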