Class: SequenceServer::Doctor

Inherits:
Object
Extended by:
Forwardable
Defined in:
lib/sequenceserver/doctor.rb

Overview

Doctor detects inconsistencies in the configured databases that are likely to cause problems with SequenceServer operation.

Constant Summary

ERROR_PARSE_SEQIDS = 1
ERROR_NUMERIC_IDS = 2
ERROR_PROBLEMATIC_IDS = 3
AVOID_ID_REGEX = /^(?!gi|bbs)\w+\|\w*\|?/

Constructor Details

#initialize ⇒ Doctor

Returns a new instance of Doctor.

# File 'lib/sequenceserver/doctor.rb', line 98

def initialize
  @ignore     = []
  @all_seqids = Doctor.all_sequence_ids(@ignore)
end

Instance Attribute Details

#all_seqids ⇒ Object (readonly)

Returns the value of attribute all_seqids.

# File 'lib/sequenceserver/doctor.rb', line 103

def all_seqids
  @all_seqids
end

#invalids ⇒ Object (readonly)

Returns the value of attribute invalids.

# File 'lib/sequenceserver/doctor.rb', line 103

def invalids
  @invalids
end

Class Method Details

.all_sequence_ids(ignore) ⇒ Object

Retrieves sequence ids (blastdbcmd's %i output format) from all databases, skipping any database listed in ignore. Accession numbers are not used because they are problematic for several reasons.

# File 'lib/sequenceserver/doctor.rb', line 31

def all_sequence_ids(ignore)
  Database.map do |db|
    next if ignore.include? db

    out = `blastdbcmd -entry all -db #{db.name} -outfmt "%i" 2> /dev/null`
    {
      db:     db,
      seqids: out.to_s.split
    }
  end.compact
end
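
The return value is easier to follow with a concrete shape in hand. The following is a minimal sketch that builds the same array of hashes without shelling out to blastdbcmd. SketchDb, collect_seqids, and the raw_output hash are hypothetical stand-ins introduced for illustration; they are not part of SequenceServer.

```ruby
# Hypothetical stand-in for a database object; only `name` is needed here.
SketchDb = Struct.new(:name)

# Mirrors the structure produced by all_sequence_ids, but reads the raw
# blastdbcmd output from a hash (assumption) instead of running the command.
def collect_seqids(databases, raw_output, ignore)
  databases.map do |db|
    next if ignore.include? db # ignored databases yield nil

    { db: db, seqids: raw_output.fetch(db.name, '').to_s.split }
  end.compact # drop the nil entries left by `next`
end

dbs = [SketchDb.new('proteins'), SketchDb.new('genome')]
raw = { 'proteins' => "SI_1\nSI_2\n", 'genome' => "SI_3\n" }

collect_seqids(dbs, raw, [dbs[1]]).each do |entry|
  puts "#{entry[:db].name}: #{entry[:seqids].inspect}"
end
```

Each element pairs a database with the whitespace-separated ids reported for it, which is the structure the inspect_* methods below consume.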

.bullet_list(values) ⇒ Object

Pretty-prints a list of databases, one indented bullet per line.

# File 'lib/sequenceserver/doctor.rb', line 54

def bullet_list(values)
  list = ''
  values.each do |value|
    list << "      - #{value}\n"
  end
  list
end

.inspect_parse_seqids(seqids) ⇒ Object

Databases formatted without the -parse_seqids option do not support fetching sequence ids with blastdbcmd's '%i' output format. For such databases the id list contains 'N/A' values instead of ids, so this method returns every database whose id list includes 'N/A'.

# File 'lib/sequenceserver/doctor.rb', line 47

def inspect_parse_seqids(seqids)
  seqids.map do |sq|
    sq[:db] if sq[:seqids].include? 'N/A'
  end.compact
end
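
As an illustration of the 'N/A' check, here is a small sketch run against hand-written sample data. The method name dbs_missing_parse_seqids and the string database names are assumptions made for the example; the real method receives database objects.

```ruby
# Same logic as inspect_parse_seqids, under a hypothetical name.
def dbs_missing_parse_seqids(seqids)
  seqids.map do |sq|
    sq[:db] if sq[:seqids].include? 'N/A'
  end.compact
end

sample = [
  { db: 'formatted_db',   seqids: %w[SI_1 SI_2] },
  { db: 'unformatted_db', seqids: %w[N/A N/A] }
]

puts dbs_missing_parse_seqids(sample).inspect
```

Only 'unformatted_db' is reported, because a single 'N/A' entry is enough to flag a database.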

.inspect_seqids(seqids, &block) ⇒ Object

Returns the databases that contain at least one sequence id for which the given block returns a truthy value.

# File 'lib/sequenceserver/doctor.rb', line 23

def inspect_seqids(seqids, &block)
  seqids.map do |sq|
    sq[:db] unless sq[:seqids].select(&block).empty?
  end.compact
end
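
The block-based filter can be sketched as follows. The name dbs_with_matching_ids and the sample data are hypothetical; the body uses any?, which should be equivalent to the select(...).empty? test above for this purpose but stops at the first match.

```ruby
# Returns the :db of every entry with at least one id accepted by the block.
def dbs_with_matching_ids(seqids, &block)
  seqids.map do |sq|
    sq[:db] if sq[:seqids].any?(&block)
  end.compact
end

sample = [
  { db: 'short_ids', seqids: %w[A1 B2] },
  { db: 'long_ids',  seqids: %w[A1 a_rather_long_identifier] }
]

# Any predicate on a single id can be passed as the block.
puts dbs_with_matching_ids(sample) { |id| id.length > 10 }.inspect
```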

.show_message(error, values) ⇒ Object

Prints the diagnostic message for the given error type, listing the affected databases. Prints nothing if values is empty.

# File 'lib/sequenceserver/doctor.rb', line 64

def show_message(error, values)
  return if values.empty?

  case error
  when ERROR_PARSE_SEQIDS
    puts <<~MSG
      *** Doctor has found improperly formatted database:
      #{bullet_list(values)}
      Please reformat your databases with -parse_seqids switch (or use
      sequenceserver -m) for using SequenceServer as the current format
      may cause problems.

      These databases are ignored in further checks.
    MSG

  when ERROR_NUMERIC_IDS
    puts <<~MSG
      *** Doctor has found databases with numeric sequence ids:
      #{bullet_list(values)}
      Note that this may cause problems with sequence retrieval.
    MSG

  when ERROR_PROBLEMATIC_IDS
    puts <<~MSG
      *** Doctor has found databases with problematic sequence ids:
      #{bullet_list(values)}
      This causes some sequence to contain extraneous words like `gnl|`
      appended to their id string.
    MSG
  end
end

Instance Method Details

#check_id_format ⇒ Object

Warns users about sequence ids of the form abc|def, because BLAST+ then adds gnl (for general) in front of the database identifiers. Only two prefixes need to be excluded when searching for this format: bbs|number and gi|number. Note that while sequence ids could in principle be arbitrary, using -parse_seqids reduces the search space substantially.

# File 'lib/sequenceserver/doctor.rb', line 147

def check_id_format
  selector = proc { |id| id.match(AVOID_ID_REGEX) }

  Doctor.show_message(ERROR_PROBLEMATIC_IDS,
                      Doctor.inspect_seqids(@all_seqids, &selector))
end
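
To make the pattern concrete, the sketch below copies AVOID_ID_REGEX from the constant summary and tries it on a few made-up ids. The helper problematic_id? and all sample ids are assumptions for illustration only.

```ruby
# Copied from the constant summary above.
AVOID_ID_REGEX = /^(?!gi|bbs)\w+\|\w*\|?/

# Hypothetical helper: true when an id would be reported by check_id_format.
def problematic_id?(id)
  id.match?(AVOID_ID_REGEX)
end

samples = {
  'abc|def'       => true,  # word|word form
  'gnl|mydb|seq1' => true,  # already carries a gnl| prefix
  'gi|12345'      => false, # excluded by the negative lookahead
  'bbs|678'       => false, # excluded by the negative lookahead
  'SI_1'          => false, # no pipe, so nothing to match
  'gibbon|x'      => false  # the lookahead excludes anything starting with "gi"
}

samples.each do |id, expected|
  puts format('%-14s %-5s (expected %s)', id, problematic_id?(id), expected)
end
```

The last sample shows that the lookahead tests only the leading characters, so an id whose first word merely begins with gi or bbs is also skipped.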

#check_numeric_ids ⇒ Object

Checks each database for numeric sequence ids.

# File 'lib/sequenceserver/doctor.rb', line 133

def check_numeric_ids
  selector = proc { |id| !id.to_i.zero? }

  Doctor.show_message(ERROR_NUMERIC_IDS,
                      Doctor.inspect_seqids(@all_seqids, &selector))
end
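
Because the selector relies on String#to_i, it helps to see which ids it accepts. The sketch below wraps the same expression in a hypothetical helper, flagged_as_numeric?, and contrasts it with strictly_numeric?, a stricter digits-only test that is offered purely as an assumption for comparison and is not part of SequenceServer.

```ruby
# Same expression as the selector in check_numeric_ids.
def flagged_as_numeric?(id)
  !id.to_i.zero?
end

# Hypothetical stricter alternative: digits only, start to end.
def strictly_numeric?(id)
  id.match?(/\A\d+\z/)
end

%w[12345 SI_1 123abc 0].each do |id|
  puts format('%-8s to_i check: %-5s digits only: %s',
              id, flagged_as_numeric?(id), strictly_numeric?(id))
end
```

String#to_i parses leading digits and returns 0 when there are none, so 123abc is flagged while the id 0 is not.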

#check_parse_seqids ⇒ Object

Finds databases that were not formatted with -parse_seqids, reports them, and adds them to the ignore list.

# File 'lib/sequenceserver/doctor.rb', line 125

def check_parse_seqids
  without_parse_seqids = Doctor.inspect_parse_seqids(@all_seqids)
  Doctor.show_message(ERROR_PARSE_SEQIDS, without_parse_seqids)

  @ignore.concat(without_parse_seqids)
end

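#diagnose ⇒ Object

Runs the three checks in sequence, printing a progress line before each. Databases that fail the -parse_seqids check are removed before the remaining two checks run.
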
# File 'lib/sequenceserver/doctor.rb', line 105

def diagnose
  puts "\n1/3 Inspecting databases for proper -parse_seqids formatting.."
  check_parse_seqids
  remove_invalid_databases

  puts "\n2/3 Inspecting databases for numeric sequence ids.."
  check_numeric_ids

  puts "\n3/3 Inspecting databases for problematic sequence ids.."
  check_id_format
end

#remove_invalid_databases ⇒ Object

Removes entries whose database is in the ignore list, that is, databases not formatted with the -parse_seqids option.

# File 'lib/sequenceserver/doctor.rb', line 119

def remove_invalid_databases
  @all_seqids.delete_if { |sq| @ignore.include? sq[:db] }
end
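
A short sketch of the filtering step, using a hypothetical drop_ignored helper and string database names in place of real database objects. It illustrates that Array#delete_if modifies the array in place, which is why the later checks in diagnose no longer see the ignored databases.

```ruby
# Removes, in place, every entry whose :db appears in the ignore list.
def drop_ignored(all_seqids, ignore)
  all_seqids.delete_if { |sq| ignore.include? sq[:db] }
end

all_seqids = [
  { db: 'formatted_db',   seqids: %w[SI_1 SI_2] },
  { db: 'unformatted_db', seqids: %w[N/A N/A] }
]
ignore = ['unformatted_db']

drop_ignored(all_seqids, ignore)
puts all_seqids.map { |sq| sq[:db] }.inspect
```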