Class: GeneValidator::BlastUtils
- Inherits:
-
Object
- Object
- GeneValidator::BlastUtils
- Extended by:
- Forwardable
- Defined in:
- lib/genevalidator/blast.rb
Overview
Contains methods that run BLAST and methods that analyse sequences
Constant Summary collapse
- EVALUE =
1e-5
Class Method Summary collapse
-
.guess_sequence_type(sequence_string) ⇒ Object
Strips all non-letter characters.
- .guess_sequence_type_from_input_file(file = ) ⇒ Object
-
.parse_next(iterator, type = ) ⇒ Object
Parses the next query from the blast xml output query Params:
iterator
: blast xml iterator for hitstype
: the type of the sequence: :nucleotide or :protein Outputs: Array ofSequence
objects corresponding to the list of hits. -
.run_blast(query, db = opt[:db], seq_type = config[:type], num_threads = opt[:num_threads]) ⇒ Object
Calls blast from standard input with specific parameters Params:
blast_type
: blast command in String format (e.g ‘blast(x/p)’)query
: String containing the the query in fasta formatdb
: databasenum_threads
: The number of threads to run BLAST with. -
.run_blast_on_input_file(input_file = opt[:input_fasta_file], db = opt[:db], seq_type = config[:type], num_threads = opt[:num_threads]) ⇒ Object
Runs BLAST on an input file Params:
blast_type
: blast command in String format (e.g ‘blastx’ or ‘blastp’)opt
: Hash made of :input_fasta_file :blast_xml_file, :db, :num_threadsgapopen
: gapopen blast parametergapextend
: gapextend blast parameternr_hits
: max number of hits Output: XML file. -
.type_of_sequences(fasta_format_string) ⇒ Object
Method copied from sequenceserver/sequencehelpers.rb Splits input at putative fasta definition lines (like “>adsfadsf”); then guesses sequence type for each sequence.
Class Method Details
.guess_sequence_type(sequence_string) ⇒ Object
Strips all non-letter characters. guestimates sequence based on that. If less than 10 useable characters… returns nil If more than 90% ACGTU returns :nucleotide. else returns :protein Params: sequence_string
: String to validate Output: nil, :nucleotide or :protein
182 183 184 185 186 187 188 189 |
# File 'lib/genevalidator/blast.rb', line 182 def guess_sequence_type(sequence_string) # removing non-letter and ambiguous characters cleaned_sequence = sequence_string.gsub(/[^A-Z]|[NX]/i, '') return nil if cleaned_sequence.length < 10 # conservative type = Bio::Sequence.new(cleaned_sequence).guess(0.9) (type == Bio::Sequence::NA) ? :nucleotide : :protein end |
.guess_sequence_type_from_input_file(file = ) ⇒ Object
193 194 195 196 197 198 199 200 |
# File 'lib/genevalidator/blast.rb', line 193 def guess_sequence_type_from_input_file(file = opt[:input_fasta_file]) lines = File.foreach(file).first(10) seqs = '' lines.each do |l| seqs += l.chomp unless l[0] == '>' end guess_sequence_type(seqs) end |
.parse_next(iterator, type = ) ⇒ Object
Parses the next query from the blast xml output query Params: iterator
: blast xml iterator for hits type
: the type of the sequence: :nucleotide or :protein Outputs: Array of Sequence
objects corresponding to the list of hits
87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 |
# File 'lib/genevalidator/blast.rb', line 87 def parse_next(iterator, type = config[:type]) hits = [] iter = iterator.next # parse blast the xml output and get the hits # hits obtained are proteins! (we use only blastp and blastx) iter.each do |hit| seq = Query.new seq.length_protein = hit.len.to_i seq.type = :protein seq.identifier = hit.hit_id seq.definition = hit.hit_def seq.accession_no = hit.accession # get all high-scoring segment pairs (hsp) hsps = [] hit.hsps.each do |hsp| current_hsp = Hsp.new current_hsp.hsp_evalue = format('%.0e', hsp.evalue) current_hsp.hit_from = hsp.hit_from.to_i current_hsp.hit_to = hsp.hit_to.to_i current_hsp.match_query_from = hsp.query_from.to_i current_hsp.match_query_to = hsp.query_to.to_i if type == :nucleotide current_hsp.match_query_from /= 3 current_hsp.match_query_to /= 3 current_hsp.match_query_from += 1 current_hsp.match_query_to += 1 end current_hsp.query_reading_frame = hsp.query_frame.to_i current_hsp.hit_alignment = hsp.hseq.to_s if guess_sequence_type(current_hsp.hit_alignment) != :protein fail SequenceTypeError end current_hsp.query_alignment = hsp.qseq.to_s if guess_sequence_type(current_hsp.query_alignment) != :protein fail SequenceTypeError end current_hsp.align_len = hsp.align_len.to_i current_hsp.identity = hsp.identity.to_i current_hsp.pidentity = (100 * hsp.identity / hsp.align_len.to_f) .round(2) hsps.push(current_hsp) end seq.hsp_list = hsps hits.push(seq) end hits rescue SequenceTypeError => e $stderr.puts e exit 1 rescue StopIteration nil end |
.run_blast(query, db = opt[:db], seq_type = config[:type], num_threads = opt[:num_threads]) ⇒ Object
Calls blast from standard input with specific parameters Params: blast_type
: blast command in String format (e.g ‘blast(x/p)’) query
: String containing the the query in fasta format db
: database num_threads
: The number of threads to run BLAST with. Output: String with the blast xml output
28 29 30 31 32 33 34 35 36 37 38 39 40 |
# File 'lib/genevalidator/blast.rb', line 28 def run_blast(query, db = opt[:db], seq_type = config[:type], num_threads = opt[:num_threads]) blast_type = (seq_type == :protein) ? 'blastp' : 'blastx' # -num_threads is not supported on remote databases threads = (db !~ /remote/) ? "-num_threads #{num_threads}" : '' blastcmd = "#{blast_type} -db '#{db}' -evalue #{EVALUE} -outfmt 5" \ " #{threads}" cmd = "echo \"#{query}\" | #{blastcmd}" `#{cmd} >/dev/null 2>&1` end |
.run_blast_on_input_file(input_file = opt[:input_fasta_file], db = opt[:db], seq_type = config[:type], num_threads = opt[:num_threads]) ⇒ Object
Runs BLAST on an input file Params: blast_type
: blast command in String format (e.g ‘blastx’ or ‘blastp’) opt
: Hash made of :input_fasta_file :blast_xml_file, :db, :num_threads gapopen
: gapopen blast parameter gapextend
: gapextend blast parameter nr_hits
: max number of hits Output: XML file
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 |
# File 'lib/genevalidator/blast.rb', line 52 def run_blast_on_input_file(input_file = opt[:input_fasta_file], db = opt[:db], seq_type = config[:type], num_threads = opt[:num_threads]) return if opt[:blast_xml_file] || opt[:blast_tabular_file] $stderr.puts 'Running BLAST. This may take a while.' opt[:blast_xml_file] = input_file + '.blast_xml' blast_type = (seq_type == :protein) ? 'blastp' : 'blastx' # -num_threads is not supported on remote databases threads = (opt[:db] !~ /remote/) ? "-num_threads #{num_threads}" : '' blastcmd = "#{blast_type} -query '#{input_file}'" \ " -out '#{opt[:blast_xml_file]}' -db #{db} " \ " -evalue #{EVALUE} -outfmt 5 #{threads}" `#{blastcmd} >/dev/null 2>&1` return unless File.zero?(opt[:blast_xml_file]) $stderr.puts 'Blast failed to run on the input file.' if opt[:db] !~ /remote/ $stderr.puts 'Please ensure that the BLAST database exists and try' $stderr.puts 'again.' else $stderr.puts 'You are using BLAST with a remote database. Please' $stderr.puts 'ensure that you have internet access and try again.' end end |
.type_of_sequences(fasta_format_string) ⇒ Object
Method copied from sequenceserver/sequencehelpers.rb Splits input at putative fasta definition lines (like “>adsfadsf”); then guesses sequence type for each sequence. If not enough sequence to determine, returns nil. If 2 kinds of sequence mixed together, raises ArgumentError Otherwise, returns :nucleotide or :protein Params: sequence_string
: String to validate Output: nil, :nucleotide or :protein
163 164 165 166 167 168 169 170 171 172 |
# File 'lib/genevalidator/blast.rb', line 163 def type_of_sequences(fasta_format_string) # the first sequence does not need to have a fasta definition line sequences = fasta_format_string.split(/^>.*$/).delete_if(&:empty?) # get all sequence types sequence_types = sequences.collect { |seq| guess_sequence_type(seq) } .uniq.compact return nil if sequence_types.empty? sequence_types.first if sequence_types.length == 1 end |