Class: ViralSeq::SeqHash

Inherits:

Object

Object
ViralSeq::SeqHash

show all

Defined in:: lib/viral_seq/seq_hash.rb,
lib/viral_seq/hivdr.rb

Overview

ViralSeq::SeqHash class for operation on multiple sequences.

Examples:

read a FASTA sequence file of HIV PR sequences, make alignment, perform the QC location check, filter sequences with stop codons and APOBEC3g/f hypermutations, calculate pairwise diversity, calculate minority cut-off based on Poisson model, and examine for drug resistance mutations.

my_pr_seqhash = ViralSeq::SeqHash.fa('my_pr_fasta_file.fasta')
  # new ViralSeq::SeqHash object from a FASTA file
aligned_pr_seqhash = my_pr_seqhash.align
  # align with MUSCLE
filtered_seqhash = aligned_pr_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
  # filter nt sequences with the reference coordinates
filtered_seqhash = aligned_pr_seqhash.stop_codon[:without_stop_codon]
  # return a new ViralSeq::SeqHash object without stop codons
filtered_seqhash = filtered_seqhash.a3g[:filtered_seq]
  # further filter out sequences with A3G hypermutations
filtered_seqhash.pi
  # return pairwise diveristy π
cut_off = filtered_seqhash.pm
  # return cut-off for minority variants based on Poisson model
filtered_seqhash.sdrm_hiv_pr(cut_off)
  # examine for drug resistance mutations for PR region.

Instance Attribute Summary collapse

#aa_hash ⇒ Hash

Hash object for :name => :amino_acid_sequence_string pairs.
#dna_hash ⇒ Hash

Hash object for :name => :sequence_string pairs.
#file ⇒ String

The file that is used to initialize SeqHash object, if it exists.
#qc_hash ⇒ Hash

Hash object for :name => :qc_score_string pairs.
#title ⇒ String

default as the file basename if SeqHash object is initialized using ::fa or ::fq.

Class Method Summary collapse

.new_from_aa_fasta(infile) ⇒ ViralSeq::SeqHash (also: aa_fa)

initialize a new ViralSeq::SeqHash object from a FASTA format sequence file of amino acid sequences.
.new_from_array(seq_array, master_tag = 'seq') ⇒ ViralSeq::SeqHash (also: array)

initialize a ViralSeq::SeqHash object with an array of sequence strings.
.new_from_fasta(infile) ⇒ ViralSeq::SeqHash (also: fa)

initialize a new ViralSeq::SeqHash object from a FASTA format sequence file.
.new_from_fastq(fastq_file) ⇒ ViralSeq::SeqHash (also: fq)

initialize a new ViralSeq::SeqHash object from a FASTQ format sequence file.

Instance Method Summary collapse

#+(sh2) ⇒ ViralSeq::SeqHash

combine SeqHash objects.
#a3g_hypermut(ref = nil) ⇒ Hash (also: #a3g)

function to determine if the sequences have APOBEC3g/f hypermutation.
#align(algorithm = :PPP, path_to_muscle = false) ⇒ SeqHash

align the @dna_hash sequences, return a new ViralSeq::SeqHash object with aligned @dna_hash using MUSCLE.
#check_nt_size ⇒ Hash

check the size range of the DNA sequences of the SeqHash object.
#collapse(cutoff = 1) ⇒ ViralSeq::SeqHash

Collapse sequences by difference cut-offs.
#complete_with_ref(region_config) ⇒ Object
#consensus(cutoff = 0.5) ⇒ String

create one consensus sequence from @dna_hash with an optional majority cut-off for mixed bases.
#drm(region_config) ⇒ Object

function to interpret HIV drms with ViralSeq::DrmRegionConfig as a param.
#error_table(ref = self.consensus, head = true) ⇒ Array

return an table of frequencies of nucleotides at each position.
#fdr(error_rate = 0.0001) ⇒ Hash

calculate false detection rate for minority mutations Credit: Prof.
#filter_for_drm(region_config) ⇒ Object

wrapper function for #a3g_hypermut and #stop_codon with ViralSeq::DrmRegionConfig as a param.
#filter_similar_pid(cutoff = 10) ⇒ ViralSeq::SeqHash

Remove sequences with residual offspring Primer IDs.
#gap_strip(option = :nt) ⇒ ViralSeq::SeqHash

gap strip from a sequence alignment, all positions that contains gaps (‘-’) will be removed.
#gap_strip_ends(option = :nt) ⇒ ViralSeq::SeqHash

gap strip from a sequence alignment at both ends, only positions at the ends that contains gaps (‘-’) will be removed.
#hiv_seq_qc(start_nt, end_nt, indel = true, ref_option = :HXB2, path_to_muscle = false) ⇒ ViralSeq::SeqHash

quality check for HIV sequences based on ViralSeq::Sequence#locator, check if sequences are in the target range.
#initialize(dna_hash = {}, aa_hash = {}, qc_hash = {}, title = "", file = "") ⇒ SeqHash constructor

initialize a ViralSeq::SeqHash object.
#mutation(error_rate = 0.01) ⇒ ViralSeq::SeqHash

mutate @dna_hash based on the error_rate.
#nt_range(range) ⇒ ViralSeq::SeqHash

return a new SeqHash object with given a range on the nt sequence position.
#nt_variants ⇒ Hash

analysis for the nt sequence variants.
#nucleotide_pi ⇒ Float (also: #pi)

Function to calculate nucleotide diversity π, for nt sequence only.
#poisson_minority_cutoff(error_rate = 0.0001, fold_cutoff = 20) ⇒ Integer (also: #pm)

Define Poission cut-off for minority variants.
#qc_indel ⇒ Hash

QC for each nucleotide sequence comparing with sample consensus for indels.
#random_select(n = 100) ⇒ ViralSeq::SeqHash

randomly select n number of sequences from the orginal SeqHash object.
#sample(n = 1) ⇒ ViralSeq::SeqHash

sample a certain number of sequences from a SeqHash object.
#sdrm_hiv_in(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ Array

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV IN region.
#sdrm_hiv_pr(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ Array

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV PR region.
#sdrm_hiv_rt(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ Array

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV RT region.
#sequence_locator(ref_option = :HXB2) ⇒ Array (also: #loc)

sequence locator for SeqHash object, resembling HIV Sequence Locator from LANL.
#shannons_entropy(option = :nt) ⇒ Hash

calculate Shannon’s entropy, Euler’s number as the base of logarithm.
#size ⇒ Integer

the size of nt sequence hash of the SeqHash object.
#stop_codon(codon_position = 0) ⇒ Hash

screen for sequences with stop codons.
#sub(keys) ⇒ SeqHash

given an Array of sequence tags, return a sub ViralSeq::SeqHash object with the sequence tags.
#tn93 ⇒ Hash

TN93 distance functionl, tabulate pairwise comparison of sequence pairs in a sequence alignment, nt sequence only.
#to_rsphylip ⇒ String

generate sequences in relaxed sequencial phylip format from a ViralSeq::SeqHash object.
#translate(codon_position = 0) ⇒ NilClass

translate the DNA sequences in @dna_hash to amino acid sequences.
#trim(start_nt, end_nt, ref_option = :HXB2, path_to_muscle = false) ⇒ ViralSeq::SeqHash

trim dna sequences based on the provided reference coordinates.
#uniq_dna_hash(tag = "sequence") ⇒ ViralSeq::SeqHash (also: #uniq)

collapse @dna_hash to unique sequence hash.
#write_nt_fa(file) ⇒ NilClass

write the nt sequences to a FASTA format file.

Constructor Details

#initialize(dna_hash = {}, aa_hash = {}, qc_hash = {}, title = "", file = "") ⇒ `SeqHash`

initialize a ViralSeq::SeqHash object

# File 'lib/viral_seq/seq_hash.rb', line 25

def initialize (dna_hash = {}, aa_hash = {}, qc_hash = {}, title = "", file = "")
  @dna_hash = dna_hash
  @aa_hash = aa_hash
  @qc_hash = qc_hash
  @title = title
  @file = file
end

Instance Attribute Details

#aa_hash ⇒ `Hash`

Returns Hash object for :name => :amino_acid_sequence_string pairs.

Returns:

(Hash) —

Hash object for :name => :amino_acid_sequence_string pairs



37
38
39

# File 'lib/viral_seq/seq_hash.rb', line 37

def aa_hash
  @aa_hash
end

#dna_hash ⇒ `Hash`

Returns Hash object for :name => :sequence_string pairs.

Returns:

(Hash) —

Hash object for :name => :sequence_string pairs



34
35
36

# File 'lib/viral_seq/seq_hash.rb', line 34

def dna_hash
  @dna_hash
end

#file ⇒ `String`

Returns the file that is used to initialize SeqHash object, if it exists.

Returns:

(String) —

the file that is used to initialize SeqHash object, if it exists



47
48
49

# File 'lib/viral_seq/seq_hash.rb', line 47

def file
  @file
end

#qc_hash ⇒ `Hash`

Returns Hash object for :name => :qc_score_string pairs.

Returns:

(Hash) —

Hash object for :name => :qc_score_string pairs



40
41
42

# File 'lib/viral_seq/seq_hash.rb', line 40

def qc_hash
  @qc_hash
end

#title ⇒ `String`

default as the file basename if SeqHash object is initialized using ::fa or ::fq

Returns:

(String) —

the title of the SeqHash object.



44
45
46

# File 'lib/viral_seq/seq_hash.rb', line 44

def title
  @title
end

Class Method Details

.new_from_aa_fasta(infile) ⇒ `ViralSeq::SeqHash` Also known as: aa_fa

initialize a new ViralSeq::SeqHash object from a FASTA format sequence file of amino acid sequences

Parameters:

infile (String) —

path to the FASTA format sequence file of aa sequences

Returns:

(ViralSeq::SeqHash)

# File 'lib/viral_seq/seq_hash.rb', line 82

def self.new_from_aa_fasta(infile)
  f=File.open(infile,"r")
  return_hash = {}
  name = ""
  while line = f.gets do
    line.tr!("\u0000","")
    next if line == "\n"
    next if line =~ /^\=/
    if line =~ /^\>/
      name = line.chomp
      return_hash[name] = ""
    else
      return_hash[name] += line.chomp.upcase
    end
  end
  f.close
  seq_hash = ViralSeq::SeqHash.new
  seq_hash.aa_hash = return_hash
  seq_hash.title = File.basename(infile,".*")
  seq_hash.file = infile
  return seq_hash
end

.new_from_array(seq_array, master_tag = 'seq') ⇒ `ViralSeq::SeqHash` Also known as: array

initialize a ViralSeq::SeqHash object with an array of sequence strings

Parameters:

master_tag (String) (defaults to: 'seq') —

master tag to put in the sequence names

Returns:

(ViralSeq::SeqHash) —

No @qc_hash, @title will be the master_tag

# File 'lib/viral_seq/seq_hash.rb', line 150

def self.new_from_array(seq_array,master_tag = 'seq')
  n = 1
  hash = {}
  seq_array.each do |seq|
    hash[master_tag + "_" + n.to_s] = seq
    n += 1
  end
  seq_hash = ViralSeq::SeqHash.new
  seq_hash.dna_hash = hash
  seq_hash.title = master_tag
  return seq_hash
end

.new_from_fasta(infile) ⇒ `ViralSeq::SeqHash` Also known as: fa

initialize a new ViralSeq::SeqHash object from a FASTA format sequence file

Examples:

new ViralSeq::SeqHash object from a FASTA file

ViralSeq::SeqHash.fa('my_fasta_file.fasta')

Parameters:

infile (String) —

path to the FASTA format sequence file

Returns:

(ViralSeq::SeqHash)

# File 'lib/viral_seq/seq_hash.rb', line 55

def self.new_from_fasta(infile)
  f=File.open(infile,"r")
  return_hash = {}
  name = ""
  while line = f.gets do
    line.tr!("\u0000","")
    next if line == "\n"
    next if line =~ /^\=/
    if line =~ /^\>/
      name = line.chomp
      return_hash[name] = ""
    else
      return_hash[name] += line.chomp.upcase
    end
  end
  f.close
  seq_hash = ViralSeq::SeqHash.new
  seq_hash.dna_hash = return_hash
  seq_hash.title = File.basename(infile,".*")
  seq_hash.file = infile
  return seq_hash
end

.new_from_fastq(fastq_file) ⇒ `ViralSeq::SeqHash` Also known as: fq

initialize a new ViralSeq::SeqHash object from a FASTQ format sequence file

Examples:

new ViralSeq::SeqHash object from a FASTQ file

ViralSeq::SeqHash.fq('my_fastq_file.fastq')

Parameters:

fastq_file (String) —

path to the FASTA format sequence file

Returns:

(ViralSeq::SeqHash)

# File 'lib/viral_seq/seq_hash.rb', line 111

def self.new_from_fastq(fastq_file)
  count = 0
  sequence_a = []
  quality_a = []
  count_seq = 0

  File.open(fastq_file,'r') do |file|
    file.readlines.collect do |line|
      line.tr!("\u0000","")
      next if line == "\n"
      count +=1
      count_m = count % 4
      if count_m == 1
        line.tr!('@','>')
        sequence_a << line.chomp
        quality_a << line.chomp
        count_seq += 1
      elsif count_m == 2
        sequence_a << line.chomp
      elsif count_m == 0
        quality_a << line.chomp
      end
    end
  end
  sequence_hash = Hash[sequence_a.each_slice(2).to_a]
  quality_hash = Hash[quality_a.each_slice(2).to_a]

  seq_hash = ViralSeq::SeqHash.new
  seq_hash.dna_hash = sequence_hash
  seq_hash.qc_hash = quality_hash
  seq_hash.title = File.basename(fastq_file,".*")
  seq_hash.file = fastq_file
  return seq_hash
end

Instance Method Details

#+(sh2) ⇒ `ViralSeq::SeqHash`

combine SeqHash objects

Parameters:

sh2 (ViralSeq::SeqHash) —

another SeqHash

Returns:

(ViralSeq::SeqHash) —

combined SeqHash

# File 'lib/viral_seq/seq_hash.rb', line 182

def +(sh2)
  new_seqhash = ViralSeq::SeqHash.new
  new_seqhash.dna_hash = self.dna_hash.merge(sh2.dna_hash)
  new_seqhash.aa_hash = self.aa_hash.merge(sh2.aa_hash)
  new_seqhash.qc_hash = self.qc_hash.merge(sh2.qc_hash)
  new_seqhash.title = self.title + "_with_" + sh2.title
  new_seqhash.file = self.file + "," + sh2.file
  return new_seqhash
end

#a3g_hypermut(ref = nil) ⇒ `Hash` Also known as: a3g

function to determine if the sequences have APOBEC3g/f hypermutation.

# APOBEC3G/F pattern: GRD -> ARD
# control pattern: G[YN|RC] -> A[YN|RC]
# use the sample consensus to determine potential a3g sites (default) or provide external reference sequences as a `String`
# Two criteria to identify hypermutation
# 1. Fisher's exact test on the frequencies of G to A mutation at A3G positions vs. non-A3G positions
# 2. Poisson distribution of G to A mutations at A3G positions, outliers sequences
# note:  criteria 2 only applies on a sequence file containing more than 20 sequences,
#        b/c Poisson model does not do well on small sample size.

Examples:

identify apobec3gf mutations from a sequence fasta file

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_a3g_sequence1.fasta')
hypermut = my_seqhash.a3g
hypermut[:a3g_seq].dna_hash.keys
=> [">Seq7", ">Seq14"]
hypermut[:filtered_seq].dna_hash.keys
=> [">Seq1", ">Seq2", ">Seq5"]
hypermut[:stats]
=> [[">Seq7", 23, 68, 1, 54, 18.26, 4.308329383112348e-06], [">Seq14", 45, 68, 9, 54, 3.97, 5.2143571971582974e-08]]

identify apobec3gf mutations from another sequence fasta file

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_a3g_sequence2.fasta')
hypermut = my_seqhash.a3g
hypermut[:stats]
=> [[">CTAACACTCA_134_a3g-sample2", 4, 35, 0, 51, Infinity, 0.02465676660128911], [">ATAGTGCCCA_60_a3g-sample2", 4, 35, 1, 51, 5.83, 0.1534487353839561]]
# notice sequence ">ATAGTGCCCA_60_a3g-sample2" has a p value at 0.15, greater than 0.05,
# but it is still called as hypermutation sequence b/c it's Poisson outlier sequence.

Returns:

(Hash) —
three paris. :a3g_seq: a ViralSeq:SeqHash object for sequences with hypermutations :filtered_seq : a ViralSeq:SeqHash object for sequences without hypermutations :stats : a two-demensional array ‘[[a,b], [c,d]]` for statistic_info, including the following information,
```
# sequence tag
# G to A mutation numbers at potential a3g positions
# total potential a3g G positions
# G to A mutation numbers at non a3g positions
# total non a3g G positions
# a3g G to A mutation rate / non-a3g G to A mutation rate
# Fishers Exact P-value
```

See Also:

LANL Hypermut

# File 'lib/viral_seq/seq_hash.rb', line 489

def a3g_hypermut(ref = nil)
  # mut_hash number of apobec3g/f mutations per sequence
  mut_hash = {}
  hm_hash = {}
  out_hash = {}

  # total G->A mutations at apobec3g/f positions.
  total = 0

  unless ref
    # make consensus sequence for the input sequence hash
    ref = self.consensus
  end

  # obtain apobec3g positions and control positions
  apobec = apobec3gf(ref)
  mut = apobec[0]
  control = apobec[1]

  self.dna_hash.each do |k,v|
    a = 0 # muts
    b = 0 # potential mut sites
    c = 0 # control muts
    d = 0 # potenrial controls
    mut.each do |n|
      if v[n] == "A"
        a += 1
        b += 1
      else
        b += 1
      end
    end
    mut_hash[k] = a
    total += a

    control.each do |n|
      if v[n] == "A"
        c += 1
        d += 1
      else
        d += 1
      end
    end
    rr = (a/b.to_f)/(c/d.to_f)

    t1 = b - a
    t2 = d - c

    fet = ViralSeq::Rubystats::FishersExactTest.new
    fisher = fet.calculate(t1,t2,a,c)
    perc = fisher[:twotail]
    info = [k, a, b, c, d, rr.round(2), perc]
    out_hash[k] = info
    if perc < 0.05
      hm_hash[k] = info
    end
  end

  if self.dna_hash.size > 200
    rate = total.to_f/(self.dna_hash.size)
    count_mut = mut_hash.values.count_freq
    maxi_count = count_mut.values.max
    poisson_hash = ViralSeq::Math::PoissonDist.new(rate,maxi_count).poisson_hash
    cut_off = 0
    poisson_hash.each do |k,v|
      cal = self.dna_hash.size * v
      obs = count_mut[k]
      if obs >= 20 * cal
        cut_off = k
        break
      elsif k == maxi_count
        cut_off = maxi_count
      end
    end
    mut_hash.each do |k,v|
      if v > cut_off
        hm_hash[k] = out_hash[k]
      end
    end
  end

  hm_seq_hash = ViralSeq::SeqHash.new
  hm_hash.each do |k,_v|
    hm_seq_hash.dna_hash[k] = self.dna_hash[k]
  end

  hm_seq_hash.title = self.title + "_hypermut"
  hm_seq_hash.file = self.file
  filtered_seq_hash = self.sub(self.dna_hash.keys - hm_hash.keys)
  return { a3g_seq: hm_seq_hash,
           filtered_seq: filtered_seq_hash,
           stats: hm_hash.values
          }
end

#align(algorithm = :PPP, path_to_muscle = false) ⇒ `SeqHash`

align the @dna_hash sequences, return a new ViralSeq::SeqHash object with aligned @dna_hash using MUSCLE

Parameters:

algorithm (Symbol) (defaults to: :PPP) —

, algorithm for MUSCLE5 only. Choose from :PPP or :Super5.
path_to_muscle (String) (defaults to: false) —

, path to MUSCLE excutable. if not provided (as default), it will use RubyGem::MuscleBio

Returns:

(SeqHash) —

new SeqHash object of the aligned @dna_hash, the title has “_aligned”

# File 'lib/viral_seq/seq_hash.rb', line 720

def align(algorithm = :PPP, path_to_muscle = false)
  seq_hash = self.dna_hash
  if self.file.size > 0
    temp_dir = File.dirname(self.file)
  else
    temp_dir=File.dirname($0)
  end

  temp_file = File.join(temp_dir, "_temp_muscle_in")
  temp_aln = File.join(temp_dir, "_temp_muscle_aln")
  File.open(temp_file, 'w'){|f| seq_hash.each {|k,v| f.puts k; f.puts v}}
  if path_to_muscle
    unless ViralSeq::Muscle.check_muscle?(path_to_muscle)
      File.unlink(temp_file)
      return nil
    end
    print `#{path_to_muscle} -in #{temp_file} -out #{temp_aln} -quiet`
  else
    if MuscleBio::VERSION.to_f < 0.5
      MuscleBio.run("muscle -in #{temp_file} -out #{temp_aln} -quiet")
    else
      MuscleBio.exec(temp_file, temp_aln, algorithm)
    end
  end
  out_seq_hash = ViralSeq::SeqHash.fa(temp_aln)
  out_seq_hash.title = self.title + "_aligned"
  out_seq_hash.file = self.file
  File.unlink(temp_file)
  File.unlink(temp_aln)
  return out_seq_hash
end

#check_nt_size ⇒ `Hash`

check the size range of the DNA sequences of the SeqHash object

Returns:

(Hash) —

Hash of MAX_SIZE, min: MIN_SIZE

# File 'lib/viral_seq/seq_hash.rb', line 227

def check_nt_size
  dna_hash = self.dna_hash
  size_array = []
  dna_hash.values.each do |v|
    size_array << v.size
  end
  return { max: size_array.max, min: size_array.min }
end

#collapse(cutoff = 1) ⇒ `ViralSeq::SeqHash`

Collapse sequences by difference cut-offs. Suggesting aligning before using this function.

Parameters:

cutoff (Integer) (defaults to: 1) —

nt base differences. collapse sequences within [cutoff] differences

Returns:

(ViralSeq::SeqHash) —

a new SeqHash object of collapsed sequences

# File 'lib/viral_seq/seq_hash.rb', line 1030

def collapse(cutoff=1)
  seq_array = self.dna_hash.values
  new_seq_freq = {}
  seq_freq = seq_array.count_freq
  if seq_freq.size == 1
    new_seq_freq = seq_freq
  else
    uniq_seq = seq_freq.keys
    unique_seq_pair = uniq_seq.combination(2)
    dupli_seq = []
    unique_seq_pair.each do |pair|
      seq1 = pair[0]
      seq2 = pair[1]
      diff = seq1.compare_with(seq2)
      if diff <= cutoff
        freq1 = seq_freq[seq1]
        freq2 = seq_freq[seq2]
        freq1 >= freq2 ? dupli_seq << seq2 : dupli_seq << seq1
      end
    end

    seq_freq.each do |seq,freq|
      unless dupli_seq.include?(seq)
        new_seq_freq[seq] = freq
      end
    end
  end
  seqhash = ViralSeq::SeqHash.new
  n = 1
  new_seq_freq.each do |seq,freq|
    name = ">seq_" + n.to_s + '_' + freq.to_s
    seqhash.dna_hash[name] = seq
    n += 1
  end
  return seqhash
end

#complete_with_ref(region_config) ⇒ `Object`

# File 'lib/viral_seq/hivdr.rb', line 531

def complete_with_ref(region_config)
  complete_seqs = {}
  seq_coord = region_config.seq_coord

  ref = ViralSeq::RefSeq.get(region_config.ref_info["ref_type"].to_sym)
  a = region_config.ref_info["ref_coord"][0]
  b = region_config.ref_info["ref_coord"][1]
  c = seq_coord["minimum"]
  d = seq_coord["maximum"]

  if seq_coord["gap"]
    e = seq_coord["gap"]["minimum"]
    f = seq_coord["gap"]["maximum"]

    self.dna_hash.each do |k,v|
      complete_seqs[k] = ref[(a-1)..(c-2)] + v[0,(e-c)] + ref[(e-1)..(f-1)] + v[(e-c)..-1] + ref[d..(b-1)]
    end
  else
    self.dna_hash.each do |k,v|
      complete_seqs[k] = ref[(a-1)..(c-2)] + v + ref[d..(b-1)]
    end
  end

  return ViralSeq::SeqHash.new(complete_seqs)
end

#consensus(cutoff = 0.5) ⇒ `String`

create one consensus sequence from @dna_hash with an optional majority cut-off for mixed bases.

Examples:

consensus sequence from an array of sequences.

seq_array = %w{ ATTTTTTTTT
                AATTTTTTTT
                AAATTTTTTT
                AAAATTTTTT
                AAAAATTTTT
                AAAAAATTTT
                AAAAAAATTT
                AAAAAAAATT
                AAAAAAAAAT
                AAAAAAAAAA }
my_seqhash = ViralSeq::SeqHash.array(seq_array)
my_seqhash.consensus
=> 'AAAAAWTTTT'
my_seqhash.consensus(0.7)
=> 'AAAANNNTTT'

Parameters:

cutoff (Float) (defaults to: 0.5) —

majority cut-off for calling consensus bases. defult at (0.5), position with 15% “A” and 85% “G” will be called as “G” with 20% cut-off and as “R” with 10% cut-off. Using (0) will return use simply majority rule (no cutoff)

Returns:

(String) —

consensus sequence

# File 'lib/viral_seq/seq_hash.rb', line 419

def consensus(cutoff = 0.5)
  seq_array = self.dna_hash.values
  seq_length = seq_array[0].size
  seq_size = seq_array.size
  consensus_seq = ""
  (0..(seq_length - 1)).each do |position|
    all_base = []
    seq_array.each do |seq|
      if seq[position]
        all_base << seq[position]
      end
    end
    base_count = all_base.count_freq
    max_base_list = []

    if cutoff.zero?
      max_count = base_count.values.max
      max_base_hash = base_count.select {|_k,v| v == max_count}
      max_base_list = max_base_hash.keys
    else
      base_count.each do |k,v|
        if v/seq_size.to_f >= cutoff
          max_base_list << k
        end
      end
    end
    consensus_seq += call_consensus_base(max_base_list)
  end
  return consensus_seq
end

#drm(region_config) ⇒ `Object`

function to interpret HIV drms with ViralSeq::DrmRegionConfig as a param.

# File 'lib/viral_seq/hivdr.rb', line 560

def drm(region_config)
  region = region_config.region
  fdr_hash = self.fdr # must run fdr before the completion of the sequences

  complete_gene = self.complete_with_ref(region_config)
  sequences = complete_gene.dna_hash

  n_seq = sequences.size
  aa = {}
  mut = {}
  mut_com = []
  point_mutation_list = []

  drm_list = region_config.drm_list

  sequences.each do |name, seq|
    s = ViralSeq::Sequence.new(name, seq)
    s.translate
    aa[name] = s.aa_string

    records_per_seq = {}

    drm_list.each do |drm_class, list|

      mut[drm_class] = {}  if !mut[drm_class]

      record = s.check_drm(list)
      records_per_seq = records_per_seq.merge(record)

      record.each do |position, mutation|
        if !mut[drm_class][position]
          mut[drm_class][position] = [mutation[0],[]]
        end
        mut[drm_class][position][1] << mutation[1]
      end
    end

    mut_com << records_per_seq.sort.to_h
  end

  mut.each do |drm_class, mutations|
    mutations.each do |position, mutation|
      wt = mutation[0]
      mut_list = mutation[1]
      count_mut_list = mut_list.count_freq
      count_mut_list.each do |m,number|
        ci = ViralSeq::Math::BinomCI.new(number, n_seq)
        fdr = fdr_hash[number].round(5)
        label = fdr >= 0.05 ? "*" : ""
        point_mutation_list << [drm_class, n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
      end
    end
  end

  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  aa_start = 1

  aa_size = aa.values[0].size - 1

  (0..aa_size).to_a.each do |p|
    aas = []
    aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]

end

#error_table(ref = self.consensus, head = true) ⇒ `Array`

return an table of frequencies of nucleotides at each position.

Examples:

error table for an array of sequences

array = %w{ AACCGGTT
            AGCCGGTT
            AACTGCTT
            AACCGTTA
            AACCGGTA }
my_seqhash = ViralSeq::SeqHash.array(array)
my_seqhash.error_table.each {|r| puts r.join(',')}
  position,consensus,total_seq_number,A,C,G,T
  1,A,5,-,,,
  2,A,5,-,,0.2,
  3,C,5,,-,,
  4,C,5,,-,,0.2
  5,G,5,,,-,
  6,G,5,,0.2,-,0.2
  7,T,5,,,,-
  8,T,5,0.4,,,-

Parameters:

ref (String) (defaults to: self.consensus) —

a reference sequence to compare with, default as the sample consensus sequence
head (Boolean) (defaults to: true) —

if the head of table is included.

Returns:

(Array) —

a two-dimension array of the frequency table, including the following info:

position on the sequence (starting from 1)
consensus nucleotide
total sequence numbers
percentage of A, shows "-" if agrees with consensus
percentage of C, shows "-" if agrees with consensus
percentage of G, shows "-" if agrees with consensus
percentage of T, shows "-" if agrees with consensus

# File 'lib/viral_seq/seq_hash.rb', line 1246

def error_table(ref = self.consensus, head = true)

  table = []
  if head
    table << %w{
      position
      consensus
      total_seq_number
      A
      C
      G
      T
    }
  end
  ref_size = ref.size

  (0..(ref_size - 1)).each do |position|
    ref_base = ref[position]
    nts = []

    self.dna_hash.each do |_k,v|
      nts << v[position]
    end

    freq = nts.count_freq
    freq2 = {}

    freq.each do |nt,c|
      if nt == ref_base
        freq2[nt] = '-'
      else
        freq2[nt] = (c/(self.size).to_f)
      end
    end

    table << [(position + 1),ref_base,self.size,freq2['A'],freq2['C'],freq2['G'],freq2['T']]
  end

  return table

end

#fdr(error_rate = 0.0001) ⇒ `Hash`

calculate false detection rate for minority mutations Credit: Prof. Michael G. Hudgens from UNC-CH for providing the method for fdr calculation

Examples:

calculate FDR for mutations that appeared twice in the sample dataset

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_sequence_for_poisson.fasta')
fdr_hash = my_seqhash.fdr
fdr_hash[2].round(5)
=> 0.00726 # means that mutations appear twice have 0.007261748 chance to be caused by residual errors.

Parameters:

error_rate (Float) (defaults to: 0.0001) —

estimated sequencing error rate

Returns:

(Hash) —

pair of mutation frequency to false detection rate. (freq => fdr)

# File 'lib/viral_seq/seq_hash.rb', line 632

def fdr(error_rate = 0.0001)
  sequences = self.dna_hash.values
  if sequences.size == 0
    return {}
  else
    seq_count = self.size
    observed_hash = variant_for_poisson(sequences)
    p_unadjusted = []
    observed_hash.each do |k, v|
      p_value = 1 - `Rscript -e "cat(pbinom(#{k}-1, #{seq_count}, #{error_rate}))"`.to_f # compute unadjusted exact p-value, ie under null, probability of observing observed_hash[k] or more extreme
      p_unadjusted += Array.new(v, p_value)
    end
    p_fdr = `Rscript -e "cat(p.adjust(c(#{p_unadjusted.join(',')}), 'fdr'))"`.split("\s").count_freq.to_a # controls fdr. aka Benjamini-Hochberg correction
    vars_pair = observed_hash.to_a
    fdr_hash = Hash.new(0)
    (0..(p_fdr.size - 1)).each do |i|
      fdr_hash[vars_pair[i][0]] = p_fdr[i][0].to_f
    end
    return fdr_hash
  end
end

#filter_for_drm(region_config) ⇒ `Object`

wrapper function for #a3g_hypermut and #stop_codon with ViralSeq::DrmRegionConfig as a param.

# File 'lib/viral_seq/hivdr.rb', line 466

def filter_for_drm(region_config)
  seq_coord = region_config.seq_coord
  reading_frame_number = region_config.get_reading_frame_number

  if !seq_coord["gap"]

    a3g_check = self.a3g
    a3g_seqs = a3g_check[:a3g_seq]
    a3g_filtered_seqs = a3g_check[:filtered_seq]

    stop_codon_check = a3g_filtered_seqs.stop_codon(reading_frame_number[0])
    stop_codon_seqs = stop_codon_check[:with_stop_codon]
    filtered_seqs = stop_codon_check[:without_stop_codon]

    return {
      filtered_seq: filtered_seqs,
      a3g_seq: a3g_seqs,
      stop_codon_seq: stop_codon_seqs
    }

  else

    r1_length, r2_length = region_config.r1_r2_length.values

    r1_seqs = {}
    r2_seqs = {}

    self.dna_hash.each do |k,v|
        r1_seqs[k] = v[0,r1_length]
        r2_seqs[k] = v[r1_length, r2_length]
    end

    r1_sh = ViralSeq::SeqHash.new(r1_seqs)
    r2_sh = ViralSeq::SeqHash.new(r2_seqs)

    a3g_seqs_r1 = r1_sh.a3g[:a3g_seq]
    a3g_seqs_r2 = r2_sh.a3g[:a3g_seq]

    stop_codon_r1 = r1_sh.stop_codon(reading_frame_number[0])[:with_stop_codon]
    stop_codon_r2 = r2_sh.stop_codon(reading_frame_number[1])[:with_stop_codon]

    a3g_seq_keys = (a3g_seqs_r1.dna_hash.keys | a3g_seqs_r2.dna_hash.keys)
    a3g_seqs = ViralSeq::SeqHash.new(self.dna_hash.select {|k, _v| a3g_seq_keys.include? k})

    stop_codon_keys = (stop_codon_r1.dna_hash.keys | stop_codon_r2.dna_hash.keys)
    stop_codon_seqs = ViralSeq::SeqHash.new(self.dna_hash.select {|k, _v| stop_codon_keys.include? k})

    reject_keys = (a3g_seq_keys | stop_codon_keys)

    filtered_seqs = ViralSeq::SeqHash.new(self.dna_hash.reject { |k, _v| reject_keys.include? k })

    return {
      filtered_seq: filtered_seqs,
      a3g_seq: a3g_seqs,
      stop_codon_seq: stop_codon_seqs
    }

  end

end

#filter_similar_pid(cutoff = 10) ⇒ `ViralSeq::SeqHash`

Remove sequences with residual offspring Primer IDs.

Compare PID with sequences which have identical sequences.
PIDs differ by 1 base will be recognized. If PID1 is x time (cutoff) greater than PID2, PID2 will be disgarded.
  each sequence tag starting with ">" and the Primer ID sequence
  followed by the number of Primer ID appeared in the raw sequence
  the information sections in the tags are separated by underscore "_"
  example sequence tag: >AGGCGTAGA_32_sample1_RT

Parameters:

cutoff (Integer) (defaults to: 10) —

the fold cut-off to remove the potential residual offspring Primer IDs

Returns:

(ViralSeq::SeqHash) —

a new SeqHash object without sqeuences containing residual offspring Primer ID

# File 'lib/viral_seq/seq_hash.rb', line 968

def filter_similar_pid(cutoff = 10)
  seq = self.dna_hash.dup
  uni_seq = seq.values.uniq
  uni_seq_pid = {}
  uni_seq.each do |k|
    seq.each do |name,s|
      name = name[1..-1]
      if k == s
        if uni_seq_pid[k]
          uni_seq_pid[k] << [name.split("_")[0],name.split("_")[1]]
        else
          uni_seq_pid[k] = []
          uni_seq_pid[k] << [name.split("_")[0],name.split("_")[1]]
        end
      end
    end
  end

  dup_pid = []
  uni_seq_pid.values.each do |v|
    next if v.size == 1
    pid_hash = Hash[v]
    list = pid_hash.keys
    list2 = Array.new(list)
    pairs = []

    list.each do |k|
      list2.delete(k)
      list2.each do |k1|
        pairs << [k,k1]
      end
    end

    pairs.each do |p|
      pid1 = p[0]
      pid2 = p[1]
      if pid1.compare_with(pid2) <= 1
        n1 = pid_hash[pid1].to_i
        n2 = pid_hash[pid2].to_i
        if n1 >= cutoff * n2
          dup_pid << pid2
        elsif n2 >= cutoff * n1
          dup_pid << pid1
        end
      end
    end
  end

  new_seq = {}
  seq.each do |name,s|
    pid = name.split("_")[0][1..-1]
    unless dup_pid.include?(pid)
      new_seq[name] = s
    end
  end
  self.sub(new_seq.keys)
end

#gap_strip(option = :nt) ⇒ `ViralSeq::SeqHash`

gap strip from a sequence alignment, all positions that contains gaps (‘-’) will be removed

Examples:

gap strip for an array of sequences

array = ["AACCGGTT", "A-CCGGTT", "AAC-GGTT", "AACCG-TT", "AACCGGT-"]
array = %w{ AACCGGTT
            A-CCGGTT
            AAC-GGTT
            AACCG-TT
            AACCGGT- }
my_seqhash = ViralSeq::SeqHash.array(array)
puts my_seqhash.gap_strip.dna_hash.values
  ACGT
  ACGT
  ACGT
  ACGT
  ACGT

Parameters:

option (Symbol) (defaults to: :nt) —

sequence options for ‘:nt` or `:aa`

Returns:

(ViralSeq::SeqHash) —

a new SeqHash object containing nt or aa sequences without gaps

# File 'lib/viral_seq/seq_hash.rb', line 1085

def gap_strip(option = :nt)
  if option == :nt
    sequence_alignment = self.dna_hash
  elsif option == :aa
    sequence_alignment = self.aa_hash
  else
    raise "Option `#{option}` not recognized"
  end

  new_seq = {}
  seq_size = sequence_alignment.values[0].size
  seq_matrix = {}
  (0..(seq_size - 1)).each do |p|
    seq_matrix[p] = []
    sequence_alignment.values.each do |s|
      seq_matrix[p] << s[p]
    end
  end

  seq_matrix.delete_if do |_p, list|
    list.include?("-")
  end

  sequence_alignment.each do |n,s|
    new_s = ""
    seq_matrix.keys.each {|p| new_s += s[p]}
    new_seq[n] = new_s
  end
  new_seq_hash = ViralSeq::SeqHash.new
  if option == :nt
    new_seq_hash.dna_hash = new_seq
    new_seq_hash.aa_hash = self.aa_hash
  elsif option == :aa
    new_seq_hash.dna_hash = self.dna_hash
    new_seq_hash.aa_hash = new_seq
  end
  new_seq_hash.qc_hash = self.qc_hash
  new_seq_hash.title = self.title + "_strip"
  new_seq_hash.file = self.file
  return new_seq_hash
end

#gap_strip_ends(option = :nt) ⇒ `ViralSeq::SeqHash`

gap strip from a sequence alignment at both ends, only positions at the ends that contains gaps (‘-’) will be removed.

Examples:

gap strip for an array of sequences only at the ends

array = %w{ AACCGGTT
            A-CCGGTT
            AAC-GGTT
            AACCG-TT
            AACCGGT- }
my_seqhash = ViralSeq::SeqHash.array(array)
puts my_seqhash.gap_strip_ends.dna_hash.values
  AACCGGT
  A-CCGGT
  AAC-GGT
  AACCG-T
  AACCGGT

Parameters:

option (Symbol) (defaults to: :nt) —

sequence options for ‘:nt` or `:aa`

Returns:

(ViralSeq::SeqHash) —

a new SeqHash object containing nt or aa sequences without gaps at the ends

# File 'lib/viral_seq/seq_hash.rb', line 1144

def gap_strip_ends(option = :nt)
  if option == :nt
    sequence_alignment = self.dna_hash
  elsif option == :aa
    sequence_alignment = self.aa_hash
  else
    raise "Option #{option} not recognized"
  end
  new_seq = {}
  seq_size = sequence_alignment.values[0].size
  seq_matrix = {}
  (0..(seq_size - 1)).each do |p|
    seq_matrix[p] = []
    sequence_alignment.values.each do |s|
      seq_matrix[p] << s[p]
    end
  end
  n1 = 0
  n2 = 0
  seq_matrix.each do |_p, list|
    if list.include?("-")
      n1 += 1
    else
      break
    end
  end

  seq_matrix.keys.reverse.each do |p|
    list = seq_matrix[p]
    if list.include?("-")
      n2 += 1
    else
      break
    end
  end

  sequence_alignment.each do |n,s|
    new_s = s[n1..(- n2 - 1)]
    new_seq[n] = new_s
  end
  new_seq_hash = ViralSeq::SeqHash.new
  if option == :nt
    new_seq_hash.dna_hash = new_seq
    new_seq_hash.aa_hash = self.aa_hash
  elsif option == :aa
    new_seq_hash.dna_hash = self.dna_hash
    new_seq_hash.aa_hash = new_seq
  end
  new_seq_hash.qc_hash = self.qc_hash
  new_seq_hash.title = self.title + "_strip"
  new_seq_hash.file = self.file
  return new_seq_hash
end

#hiv_seq_qc(start_nt, end_nt, indel = true, ref_option = :HXB2, path_to_muscle = false) ⇒ `ViralSeq::SeqHash`

quality check for HIV sequences based on ViralSeq::Sequence#locator, check if sequences are in the target range

Examples:

QC for sequences in a FASTA files

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_seq.fasta')
filtered_seqhash = my_seqhash.hiv_seq_qc([4384,4386], 4750..4752, false, :HXB2)
my_seqhash.dna_hash.size
=> 6
filtered_seqhash.dna_hash.size
=> 4

Parameters:

start_nt (Integer, Range, Array) —

start nt position(s) on the refernce genome, can be single number (Integer) or a range of Integers (Range), or an Array
end_nt (Integer, Range, Array) —

end nt position(s) on the refernce genome,can be single number (Integer) or a range of Integers (Range), or an Array
indel (Boolean) (defaults to: true) —

allow indels or not, ‘ture` or `false`
ref_option (Symbol) (defaults to: :HXB2) —

, name of reference genomes, options are ‘:HXB2`, `:NL43`, `:MAC239`
path_to_muscle (String) (defaults to: false) —

, path to the muscle executable, if not provided, use MuscleBio to run Muscle

Returns:

(ViralSeq::SeqHash) —

a new ViralSeq::SeqHash object with only the sequences that meet the QC criterias

# File 'lib/viral_seq/seq_hash.rb', line 883

def hiv_seq_qc(start_nt, end_nt, indel=true, ref_option = :HXB2, path_to_muscle = false)
  start_nt = start_nt..start_nt if start_nt.is_a?(Integer)
  end_nt = end_nt..end_nt if end_nt.is_a?(Integer)
  seq_hash = self.dna_hash.dup
  seq_hash_unique = seq_hash.values.uniq
  seq_hash_unique_pass = []

  seq_hash_unique.each do |seq|
    next if seq.nil?
    loc = ViralSeq::Sequence.new('', seq).locator(ref_option, path_to_muscle)
    next unless loc # if locator tool fails, skip this seq.
    if start_nt.include?(loc[0]) && end_nt.include?(loc[1])
      if indel
        seq_hash_unique_pass << seq
      elsif loc[3] == false
        seq_hash_unique_pass << seq
      end
    end
  end
  seq_pass = []
  seq_hash_unique_pass.each do |seq|
    seq_hash.each do |seq_name, orginal_seq|
      if orginal_seq == seq
        seq_pass << seq_name
        seq_hash.delete(seq_name)
      end
    end
  end
  self.sub(seq_pass)
end

#mutation(error_rate = 0.01) ⇒ `ViralSeq::SeqHash`

mutate @dna_hash based on the error_rate

Parameters:

error_rate (Float) (defaults to: 0.01) —

error rate used to mutate sequences.

Returns:

(ViralSeq::SeqHash) —

new SeqHash object of mutated sequences.

# File 'lib/viral_seq/seq_hash.rb', line 1203

def mutation(error_rate = 0.01)
  new_seqhash = ViralSeq::SeqHash.new
  dna = {}
  self.dna_hash.each do |name, seq|
    dna[name + '_mut-' + error_rate.to_s] = seq.mutation(error_rate)
  end
  new_seqhash.dna_hash = dna
  new_seqhash.title = self.title + "_mut-" + error_rate.to_s
  new_seqhash.file = self.file
  return new_seqhash
end

#nt_range(range) ⇒ `ViralSeq::SeqHash`

return a new SeqHash object with given a range on the nt sequence position

Parameters:

range (Range) —

range of positions on the nt sequence

Returns:

(ViralSeq::SeqHash) —

a sub SeqHash object

# File 'lib/viral_seq/seq_hash.rb', line 215

def nt_range(range)
  dna_hash = self.dna_hash
  new_hash = {}
  dna_hash.each do |k,v|
    new_hash[k] = v[range]
  end
  ViralSeq::SeqHash.new(new_hash)
end

#nt_variants ⇒ `Hash`

analysis for the nt sequence variants.

Returns:

(Hash) —

An Hash with information of variant analysis. Key/values of the return object see /docs/variants_structure.pdf

# File 'lib/viral_seq/seq_hash.rb', line 657

def nt_variants
  return_obj = {}
  nt_hash = self.dna_hash
  tcs_number = self.size
  dl = ViralSeq::TcsCore.detection_limit(tcs_number)
  fdr_hash = self.fdr
  pm_cut_off = self.pm
  con = self.consensus
  return_obj[:tcs_number] = tcs_number
  return_obj[:lower_detection_limit] = dl
  return_obj[:pm_cut_off] = pm_cut_off
  return_obj[:positions] = []
  cis = {}

  (0..(con.size - 1)).each do |p|
    position_obj = {}
    position_obj[:position] = p + 1
    position_obj[:tcs_number] = tcs_number
    position_obj[:lower_detection_limit] = dl
    position_obj[:pm_cut_off] = (pm_cut_off == Float::INFINITY ? pm_cut_off.to_s : pm_cut_off)

    nts = []
    dna_hash.each do |n,s|
      nts << s[p]
    end
    freq_hash = nts.count_freq
    [:A, :C, :G, :T, :-].each do |k|
      v = freq_hash[k.to_s]
      position_obj[k] = {}
      position_obj[k][:count] = v
      if v > 0
        if cis[[v, tcs_number]]
          ci = cis[[v, tcs_number]]
        else
          ci = ViralSeq::Math::BinomCI.new(v, tcs_number)
          cis[[v, tcs_number]] = ci
        end
        position_obj[k][:freq] = ci.mean.round(4)
        position_obj[k][:freq_ci_low] = ci.lower.round(4)
        position_obj[k][:freq_ci_high] = ci.upper.round(4)
        position_obj[k][:greater_than_pm] = (v >= pm_cut_off ? true : false)
        position_obj[k][:fdr] = fdr_hash[v]
      else
        position_obj[k][:freq] = 0
        position_obj[k][:freq_ci_low] = 0
        position_obj[k][:freq_ci_high] = 0
        position_obj[k][:greater_than_pm] = false
        position_obj[k][:fdr] = nil
      end
    end

    return_obj[:positions] << position_obj
  end

  return_obj
end

#nucleotide_pi ⇒ `Float` Also known as: pi

Function to calculate nucleotide diversity π, for nt sequence only

Examples:

calculate π

sequences = %w{ AAGGCCTT ATGGCCTT AAGGCGTT AAGGCCTT AACGCCTT AAGGCCAT }
my_seqhash = ViralSeq::SeqHash.array(sequences)
my_seqhash.pi
=> 0.16667

Returns:

(Float) —

nucleotide diversity π

See Also:

Nucleotide Diversity (Wikipedia)

# File 'lib/viral_seq/seq_hash.rb', line 802

def nucleotide_pi
  sequences = self.dna_hash.values
  seq_length = sequences[0].size - 1
  nt_position_hash = {}
  (0..seq_length).each do |n|
    nt_position_hash[n] = []
    sequences.each do |s|
      nt_position_hash[n] << s[n]
    end
  end
  diver = 0
  com = 0
  nt_position_hash.each do |_p,nt|
    nt.delete_if {|n| n =~ /[^A|^C|^G|^T]/}
    next if nt.size == 1
    nt_count = nt.count_freq
    combination = (nt.size)*(nt.size - 1)/2
    com += combination
    a = nt_count["A"]
    c = nt_count["C"]
    t = nt_count["T"]
    g = nt_count["G"]
    div = a*c + a*t + a*g + c*t + c*g + t*g
    diver += div
  end
  pi = (diver/com.to_f).round(5)
  return pi
end

#poisson_minority_cutoff(error_rate = 0.0001, fold_cutoff = 20) ⇒ `Integer` Also known as: pm

Define Poission cut-off for minority variants.

Examples:

obtain Poisson minority cut-off from the example sequence FASTA file.

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_sequence_for_poisson.fasta')
my_seqhash.pm
=> 2 # means that mutations appear at least 2 times are very likely to be a true mutation instead of random methods errors.

Parameters:

error_rate (Float) (defaults to: 0.0001) —

estimated sequencing error rate
fold_cutoff (Integer) (defaults to: 20) —

a fold cut-off to determine poisson minority cut-off. default = 20. i.e. <5% mutations from random methods error.

Returns:

(Integer) —

a cut-off for minority variants (>=).

See Also:

Ref: Zhou, et al. J Virol 2015

# File 'lib/viral_seq/seq_hash.rb', line 596

def poisson_minority_cutoff(error_rate = 0.0001, fold_cutoff = 20)
  sequences = self.dna_hash.values
  if sequences.size == 0
    return 0
  else
    cut_off = Float::INFINITY
    l = sequences[0].size
    rate = sequences.size * error_rate
    count_mut = variant_for_poisson(sequences)
    max_count = count_mut.keys.max
    poisson_hash = ViralSeq::Math::PoissonDist.new(rate, max_count).poisson_hash

    poisson_hash.each do |k,v|
      cal = l * v
      obs = count_mut[k] ? count_mut[k] : 1
      if obs >= fold_cutoff * cal
        cut_off = k
        break
      end
    end
    return cut_off
  end
end

#qc_indel ⇒ `Hash`

QC for each nucleotide sequence comparing with sample consensus for indels

Returns:

(Hash) —

object containing two SeqHash seq_hash, has_indel: seq_hash

# File 'lib/viral_seq/seq_hash.rb', line 1313

def qc_indel
  con = self.consensus
  dna_hash = self.dna_hash
  names_passed = []
  names_indel = []
  dna_hash.uniq_hash.each do |seq, names|
    if seq.compare_with(con) < 4
      names_passed += names
    elsif ViralSeq::Muscle.align(con, seq)[0]["-"]
      names_indel += names
    else
      names_passed += names
    end
  end
  return {no_indel: self.sub(names_passed),
    has_indel: self.sub(names_indel)}
end

#random_select(n = 100) ⇒ `ViralSeq::SeqHash`

randomly select n number of sequences from the orginal SeqHash object

Parameters:

n (Integer) (defaults to: 100) —

number of sequences to randomly select

Returns:

(ViralSeq::SeqHash) —

a new SeqHash object with randomly selected sequences

# File 'lib/viral_seq/seq_hash.rb', line 1292

def random_select(n = 100)
  new_sh = ViralSeq::SeqHash.new
  dna_hash = self.dna_hash
  aa_hash = self.aa_hash
  qc_hash = self.qc_hash

  keys = dna_hash.keys.sample(n)

  keys.each do |k|
    new_sh.dna_hash[k] = dna_hash[k]
    new_sh.aa_hash[k] = aa_hash[k]
    new_sh.qc_hash[k] = qc_hash[k]
  end
  new_sh.file = self.file
  new_sh.title = self.title + "_" + n.to_s
  return new_sh
end

#sample(n = 1) ⇒ `ViralSeq::SeqHash`

sample a certain number of sequences from a SeqHash object

Parameters:

n (Integer) (defaults to: 1) —

number of sequences to sample

Returns:

(ViralSeq::SeqHash) —

sampled SeqHash

# File 'lib/viral_seq/seq_hash.rb', line 196

def sample(n = 1)
  keys = self.dna_hash.keys
  sampled_keys = keys.sample(n)
  sampled_nt = {}
  sampled_aa = {}
  sampled_qc = {}
  sampled_title = self.title + "_sampled_" + n.to_s
  sampled_keys.each do |k|
    sampled_nt[k] = self.dna_hash[k]
    sampled_aa[k] = self.aa_hash[k]
    sampled_qc[k] = self.qc_hash[k]
  end
  return ViralSeq::SeqHash.new(sampled_nt, sampled_aa, sampled_qc, sampled_title, self.file)
end

#sdrm_hiv_in(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ `Array`

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV IN region.

works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
IN codon 53-174

Parameters:

cutoff (Integer) (defaults to: 0) —

cut-off for minimal abundance of a mutation to be called as valid mutation, can be obtained using ViralSeq::SeqHash#poisson_minority_cutoff function
fdr (Hash) —

hash of events => (false detecton rate) can be obtained using ViralSeq::SeqHash#fdr

Returns:

(Array) —
three elements ‘[point_mutation_list, linkage_list, report_list]`

# point_mutation_list: two demensional array for the following information,
```
# [region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label]
```
# linkage_list: two demensional array for the following information,
```
# [region,tcs_number,linkage,count,%,CI_low,CI_high,label]
```
# report_list: two demensional array for the following information,
```
# [position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*]
```

# File 'lib/viral_seq/hivdr.rb', line 375

def sdrm_hiv_in(cutoff = 0, fdr_hash = Hash.new(0))
  sequences = self.dna_hash
  region = "IN"
  rf_label = 2
  start_codon_number = 53
  n_seq = sequences.size
  mut = {}
  mut_com = []
  aa = {}
  point_mutation_list = []
  sequences.each do |name,seq|
    s = ViralSeq::Sequence.new(name,seq)
    s.translate(rf_label)
    aa[name] = s.aa_string
    record = s.sdrm(:INSTI, start_codon_number)
    mut_com << record
    record.each do |position,mutation|
      if mut[position]
        mut[position][1] << mutation[1]
      else
        mut[position] = [mutation[0],[]]
        mut[position][1] << mutation[1]
      end
    end
  end

  mut.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      fdr = fdr_hash[number].round(5)
      label = number < cutoff ? "*" : ""
      point_mutation_list << [region, n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
    end
  end
  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = v < cutoff ? "*" : ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  aa_start = start_codon_number

  aa_size = aa.values[0].size - 1

  (0..aa_size).to_a.each do |p|
    aas = []
    aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]
end

#sdrm_hiv_pr(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ `Array`

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV PR region.

works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
PR codon 1-99
RT codon 34-122 (HXB2 2649-2914) and 152-236(3001-3257)
IN codon 53-174 (HXB2 4384-4751)

Examples:

identify SDRMs from a FASTA sequence file of HIV PR sequences obtained after MPID-DR sequencing

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_dr_sequences/pr.fasta')
p_cut_off = my_seqhash.pm
fdr_hash = my_seqhash.fdr
pr_sdrm = my_seqhash.sdrm_hiv_pr(p_cut_off, fdr_hash)
puts "region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label"; pr_sdrm[0].each {|n| puts n.join(',')}
=> region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label
=> PR,396,30,D,N,247,0.62374,0.57398,0.67163,0,
=> PR,396,50,I,V,1,0.00253,6.0e-05,0.01399,0.18905,*
=> PR,396,88,N,D,246,0.62121,0.57141,0.66919,0,

puts "region,tcs_number,linkage,count,%,CI_low,CI_high,label"; pr_sdrm[1].each {|n| puts n.join(',')}
=> region,tcs_number,linkage,count,%,CI_low,CI_high,label
=> PR,396,D30N+N88D,245,0.61869,0.56884,0.66674,
=> PR,396,WT,149,0.37626,0.32837,0.42602,
=> PR,396,D30N,1,0.00253,6.0e-05,0.01399,*
=> PR,396,D30N+I50V+N88D,1,0.00253,6.0e-05,0.01399,*

puts "position,codon,tcs_number," + ViralSeq::AMINO_ACID_LIST.join(","); pr_sdrm[2].each {|n|puts n.join(",")}
=> position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*
=> PR,1,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,2,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,3,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,4,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,5,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,6,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0
=> PR,7,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,8,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,9,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,10,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,11,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,12,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.8788,62.1212,0.0,0.0,0.0,0.0
=> PR,13,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38.1313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,61.8687,0.0,0.0,0.0
=> PR,14,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,15,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.3737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.6263,0.0,0.0,0.0
=> PR,16,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,17,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,18,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.4949,0.5051,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,19,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,20,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,21,396,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,22,396,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,23,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,24,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,25,396,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,26,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,27,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,28,396,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,29,396,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,30,396,0.0,0.0,37.6263,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.3737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,31,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,32,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,33,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0
=> PR,34,396,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,35,396,0.0,0.0,62.1212,37.6263,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,36,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0
=> PR,37,396,0.0,0.0,37.8788,61.8687,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,38,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,39,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.4949,0.0,0.0,0.5051,0.0,0.0,0.0,0.0,0.0
=> PR,40,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,41,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,62.1212,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,42,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0
=> PR,43,396,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,44,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,45,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,46,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,47,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,48,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,49,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,50,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0
=> PR,51,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,52,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,53,396,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,54,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,55,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,56,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,57,396,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,99.4949,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,58,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,59,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0
=> PR,60,396,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,61,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,62,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,63,396,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,37.8788,0.0,0.0,61.8687,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,64,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.1212,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,65,396,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,66,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,67,396,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,68,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,69,396,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,70,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,71,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.1212,37.8788,0.0,0.0,0.0
=> PR,72,396,0.0,0.0,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.1212,0.0,0.0,0.0,0.0
=> PR,73,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,74,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,75,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,76,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,77,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,78,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,79,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,80,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,81,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,82,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,83,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.4949,0.0,0.0,0.0,0.5051,0.0,0.0,0.0,0.0,0.0
=> PR,84,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,85,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,86,396,0.0,0.0,0.0,0.5051,0.0,99.4949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,87,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,88,396,0.0,0.0,62.1212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,89,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,90,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,91,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,92,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,93,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,94,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,95,396,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,96,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,97,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,98,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0
=> PR,99,396,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

Parameters:

cutoff (Integer) (defaults to: 0) —

cut-off for minimal abundance of a mutation to be called as valid mutation, can be obtained using ViralSeq::SeqHash#poisson_minority_cutoff function
fdr (Hash) —

hash of events => (false detecton rate) can be obtained using ViralSeq::SeqHash#fdr

Returns:

(Array) —
three elements ‘[point_mutation_list, linkage_list, report_list]`

# point_mutation_list: two demensional array for the following information,
```
# [region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label]
```
# linkage_list: two demensional array for the following information,
```
# [region,tcs_number,linkage,count,%,CI_low,CI_high,label]
```
# report_list: two demensional array for the following information,
```
# [position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*]
```

# File 'lib/viral_seq/hivdr.rb', line 143

def sdrm_hiv_pr(cutoff = 0, fdr_hash = Hash.new(0))
  sequences = self.dna_hash
  region = "PR"
  rf_label = 0
  start_codon_number = 1
  n_seq = sequences.size
  mut = {}
  mut_com = []
  aa = {}
  point_mutation_list = []
  sequences.each do |name,seq|
    s = ViralSeq::Sequence.new(name,seq)
    s.translate(rf_label)
    aa[name] = s.aa_string
    record = s.sdrm(:PI)
    mut_com << record
    record.each do |position,mutation|
      if mut[position]
        mut[position][1] << mutation[1]
      else
        mut[position] = [mutation[0],[]]
        mut[position][1] << mutation[1]
      end
    end
  end
  mut.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      fdr = fdr_hash[number].round(5)
      label = number < cutoff ? "*" : ""
      point_mutation_list << [region, n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
    end
  end
  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = v < cutoff ? "*" : ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  aa_start = start_codon_number

  aa_size = aa.values[0].size - 1

  (0..aa_size).to_a.each do |p|
    aas = []
    aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]
end

#sdrm_hiv_rt(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ `Array`

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV RT region.

works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
RT codon 34-122, 152-236, two regions are linked

Parameters:

cutoff (Integer) (defaults to: 0) —

cut-off for minimal abundance of a mutation to be called as valid mutation, can be obtained using ViralSeq::SeqHash#poisson_minority_cutoff function
fdr (Hash) —

hash of events => (false detecton rate) can be obtained using ViralSeq::SeqHash#fdr

Returns:

(Array) —
three elements ‘[point_mutation_list, linkage_list, report_list]`

# point_mutation_list: two demensional array for the following information,
```
# [region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label]
```
# linkage_list: two demensional array for the following information,
```
# [region,tcs_number,linkage,count,%,CI_low,CI_high,label]
```
# report_list: two demensional array for the following information,
```
# [position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*]
```

# File 'lib/viral_seq/hivdr.rb', line 237

def sdrm_hiv_rt(cutoff = 0, fdr_hash = Hash.new(0))
  sequences = self.dna_hash
  region = "RT"
  rf_label = 1
  start_codon_number = 34
  gap = "AGACTTCAGGAAGTATACTGCATTTACCATACCTAGTATAAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGCTTCCAC"

  n_seq = sequences.size
  mut_nrti = {}
  mut_nnrti = {}
  mut_com = []
  r1_aa = {}
  r2_aa = {}
  point_mutation_list = []
  sequences.each do |name,seq|
    r1 = seq[0,267]
    r2 = seq[267..-1]
    seq = r1 + gap + r2
    s = ViralSeq::Sequence.new(name,seq)
    s.translate(rf_label)

    r1_aa[name] = s.aa_string[0,89]
    r2_aa[name] = s.aa_string[-85..-1]
    nrti = s.sdrm(:nrti, start_codon_number)
    nnrti = s.sdrm(:nnrti, start_codon_number)
    mut_com << (nrti.merge(nnrti))

    nrti.each do |position,mutation|
      if mut_nrti[position]
        mut_nrti[position][1] << mutation[1]
      else
        mut_nrti[position] = [mutation[0],[]]
        mut_nrti[position][1] << mutation[1]
      end
    end
    nnrti.each do |position,mutation|
      if mut_nnrti[position]
        mut_nnrti[position][1] << mutation[1]
      else
        mut_nnrti[position] = [mutation[0],[]]
        mut_nnrti[position][1] << mutation[1]
      end
    end
  end

  mut_nrti.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      fdr = fdr_hash[number].round(5)
      label = number < cutoff ? "*" : ""
      point_mutation_list << ["NRTI", n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
    end
  end

  mut_nnrti.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      fdr = fdr_hash[number].round(5)
      label = number < cutoff ? "*" : ""
      point_mutation_list << ["NNRTI", n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
    end
  end

  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = v < cutoff ? "*" : ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  r1_aa_start = 34
  r2_aa_start = 152

  r1_aa_size = r1_aa.values[0].size - 1
  r2_aa_size = r2_aa.values[0].size - 1

  (0..r1_aa_size).to_a.each do |p|
    aas = []
    r1_aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[r1_aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    r1_aa_start += 1
  end

  (0..r2_aa_size).to_a.each do |p|
    aas = []
    r2_aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[r2_aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    r2_aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]
end

#sequence_locator(ref_option = :HXB2) ⇒ `Array` Also known as: loc

sequence locator for SeqHash object, resembling HIV Sequence Locator from LANL

Parameters:

ref_option (Symbol) (defaults to: :HXB2) —

, name of reference genomes, options are ‘:HXB2`, `:NL43`, `:MAC239`

Returns:

(Array) —

two dimensional array ‘[[],[],[],…]` for each sequence, including the following information:

title of the SeqHash object (String)

sequence taxa (String)

start_location (Integer)

end_location (Integer)

percentage_of_similarity_to_reference_sequence (Float)

containing_indel? (Boolean)

direction (‘forward’ or ‘reverse’)

aligned_input_sequence (String)

aligned_reference_sequence (String)

See Also:

LANL Sequence Locator

# File 'lib/viral_seq/seq_hash.rb', line 936

def sequence_locator(ref_option = :HXB2)
  out_array = []
  dna_seq = self.dna_hash
  title = self.title

  uniq_dna = dna_seq.uniq_hash

  uniq_dna.each do |seq,names|
    s = ViralSeq::Sequence.new('',seq)
    loc1 = s.locator(ref_option)
    s.rc!
    loc2 = s.locator(ref_option)
    loc1[2] >= loc2[2] ? (direction = :+; loc = loc1): (direction = :-; loc = loc2)

    names.each do |name|
      out_array << ([title, name, ref_option.to_s, direction.to_s] + loc)
    end
  end
  return out_array
end

#shannons_entropy(option = :nt) ⇒ `Hash`

calculate Shannon’s entropy, Euler’s number as the base of logarithm

Examples:

caculate entropy from the example file

sequence_file = 'spec/sample_files/sample_sequence_alignment_for_entropy.fasta'
sequence_hash = ViralSeq::SeqHash.aa_fa(sequence_file)
entropy_hash = sequence_hash.shannons_entropy(:aa)
entropy_hash[3]
=> 0.0
entropy_hash[14].round(3)
=> 0.639
# This example is the sample input of LANL Entropy-One
# https://www.hiv.lanl.gov/content/sequence/ENTROPY/entropy_one.html?sample_input=1

Parameters:

option (Symbol) (defaults to: :nt) —

the sequence type ‘:nt` or `:aa`

Returns:

(Hash) —

entropy score at each position in the alignment :position => :entropy , # position starts at 1.

See Also:

Entropy(Wikipedia)

# File 'lib/viral_seq/seq_hash.rb', line 768

def shannons_entropy(option = :nt)
  sequences = if option == :aa
                self.aa_hash.values
              else
                self.dna_hash.values
              end
  entropy_hash = {}
  seq_l = sequences[0].size
  (0..(seq_l - 1)).each do |position|
    element = []
    sequences.each do |seq|
      element << seq[position]
    end
    entropy = 0
    element.delete('*')
    element_size = element.size
    element.count_freq.each do |_k,v|
      p = v/element_size.to_f
      entropy += (-p * ::Math.log(p))
    end
    entropy_hash[(position + 1)] = entropy
  end
  return entropy_hash
end

#size ⇒ `Integer`

the size of nt sequence hash of the SeqHash object

Returns:

(Integer) —

size of nt sequence hash of the SeqHash object



174
175
176

# File 'lib/viral_seq/seq_hash.rb', line 174

def size
  self.dna_hash.size
end

#stop_codon(codon_position = 0) ⇒ `Hash`

screen for sequences with stop codons.

Examples:

given a hash of sequences, return a sub-hash with sequences only contains stop codons

my_seqhash = ViralSeq::SeqHash.fa('my_fasta_file.fasta')
my_seqhash.dna_hash
=> {">seq1"=>"ATAAGAACG", ">seq2"=>"ATATGAACG", ">seq3"=>"ATGAGAACG", ">seq4"=>"TATTAGACG", ">seq5"=>"CGCTGAACG"}
stop_codon_seqhash = my_seqhash.stop_codon[:with_stop_codon]
stop_codon_seqhash.dna_hash
=> {">seq2"=>"ATATGAACG", ">seq4"=>"TATTAGACG", ">seq5"=>"CGCTGAACG"}
stop_codon_seqhash.aa_hash
=> {">seq2"=>"I*T", ">seq4"=>"Y*T", ">seq5"=>"R*T"}
stop_codon_seqhash.title
=> "my_fasta_file_stop"
filtered_seqhash = my_seqhash.stop_codon[:without_stop_codon]
filtered_seqhash.aa_hash
{">seq1"=>"IRT", ">seq3"=>"MRT"}

Parameters:

codon_position (Integer) (defaults to: 0) —

option ‘0`, `1` or `2`, indicating 1st, 2nd, 3rd reading frames

Returns:

(Hash) —

of two SeqHash objects seqHash, without_stop_codon: seqHash,

# :with_stop_codon : ViralSeq::SeqHash object with stop codons # :without_stop_codon: ViralSeq::SeqHash object without stop codons

# File 'lib/viral_seq/seq_hash.rb', line 381

def stop_codon(codon_position = 0)
  self.translate(codon_position)
  keys = []
  aa_seqs = self.aa_hash
  aa_seqs.uniq_hash.each do |seq,array_of_name|
    keys += array_of_name if seq.include?('*')
  end
  seqhash1 = self.sub(keys)
  seqhash1.title = self.title + "_stop"
  keys2 = aa_seqs.keys - keys
  seqhash2 = self.sub(keys2)
  return {
    with_stop_codon: seqhash1,
    without_stop_codon: seqhash2
  }
end

#sub(keys) ⇒ `SeqHash`

given an Array of sequence tags, return a sub ViralSeq::SeqHash object with the sequence tags

Parameters:

keys (Array) —

array of sequence tags

Returns:

(SeqHash) —

new SeqHash object with sequences of the input keys

# File 'lib/viral_seq/seq_hash.rb', line 341

def sub(keys)
  h1 = {}
  h2 = {}
  h3 = {}

  keys.each do |k|
    dna = self.dna_hash[k]
    next unless dna
    h1[k] = dna
    aa = self.aa_hash[k]
    h2[k] = aa
    qc = self.qc_hash[k]
    h3[k] = qc
  end
  title = self.title
  file = self.file
  ViralSeq::SeqHash.new(h1,h2,h3,title,file)
end

#tn93 ⇒ `Hash`

TN93 distance functionl, tabulate pairwise comparison of sequence pairs in a sequence alignment, nt sequence only

Examples:

calculate TN93 distribution

sequences = %w{ AAGGCCTT ATGGCCTT AAGGCGTT AAGGCCTT AACGCCTT AAGGCCAT }
my_seqhash = ViralSeq::SeqHash.array(sequences)
my_seqhash.tn93
=> {0=>1, 1=>8, 2=>6}

Returns:

(Hash) —

pairwise distance table in Hash object {:diff => :freq, … } # Note: :diff in different positions (Integer), not percentage.

# File 'lib/viral_seq/seq_hash.rb', line 843

def tn93
  sequences = self.dna_hash.values
  diff = []
  seq_hash = sequences.count_freq
  seq_hash.values.each do |v|
    comb = v * (v - 1) / 2
    comb.times {diff << 0}
  end

  seq_hash.keys.combination(2).to_a.each do |pair|
    s1 = pair[0]
    s2 = pair[1]
    diff_temp = s1.compare_with(s2)
    comb = seq_hash[s1] * seq_hash[s2]
    comb.times {diff << diff_temp}
  end

  count_diff = diff.count_freq
  out_hash = Hash.new(0)
  Hash[count_diff.sort_by{|k,_v|k}].each do |k,v|
    out_hash[k] = v
  end
  return out_hash
end

#to_rsphylip ⇒ `String`

generate sequences in relaxed sequencial phylip format from a ViralSeq::SeqHash object

Examples:

convert fasta format to relaxed sequencial phylip format

# my_fasta_file.fasta
#   >seq1
#   ATAAGAACG
#   >seq2
#   ATATGAACG
#   >seq3
#   ATGAGAACG
my_seqhash = ViralSeq::SeqHash.fa(my_fasta_file.fasta)
puts my_seqhash.to_rsphylip
  #  3 9
  # seq1       ATAAGAACG
  # seq2       ATATGAACG
  # seq3       ATGAGAACG

Returns:

(String) —

relaxed sequencial phylip format in a String object

# File 'lib/viral_seq/seq_hash.rb', line 266

def to_rsphylip
  seqs = self.dna_hash
  outline = "\s" + seqs.size.to_s + "\s" + seqs.values[0].size.to_s + "\n"
  names = seqs.keys
  names.collect!{|n| n.tr(">", "")}
  max_name_l = names.max.size
  max_name_l > 10 ? name_block_l = max_name_l : name_block_l = 10
  seqs.each do |k,v|
    outline += k + "\s" * (name_block_l - k.size + 2) + v.scan(/.{1,10}/).join("\s") + "\n"
  end
  return outline
end

#translate(codon_position = 0) ⇒ `NilClass`

translate the DNA sequences in @dna_hash to amino acid sequences. generate value for @aa_hash

Examples:

translate dna sequences from a FASTA format sequence file

# my_fasta_file.fasta
#   >seq1
#   ATAAGAACG
#   >seq2
#   ATATGAACG
#   >seq3
#   ATGAGAACG
my_seqhash = ViralSeq::SeqHash.fa(my_fasta_file.fasta)
my_seqhash.translate
my_seqhash.aa_sequence
=> {">seq1"=>"IRT", ">seq2"=>"I*T", ">seq3"=>"MRT"}

Parameters:

codon_position (Integer) (defaults to: 0) —

option ‘0`, `1` or `2`, indicating 1st, 2nd, 3rd reading frames

Returns:

(NilClass)

# File 'lib/viral_seq/seq_hash.rb', line 295

def translate(codon_position = 0)
  seqs = self.dna_hash
  @aa_hash = {}
  seqs.uniq_hash.each do |seq, array_of_name|
    s = ViralSeq::Sequence.new('name', seq)
    s.translate(codon_position)
    array_of_name.each do |name|
      @aa_hash[name] = s.aa_string
    end
  end
  return nil
end

#trim(start_nt, end_nt, ref_option = :HXB2, path_to_muscle = false) ⇒ `ViralSeq::SeqHash`

trim dna sequences based on the provided reference coordinates.

Parameters:

start_nt (Integer, Range, Array) —

start nt position(s) on the refernce genome, can be single number (Integer) or a range of Integers (Range), or an Array
end_nt (Integer, Range, Array) —

end nt position(s) on the refernce genome,can be single number (Integer) or a range of Integers (Range), or an Array
ref_option (Symbol) (defaults to: :HXB2) —

, name of reference genomes, options are ‘:HXB2`, `:NL43`, `:MAC239`
path_to_muscle (String) (defaults to: false) —

, path to the muscle executable, if not provided, use MuscleBio to run Muscle

Returns:

(ViralSeq::SeqHash) —

a new ViralSeq::SeqHash object with trimmed sequences

# File 'lib/viral_seq/seq_hash.rb', line 1339

def trim(start_nt, end_nt, ref_option = :HXB2, path_to_muscle = false)
  seq_hash = self.dna_hash.dup
  seq_hash_unique = seq_hash.uniq_hash
  trimmed_seq_hash = {}
  seq_hash_unique.each do |seq, names|
    trimmed_seq = ViralSeq::Sequence.new('', seq).sequence_clip(start_nt, end_nt, ref_option, path_to_muscle).dna
    names.each do |name|
      trimmed_seq_hash[name] = trimmed_seq
    end
  end
  return_seq_hash = self.dup
  return_seq_hash.dna_hash = trimmed_seq_hash
  return return_seq_hash
end

#uniq_dna_hash(tag = "sequence") ⇒ `ViralSeq::SeqHash` Also known as: uniq

collapse @dna_hash to unique sequence hash. sequences will be named as (tag + “_” + order(Integer) + “_” + counts(Integer))

Examples:

dna_hash = {'>seq1' => 'AAAA','>seq2' => 'AAAA', '>seq3' => 'AAAA', '>seq4' => 'CCCC', '>seq5' => 'CCCC', '>seq6' => 'TTTT'} }
a_seq_hash = ViralSeq::SeqHash.new
a_seq_hash.dna_hash = dna_hash
uniq_sequence = a_seq_hash.uniq_dna_hash('master')
=> {">master_1_3"=>"AAAA", ">master_2_2"=>"CCCC", ">master_3_1"=>"TTTT"}

Parameters:

tag (defaults to: "sequence") —

# the master tag for unique sequences,

Returns:

(ViralSeq::SeqHash) —

new SeqHash object of unique sequence hash

# File 'lib/viral_seq/seq_hash.rb', line 319

def uniq_dna_hash(tag = "sequence")
  seqs = self.dna_hash
  uni = seqs.values.count_freq
  new_seq = {}
  n = 1
  uni.each do |s,c|
    name = ">" + tag + "_" + n.to_s + "_" + c.to_s
    new_seq[name] = s
    n += 1
  end
  seq_hash = ViralSeq::SeqHash.new(new_seq)
  seq_hash.title = self.title + "_uniq"
  seq_hash.file = self.file
  return seq_hash
end

#write_nt_fa(file) ⇒ `NilClass`

write the nt sequences to a FASTA format file

Parameters:

file (String) —

path to the FASTA output file

Returns:

(NilClass)

# File 'lib/viral_seq/seq_hash.rb', line 240

def write_nt_fa(file)
  File.open(file, 'w') do |f|
    self.dna_hash.each do |k,v|
      f.puts k
      f.puts v
    end
  end
end

Class: ViralSeq::SeqHash

Overview

Examples:

read a FASTA sequence file of HIV PR sequences, make alignment, perform the QC location check, filter sequences with stop codons and APOBEC3g/f hypermutations, calculate pairwise diversity, calculate minority cut-off based on Poisson model, and examine for drug resistance mutations.

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(dna_hash = {}, aa_hash = {}, qc_hash = {}, title = "", file = "") ⇒ SeqHash

Instance Attribute Details

#aa_hash ⇒ Hash

#dna_hash ⇒ Hash

#file ⇒ String

#qc_hash ⇒ Hash

#title ⇒ String

Class Method Details

.new_from_aa_fasta(infile) ⇒ ViralSeq::SeqHash Also known as: aa_fa

.new_from_array(seq_array, master_tag = 'seq') ⇒ ViralSeq::SeqHash Also known as: array

.new_from_fasta(infile) ⇒ ViralSeq::SeqHash Also known as: fa

Examples:

new ViralSeq::SeqHash object from a FASTA file

.new_from_fastq(fastq_file) ⇒ ViralSeq::SeqHash Also known as: fq

Examples:

new ViralSeq::SeqHash object from a FASTQ file

Instance Method Details

#+(sh2) ⇒ ViralSeq::SeqHash

#a3g_hypermut(ref = nil) ⇒ Hash Also known as: a3g

Examples:

identify apobec3gf mutations from a sequence fasta file

identify apobec3gf mutations from another sequence fasta file

#align(algorithm = :PPP, path_to_muscle = false) ⇒ SeqHash

#check_nt_size ⇒ Hash

#collapse(cutoff = 1) ⇒ ViralSeq::SeqHash

#complete_with_ref(region_config) ⇒ Object

#consensus(cutoff = 0.5) ⇒ String

Examples:

consensus sequence from an array of sequences.

#drm(region_config) ⇒ Object

#error_table(ref = self.consensus, head = true) ⇒ Array

Examples:

error table for an array of sequences

#fdr(error_rate = 0.0001) ⇒ Hash

Examples:

calculate FDR for mutations that appeared twice in the sample dataset

#filter_for_drm(region_config) ⇒ Object

#filter_similar_pid(cutoff = 10) ⇒ ViralSeq::SeqHash

#gap_strip(option = :nt) ⇒ ViralSeq::SeqHash

Examples:

gap strip for an array of sequences

#gap_strip_ends(option = :nt) ⇒ ViralSeq::SeqHash

Examples:

gap strip for an array of sequences only at the ends

#hiv_seq_qc(start_nt, end_nt, indel = true, ref_option = :HXB2, path_to_muscle = false) ⇒ ViralSeq::SeqHash

Examples:

QC for sequences in a FASTA files

#mutation(error_rate = 0.01) ⇒ ViralSeq::SeqHash

#nt_range(range) ⇒ ViralSeq::SeqHash

#nt_variants ⇒ Hash

#nucleotide_pi ⇒ Float Also known as: pi

Examples:

calculate π

#poisson_minority_cutoff(error_rate = 0.0001, fold_cutoff = 20) ⇒ Integer Also known as: pm

Examples:

obtain Poisson minority cut-off from the example sequence FASTA file.

#qc_indel ⇒ Hash

#random_select(n = 100) ⇒ ViralSeq::SeqHash

#sample(n = 1) ⇒ ViralSeq::SeqHash

#sdrm_hiv_in(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ Array

#sdrm_hiv_pr(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ Array

Examples:

identify SDRMs from a FASTA sequence file of HIV PR sequences obtained after MPID-DR sequencing

#sdrm_hiv_rt(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ Array

#sequence_locator(ref_option = :HXB2) ⇒ Array Also known as: loc

#shannons_entropy(option = :nt) ⇒ Hash

Examples:

caculate entropy from the example file

#size ⇒ Integer

#stop_codon(codon_position = 0) ⇒ Hash

Examples:

given a hash of sequences, return a sub-hash with sequences only contains stop codons

#initialize(dna_hash = {}, aa_hash = {}, qc_hash = {}, title = "", file = "") ⇒ `SeqHash`

#aa_hash ⇒ `Hash`

#dna_hash ⇒ `Hash`

#file ⇒ `String`

#qc_hash ⇒ `Hash`

#title ⇒ `String`

.new_from_aa_fasta(infile) ⇒ `ViralSeq::SeqHash` Also known as: aa_fa

.new_from_array(seq_array, master_tag = 'seq') ⇒ `ViralSeq::SeqHash` Also known as: array

.new_from_fasta(infile) ⇒ `ViralSeq::SeqHash` Also known as: fa

.new_from_fastq(fastq_file) ⇒ `ViralSeq::SeqHash` Also known as: fq

#+(sh2) ⇒ `ViralSeq::SeqHash`

#a3g_hypermut(ref = nil) ⇒ `Hash` Also known as: a3g

#align(algorithm = :PPP, path_to_muscle = false) ⇒ `SeqHash`

#check_nt_size ⇒ `Hash`

#collapse(cutoff = 1) ⇒ `ViralSeq::SeqHash`

#complete_with_ref(region_config) ⇒ `Object`

#consensus(cutoff = 0.5) ⇒ `String`

#drm(region_config) ⇒ `Object`

#error_table(ref = self.consensus, head = true) ⇒ `Array`

#fdr(error_rate = 0.0001) ⇒ `Hash`

#filter_for_drm(region_config) ⇒ `Object`

#filter_similar_pid(cutoff = 10) ⇒ `ViralSeq::SeqHash`

#gap_strip(option = :nt) ⇒ `ViralSeq::SeqHash`

#gap_strip_ends(option = :nt) ⇒ `ViralSeq::SeqHash`

#hiv_seq_qc(start_nt, end_nt, indel = true, ref_option = :HXB2, path_to_muscle = false) ⇒ `ViralSeq::SeqHash`

#mutation(error_rate = 0.01) ⇒ `ViralSeq::SeqHash`

#nt_range(range) ⇒ `ViralSeq::SeqHash`

#nt_variants ⇒ `Hash`

#nucleotide_pi ⇒ `Float` Also known as: pi

#poisson_minority_cutoff(error_rate = 0.0001, fold_cutoff = 20) ⇒ `Integer` Also known as: pm

#qc_indel ⇒ `Hash`

#random_select(n = 100) ⇒ `ViralSeq::SeqHash`

#sample(n = 1) ⇒ `ViralSeq::SeqHash`

#sdrm_hiv_in(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ `Array`

#sdrm_hiv_pr(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ `Array`

#sdrm_hiv_rt(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ `Array`

#sequence_locator(ref_option = :HXB2) ⇒ `Array` Also known as: loc

#shannons_entropy(option = :nt) ⇒ `Hash`

#size ⇒ `Integer`

#stop_codon(codon_position = 0) ⇒ `Hash`

#sub(keys) ⇒ `SeqHash`

#tn93 ⇒ `Hash`

#to_rsphylip ⇒ `String`

#translate(codon_position = 0) ⇒ `NilClass`

#trim(start_nt, end_nt, ref_option = :HXB2, path_to_muscle = false) ⇒ `ViralSeq::SeqHash`

#uniq_dna_hash(tag = "sequence") ⇒ `ViralSeq::SeqHash` Also known as: uniq

#write_nt_fa(file) ⇒ `NilClass`