Class: ViralSeq::SeqHash

Inherits:
Object
  • Object
show all
Defined in:
lib/viral_seq/seq_hash.rb,
lib/viral_seq/hivdr.rb

Overview

ViralSeq::SeqHash class for operation on multiple sequences.

Examples:

read a FASTA sequence file of HIV PR sequences, make alignment, perform the QC location check, filter sequences with stop codons and APOBEC3g/f hypermutations, calculate pairwise diversity, calculate minority cut-off based on Poisson model, and examine for drug resistance mutations.

my_pr_seqhash = ViralSeq::SeqHash.fa('my_pr_fasta_file.fasta')
  # new ViralSeq::SeqHash object from a FASTA file
aligned_pr_seqhash = my_pr_seqhash.align
  # align with MUSCLE
filtered_seqhash = aligned_pr_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
  # filter nt sequences with the reference coordinates
filtered_seqhash = aligned_pr_seqhash.stop_codon[:without_stop_codon]
  # return a new ViralSeq::SeqHash object without stop codons
filtered_seqhash = filtered_seqhash.a3g[:filtered_seq]
  # further filter out sequences with A3G hypermutations
filtered_seqhash.pi
  # return pairwise diveristy π
cut_off = filtered_seqhash.pm
  # return cut-off for minority variants based on Poisson model
filtered_seqhash.sdrm_hiv_pr(cut_off)
  # examine for drug resistance mutations for PR region.

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(dna_hash = {}, aa_hash = {}, qc_hash = {}, title = "", file = "") ⇒ SeqHash

initialize a ViralSeq::SeqHash object



25
26
27
28
29
30
31
# File 'lib/viral_seq/seq_hash.rb', line 25

def initialize (dna_hash = {}, aa_hash = {}, qc_hash = {}, title = "", file = "")
  @dna_hash = dna_hash
  @aa_hash = aa_hash
  @qc_hash = qc_hash
  @title = title
  @file = file
end

Instance Attribute Details

#aa_hashHash

Returns Hash object for :name => :amino_acid_sequence_string pairs.

Returns:

  • (Hash)

    Hash object for :name => :amino_acid_sequence_string pairs



37
38
39
# File 'lib/viral_seq/seq_hash.rb', line 37

def aa_hash
  @aa_hash
end

#dna_hashHash

Returns Hash object for :name => :sequence_string pairs.

Returns:

  • (Hash)

    Hash object for :name => :sequence_string pairs



34
35
36
# File 'lib/viral_seq/seq_hash.rb', line 34

def dna_hash
  @dna_hash
end

#fileString

Returns the file that is used to initialize SeqHash object, if it exists.

Returns:

  • (String)

    the file that is used to initialize SeqHash object, if it exists



47
48
49
# File 'lib/viral_seq/seq_hash.rb', line 47

def file
  @file
end

#qc_hashHash

Returns Hash object for :name => :qc_score_string pairs.

Returns:

  • (Hash)

    Hash object for :name => :qc_score_string pairs



40
41
42
# File 'lib/viral_seq/seq_hash.rb', line 40

def qc_hash
  @qc_hash
end

#titleString

default as the file basename if SeqHash object is initialized using ::fa or ::fq

Returns:

  • (String)

    the title of the SeqHash object.



44
45
46
# File 'lib/viral_seq/seq_hash.rb', line 44

def title
  @title
end

Class Method Details

.new_from_aa_fasta(infile) ⇒ ViralSeq::SeqHash Also known as: aa_fa

initialize a new ViralSeq::SeqHash object from a FASTA format sequence file of amino acid sequences

Parameters:

  • infile (String)

    path to the FASTA format sequence file of aa sequences

Returns:



82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
# File 'lib/viral_seq/seq_hash.rb', line 82

def self.new_from_aa_fasta(infile)
  f=File.open(infile,"r")
  return_hash = {}
  name = ""
  while line = f.gets do
    line.tr!("\u0000","")
    next if line == "\n"
    next if line =~ /^\=/
    if line =~ /^\>/
      name = line.chomp
      return_hash[name] = ""
    else
      return_hash[name] += line.chomp.upcase
    end
  end
  f.close
  seq_hash = ViralSeq::SeqHash.new
  seq_hash.aa_hash = return_hash
  seq_hash.title = File.basename(infile,".*")
  seq_hash.file = infile
  return seq_hash
end

.new_from_array(seq_array, master_tag = 'seq') ⇒ ViralSeq::SeqHash Also known as: array

initialize a ViralSeq::SeqHash object with an array of sequence strings

Parameters:

  • master_tag (String) (defaults to: 'seq')

    master tag to put in the sequence names

Returns:



150
151
152
153
154
155
156
157
158
159
160
161
# File 'lib/viral_seq/seq_hash.rb', line 150

def self.new_from_array(seq_array,master_tag = 'seq')
  n = 1
  hash = {}
  seq_array.each do |seq|
    hash[master_tag + "_" + n.to_s] = seq
    n += 1
  end
  seq_hash = ViralSeq::SeqHash.new
  seq_hash.dna_hash = hash
  seq_hash.title = master_tag
  return seq_hash
end

.new_from_fasta(infile) ⇒ ViralSeq::SeqHash Also known as: fa

initialize a new ViralSeq::SeqHash object from a FASTA format sequence file

Examples:

new ViralSeq::SeqHash object from a FASTA file

ViralSeq::SeqHash.fa('my_fasta_file.fasta')

Parameters:

  • infile (String)

    path to the FASTA format sequence file

Returns:



55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
# File 'lib/viral_seq/seq_hash.rb', line 55

def self.new_from_fasta(infile)
  f=File.open(infile,"r")
  return_hash = {}
  name = ""
  while line = f.gets do
    line.tr!("\u0000","")
    next if line == "\n"
    next if line =~ /^\=/
    if line =~ /^\>/
      name = line.chomp
      return_hash[name] = ""
    else
      return_hash[name] += line.chomp.upcase
    end
  end
  f.close
  seq_hash = ViralSeq::SeqHash.new
  seq_hash.dna_hash = return_hash
  seq_hash.title = File.basename(infile,".*")
  seq_hash.file = infile
  return seq_hash
end

.new_from_fastq(fastq_file) ⇒ ViralSeq::SeqHash Also known as: fq

initialize a new ViralSeq::SeqHash object from a FASTQ format sequence file

Examples:

new ViralSeq::SeqHash object from a FASTQ file

ViralSeq::SeqHash.fq('my_fastq_file.fastq')

Parameters:

  • fastq_file (String)

    path to the FASTA format sequence file

Returns:



111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
# File 'lib/viral_seq/seq_hash.rb', line 111

def self.new_from_fastq(fastq_file)
  count = 0
  sequence_a = []
  quality_a = []
  count_seq = 0

  File.open(fastq_file,'r') do |file|
    file.readlines.collect do |line|
      line.tr!("\u0000","")
      next if line == "\n"
      count +=1
      count_m = count % 4
      if count_m == 1
        line.tr!('@','>')
        sequence_a << line.chomp
        quality_a << line.chomp
        count_seq += 1
      elsif count_m == 2
        sequence_a << line.chomp
      elsif count_m == 0
        quality_a << line.chomp
      end
    end
  end
  sequence_hash = Hash[sequence_a.each_slice(2).to_a]
  quality_hash = Hash[quality_a.each_slice(2).to_a]

  seq_hash = ViralSeq::SeqHash.new
  seq_hash.dna_hash = sequence_hash
  seq_hash.qc_hash = quality_hash
  seq_hash.title = File.basename(fastq_file,".*")
  seq_hash.file = fastq_file
  return seq_hash
end

Instance Method Details

#+(sh2) ⇒ ViralSeq::SeqHash

combine SeqHash objects

Parameters:

Returns:



182
183
184
185
186
187
188
189
190
# File 'lib/viral_seq/seq_hash.rb', line 182

def +(sh2)
  new_seqhash = ViralSeq::SeqHash.new
  new_seqhash.dna_hash = self.dna_hash.merge(sh2.dna_hash)
  new_seqhash.aa_hash = self.aa_hash.merge(sh2.aa_hash)
  new_seqhash.qc_hash = self.qc_hash.merge(sh2.qc_hash)
  new_seqhash.title = self.title + "_with_" + sh2.title
  new_seqhash.file = self.file + "," + sh2.file
  return new_seqhash
end

#a3g_hypermut(ref = nil) ⇒ Hash Also known as: a3g

function to determine if the sequences have APOBEC3g/f hypermutation.

# APOBEC3G/F pattern: GRD -> ARD
# control pattern: G[YN|RC] -> A[YN|RC]
# use the sample consensus to determine potential a3g sites (default) or provide external reference sequences as a `String`
# Two criteria to identify hypermutation
# 1. Fisher's exact test on the frequencies of G to A mutation at A3G positions vs. non-A3G positions
# 2. Poisson distribution of G to A mutations at A3G positions, outliers sequences
# note:  criteria 2 only applies on a sequence file containing more than 20 sequences,
#        b/c Poisson model does not do well on small sample size.

Examples:

identify apobec3gf mutations from a sequence fasta file

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_a3g_sequence1.fasta')
hypermut = my_seqhash.a3g
hypermut[:a3g_seq].dna_hash.keys
=> [">Seq7", ">Seq14"]
hypermut[:filtered_seq].dna_hash.keys
=> [">Seq1", ">Seq2", ">Seq5"]
hypermut[:stats]
=> [[">Seq7", 23, 68, 1, 54, 18.26, 4.308329383112348e-06], [">Seq14", 45, 68, 9, 54, 3.97, 5.2143571971582974e-08]]

identify apobec3gf mutations from another sequence fasta file

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_a3g_sequence2.fasta')
hypermut = my_seqhash.a3g
hypermut[:stats]
=> [[">CTAACACTCA_134_a3g-sample2", 4, 35, 0, 51, Infinity, 0.02465676660128911], [">ATAGTGCCCA_60_a3g-sample2", 4, 35, 1, 51, 5.83, 0.1534487353839561]]
# notice sequence ">ATAGTGCCCA_60_a3g-sample2" has a p value at 0.15, greater than 0.05,
# but it is still called as hypermutation sequence b/c it's Poisson outlier sequence.

Returns:

  • (Hash)

    three paris. :a3g_seq: a ViralSeq:SeqHash object for sequences with hypermutations :filtered_seq : a ViralSeq:SeqHash object for sequences without hypermutations :stats : a two-demensional array ‘[[a,b], [c,d]]` for statistic_info, including the following information,

    # sequence tag
    # G to A mutation numbers at potential a3g positions
    # total potential a3g G positions
    # G to A mutation numbers at non a3g positions
    # total non a3g G positions
    # a3g G to A mutation rate / non-a3g G to A mutation rate
    # Fishers Exact P-value
    

See Also:



489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
# File 'lib/viral_seq/seq_hash.rb', line 489

def a3g_hypermut(ref = nil)
  # mut_hash number of apobec3g/f mutations per sequence
  mut_hash = {}
  hm_hash = {}
  out_hash = {}

  # total G->A mutations at apobec3g/f positions.
  total = 0

  unless ref
    # make consensus sequence for the input sequence hash
    ref = self.consensus
  end

  # obtain apobec3g positions and control positions
  apobec = apobec3gf(ref)
  mut = apobec[0]
  control = apobec[1]

  self.dna_hash.each do |k,v|
    a = 0 # muts
    b = 0 # potential mut sites
    c = 0 # control muts
    d = 0 # potenrial controls
    mut.each do |n|
      if v[n] == "A"
        a += 1
        b += 1
      else
        b += 1
      end
    end
    mut_hash[k] = a
    total += a

    control.each do |n|
      if v[n] == "A"
        c += 1
        d += 1
      else
        d += 1
      end
    end
    rr = (a/b.to_f)/(c/d.to_f)

    t1 = b - a
    t2 = d - c

    fet = ViralSeq::Rubystats::FishersExactTest.new
    fisher = fet.calculate(t1,t2,a,c)
    perc = fisher[:twotail]
    info = [k, a, b, c, d, rr.round(2), perc]
    out_hash[k] = info
    if perc < 0.05
      hm_hash[k] = info
    end
  end

  if self.dna_hash.size > 200
    rate = total.to_f/(self.dna_hash.size)
    count_mut = mut_hash.values.count_freq
    maxi_count = count_mut.values.max
    poisson_hash = ViralSeq::Math::PoissonDist.new(rate,maxi_count).poisson_hash
    cut_off = 0
    poisson_hash.each do |k,v|
      cal = self.dna_hash.size * v
      obs = count_mut[k]
      if obs >= 20 * cal
        cut_off = k
        break
      elsif k == maxi_count
        cut_off = maxi_count
      end
    end
    mut_hash.each do |k,v|
      if v > cut_off
        hm_hash[k] = out_hash[k]
      end
    end
  end

  hm_seq_hash = ViralSeq::SeqHash.new
  hm_hash.each do |k,_v|
    hm_seq_hash.dna_hash[k] = self.dna_hash[k]
  end

  hm_seq_hash.title = self.title + "_hypermut"
  hm_seq_hash.file = self.file
  filtered_seq_hash = self.sub(self.dna_hash.keys - hm_hash.keys)
  return { a3g_seq: hm_seq_hash,
           filtered_seq: filtered_seq_hash,
           stats: hm_hash.values
          }
end

#align(algorithm = :PPP, path_to_muscle = false) ⇒ SeqHash

align the @dna_hash sequences, return a new ViralSeq::SeqHash object with aligned @dna_hash using MUSCLE

Parameters:

  • algorithm (Symbol) (defaults to: :PPP)

    , algorithm for MUSCLE5 only. Choose from :PPP or :Super5.

  • path_to_muscle (String) (defaults to: false)

    , path to MUSCLE excutable. if not provided (as default), it will use RubyGem::MuscleBio

Returns:

  • (SeqHash)

    new SeqHash object of the aligned @dna_hash, the title has “_aligned”



720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
# File 'lib/viral_seq/seq_hash.rb', line 720

def align(algorithm = :PPP, path_to_muscle = false)
  seq_hash = self.dna_hash
  if self.file.size > 0
    temp_dir = File.dirname(self.file)
  else
    temp_dir=File.dirname($0)
  end

  temp_file = File.join(temp_dir, "_temp_muscle_in")
  temp_aln = File.join(temp_dir, "_temp_muscle_aln")
  File.open(temp_file, 'w'){|f| seq_hash.each {|k,v| f.puts k; f.puts v}}
  if path_to_muscle
    unless ViralSeq::Muscle.check_muscle?(path_to_muscle)
      File.unlink(temp_file)
      return nil
    end
    print `#{path_to_muscle} -in #{temp_file} -out #{temp_aln} -quiet`
  else
    if MuscleBio::VERSION.to_f < 0.5
      MuscleBio.run("muscle -in #{temp_file} -out #{temp_aln} -quiet")
    else
      MuscleBio.exec(temp_file, temp_aln, algorithm)
    end
  end
  out_seq_hash = ViralSeq::SeqHash.fa(temp_aln)
  out_seq_hash.title = self.title + "_aligned"
  out_seq_hash.file = self.file
  File.unlink(temp_file)
  File.unlink(temp_aln)
  return out_seq_hash
end

#check_nt_sizeHash

check the size range of the DNA sequences of the SeqHash object

Returns:

  • (Hash)

    Hash of MAX_SIZE, min: MIN_SIZE



227
228
229
230
231
232
233
234
# File 'lib/viral_seq/seq_hash.rb', line 227

def check_nt_size
  dna_hash = self.dna_hash
  size_array = []
  dna_hash.values.each do |v|
    size_array << v.size
  end
  return { max: size_array.max, min: size_array.min }
end

#collapse(cutoff = 1) ⇒ ViralSeq::SeqHash

Collapse sequences by difference cut-offs. Suggesting aligning before using this function.

Parameters:

  • cutoff (Integer) (defaults to: 1)

    nt base differences. collapse sequences within [cutoff] differences

Returns:



1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
# File 'lib/viral_seq/seq_hash.rb', line 1030

def collapse(cutoff=1)
  seq_array = self.dna_hash.values
  new_seq_freq = {}
  seq_freq = seq_array.count_freq
  if seq_freq.size == 1
    new_seq_freq = seq_freq
  else
    uniq_seq = seq_freq.keys
    unique_seq_pair = uniq_seq.combination(2)
    dupli_seq = []
    unique_seq_pair.each do |pair|
      seq1 = pair[0]
      seq2 = pair[1]
      diff = seq1.compare_with(seq2)
      if diff <= cutoff
        freq1 = seq_freq[seq1]
        freq2 = seq_freq[seq2]
        freq1 >= freq2 ? dupli_seq << seq2 : dupli_seq << seq1
      end
    end

    seq_freq.each do |seq,freq|
      unless dupli_seq.include?(seq)
        new_seq_freq[seq] = freq
      end
    end
  end
  seqhash = ViralSeq::SeqHash.new
  n = 1
  new_seq_freq.each do |seq,freq|
    name = ">seq_" + n.to_s + '_' + freq.to_s
    seqhash.dna_hash[name] = seq
    n += 1
  end
  return seqhash
end

#complete_with_ref(region_config) ⇒ Object



531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
# File 'lib/viral_seq/hivdr.rb', line 531

def complete_with_ref(region_config)
  complete_seqs = {}
  seq_coord = region_config.seq_coord

  ref = ViralSeq::RefSeq.get(region_config.ref_info["ref_type"].to_sym)
  a = region_config.ref_info["ref_coord"][0]
  b = region_config.ref_info["ref_coord"][1]
  c = seq_coord["minimum"]
  d = seq_coord["maximum"]

  if seq_coord["gap"]
    e = seq_coord["gap"]["minimum"]
    f = seq_coord["gap"]["maximum"]

    self.dna_hash.each do |k,v|
      complete_seqs[k] = ref[(a-1)..(c-2)] + v[0,(e-c)] + ref[(e-1)..(f-1)] + v[(e-c)..-1] + ref[d..(b-1)]
    end
  else
    self.dna_hash.each do |k,v|
      complete_seqs[k] = ref[(a-1)..(c-2)] + v + ref[d..(b-1)]
    end
  end

  return ViralSeq::SeqHash.new(complete_seqs)
end

#consensus(cutoff = 0.5) ⇒ String

create one consensus sequence from @dna_hash with an optional majority cut-off for mixed bases.

Examples:

consensus sequence from an array of sequences.

seq_array = %w{ ATTTTTTTTT
                AATTTTTTTT
                AAATTTTTTT
                AAAATTTTTT
                AAAAATTTTT
                AAAAAATTTT
                AAAAAAATTT
                AAAAAAAATT
                AAAAAAAAAT
                AAAAAAAAAA }
my_seqhash = ViralSeq::SeqHash.array(seq_array)
my_seqhash.consensus
=> 'AAAAAWTTTT'
my_seqhash.consensus(0.7)
=> 'AAAANNNTTT'

Parameters:

  • cutoff (Float) (defaults to: 0.5)

    majority cut-off for calling consensus bases. defult at (0.5), position with 15% “A” and 85% “G” will be called as “G” with 20% cut-off and as “R” with 10% cut-off. Using (0) will return use simply majority rule (no cutoff)

Returns:

  • (String)

    consensus sequence



419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
# File 'lib/viral_seq/seq_hash.rb', line 419

def consensus(cutoff = 0.5)
  seq_array = self.dna_hash.values
  seq_length = seq_array[0].size
  seq_size = seq_array.size
  consensus_seq = ""
  (0..(seq_length - 1)).each do |position|
    all_base = []
    seq_array.each do |seq|
      if seq[position]
        all_base << seq[position]
      end
    end
    base_count = all_base.count_freq
    max_base_list = []

    if cutoff.zero?
      max_count = base_count.values.max
      max_base_hash = base_count.select {|_k,v| v == max_count}
      max_base_list = max_base_hash.keys
    else
      base_count.each do |k,v|
        if v/seq_size.to_f >= cutoff
          max_base_list << k
        end
      end
    end
    consensus_seq += call_consensus_base(max_base_list)
  end
  return consensus_seq
end

#drm(region_config) ⇒ Object

function to interpret HIV drms with ViralSeq::DrmRegionConfig as a param.



560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
# File 'lib/viral_seq/hivdr.rb', line 560

def drm(region_config)
  region = region_config.region
  fdr_hash = self.fdr # must run fdr before the completion of the sequences

  complete_gene = self.complete_with_ref(region_config)
  sequences = complete_gene.dna_hash

  n_seq = sequences.size
  aa = {}
  mut = {}
  mut_com = []
  point_mutation_list = []

  drm_list = region_config.drm_list

  sequences.each do |name, seq|
    s = ViralSeq::Sequence.new(name, seq)
    s.translate
    aa[name] = s.aa_string

    records_per_seq = {}

    drm_list.each do |drm_class, list|

      mut[drm_class] = {}  if !mut[drm_class]

      record = s.check_drm(list)
      records_per_seq = records_per_seq.merge(record)

      record.each do |position, mutation|
        if !mut[drm_class][position]
          mut[drm_class][position] = [mutation[0],[]]
        end
        mut[drm_class][position][1] << mutation[1]
      end
    end

    mut_com << records_per_seq.sort.to_h
  end

  mut.each do |drm_class, mutations|
    mutations.each do |position, mutation|
      wt = mutation[0]
      mut_list = mutation[1]
      count_mut_list = mut_list.count_freq
      count_mut_list.each do |m,number|
        ci = ViralSeq::Math::BinomCI.new(number, n_seq)
        fdr = fdr_hash[number].round(5)
        label = fdr >= 0.05 ? "*" : ""
        point_mutation_list << [drm_class, n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
      end
    end
  end

  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  aa_start = 1

  aa_size = aa.values[0].size - 1

  (0..aa_size).to_a.each do |p|
    aas = []
    aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]

end

#error_table(ref = self.consensus, head = true) ⇒ Array

return an table of frequencies of nucleotides at each position.

Examples:

error table for an array of sequences

array = %w{ AACCGGTT
            AGCCGGTT
            AACTGCTT
            AACCGTTA
            AACCGGTA }
my_seqhash = ViralSeq::SeqHash.array(array)
my_seqhash.error_table.each {|r| puts r.join(',')}
  position,consensus,total_seq_number,A,C,G,T
  1,A,5,-,,,
  2,A,5,-,,0.2,
  3,C,5,,-,,
  4,C,5,,-,,0.2
  5,G,5,,,-,
  6,G,5,,0.2,-,0.2
  7,T,5,,,,-
  8,T,5,0.4,,,-

Parameters:

  • ref (String) (defaults to: self.consensus)

    a reference sequence to compare with, default as the sample consensus sequence

  • head (Boolean) (defaults to: true)

    if the head of table is included.

Returns:

  • (Array)

    a two-dimension array of the frequency table, including the following info:

    position on the sequence (starting from 1)
    consensus nucleotide
    total sequence numbers
    percentage of A, shows "-" if agrees with consensus
    percentage of C, shows "-" if agrees with consensus
    percentage of G, shows "-" if agrees with consensus
    percentage of T, shows "-" if agrees with consensus
    


1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
# File 'lib/viral_seq/seq_hash.rb', line 1246

def error_table(ref = self.consensus, head = true)

  table = []
  if head
    table << %w{
      position
      consensus
      total_seq_number
      A
      C
      G
      T
    }
  end
  ref_size = ref.size

  (0..(ref_size - 1)).each do |position|
    ref_base = ref[position]
    nts = []

    self.dna_hash.each do |_k,v|
      nts << v[position]
    end

    freq = nts.count_freq
    freq2 = {}

    freq.each do |nt,c|
      if nt == ref_base
        freq2[nt] = '-'
      else
        freq2[nt] = (c/(self.size).to_f)
      end
    end

    table << [(position + 1),ref_base,self.size,freq2['A'],freq2['C'],freq2['G'],freq2['T']]
  end

  return table

end

#fdr(error_rate = 0.0001) ⇒ Hash

calculate false detection rate for minority mutations Credit: Prof. Michael G. Hudgens from UNC-CH for providing the method for fdr calculation

Examples:

calculate FDR for mutations that appeared twice in the sample dataset

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_sequence_for_poisson.fasta')
fdr_hash = my_seqhash.fdr
fdr_hash[2].round(5)
=> 0.00726 # means that mutations appear twice have 0.007261748 chance to be caused by residual errors.

Parameters:

  • error_rate (Float) (defaults to: 0.0001)

    estimated sequencing error rate

Returns:

  • (Hash)

    pair of mutation frequency to false detection rate. (freq => fdr)



632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
# File 'lib/viral_seq/seq_hash.rb', line 632

def fdr(error_rate = 0.0001)
  sequences = self.dna_hash.values
  if sequences.size == 0
    return {}
  else
    seq_count = self.size
    observed_hash = variant_for_poisson(sequences)
    p_unadjusted = []
    observed_hash.each do |k, v|
      p_value = 1 - `Rscript -e "cat(pbinom(#{k}-1, #{seq_count}, #{error_rate}))"`.to_f # compute unadjusted exact p-value, ie under null, probability of observing observed_hash[k] or more extreme
      p_unadjusted += Array.new(v, p_value)
    end
    p_fdr = `Rscript -e "cat(p.adjust(c(#{p_unadjusted.join(',')}), 'fdr'))"`.split("\s").count_freq.to_a # controls fdr. aka Benjamini-Hochberg correction
    vars_pair = observed_hash.to_a
    fdr_hash = Hash.new(0)
    (0..(p_fdr.size - 1)).each do |i|
      fdr_hash[vars_pair[i][0]] = p_fdr[i][0].to_f
    end
    return fdr_hash
  end
end

#filter_for_drm(region_config) ⇒ Object

wrapper function for #a3g_hypermut and #stop_codon with ViralSeq::DrmRegionConfig as a param.



466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
# File 'lib/viral_seq/hivdr.rb', line 466

def filter_for_drm(region_config)
  seq_coord = region_config.seq_coord
  reading_frame_number = region_config.get_reading_frame_number

  if !seq_coord["gap"]

    a3g_check = self.a3g
    a3g_seqs = a3g_check[:a3g_seq]
    a3g_filtered_seqs = a3g_check[:filtered_seq]

    stop_codon_check = a3g_filtered_seqs.stop_codon(reading_frame_number[0])
    stop_codon_seqs = stop_codon_check[:with_stop_codon]
    filtered_seqs = stop_codon_check[:without_stop_codon]

    return {
      filtered_seq: filtered_seqs,
      a3g_seq: a3g_seqs,
      stop_codon_seq: stop_codon_seqs
    }

  else

    r1_length, r2_length = region_config.r1_r2_length.values

    r1_seqs = {}
    r2_seqs = {}

    self.dna_hash.each do |k,v|
        r1_seqs[k] = v[0,r1_length]
        r2_seqs[k] = v[r1_length, r2_length]
    end

    r1_sh = ViralSeq::SeqHash.new(r1_seqs)
    r2_sh = ViralSeq::SeqHash.new(r2_seqs)

    a3g_seqs_r1 = r1_sh.a3g[:a3g_seq]
    a3g_seqs_r2 = r2_sh.a3g[:a3g_seq]

    stop_codon_r1 = r1_sh.stop_codon(reading_frame_number[0])[:with_stop_codon]
    stop_codon_r2 = r2_sh.stop_codon(reading_frame_number[1])[:with_stop_codon]

    a3g_seq_keys = (a3g_seqs_r1.dna_hash.keys | a3g_seqs_r2.dna_hash.keys)
    a3g_seqs = ViralSeq::SeqHash.new(self.dna_hash.select {|k, _v| a3g_seq_keys.include? k})

    stop_codon_keys = (stop_codon_r1.dna_hash.keys | stop_codon_r2.dna_hash.keys)
    stop_codon_seqs = ViralSeq::SeqHash.new(self.dna_hash.select {|k, _v| stop_codon_keys.include? k})

    reject_keys = (a3g_seq_keys | stop_codon_keys)

    filtered_seqs = ViralSeq::SeqHash.new(self.dna_hash.reject { |k, _v| reject_keys.include? k })

    return {
      filtered_seq: filtered_seqs,
      a3g_seq: a3g_seqs,
      stop_codon_seq: stop_codon_seqs
    }

  end

end

#filter_similar_pid(cutoff = 10) ⇒ ViralSeq::SeqHash

Remove sequences with residual offspring Primer IDs.

Compare PID with sequences which have identical sequences.
PIDs differ by 1 base will be recognized. If PID1 is x time (cutoff) greater than PID2, PID2 will be disgarded.
  each sequence tag starting with ">" and the Primer ID sequence
  followed by the number of Primer ID appeared in the raw sequence
  the information sections in the tags are separated by underscore "_"
  example sequence tag: >AGGCGTAGA_32_sample1_RT

Parameters:

  • cutoff (Integer) (defaults to: 10)

    the fold cut-off to remove the potential residual offspring Primer IDs

Returns:

  • (ViralSeq::SeqHash)

    a new SeqHash object without sqeuences containing residual offspring Primer ID



968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
# File 'lib/viral_seq/seq_hash.rb', line 968

def filter_similar_pid(cutoff = 10)
  seq = self.dna_hash.dup
  uni_seq = seq.values.uniq
  uni_seq_pid = {}
  uni_seq.each do |k|
    seq.each do |name,s|
      name = name[1..-1]
      if k == s
        if uni_seq_pid[k]
          uni_seq_pid[k] << [name.split("_")[0],name.split("_")[1]]
        else
          uni_seq_pid[k] = []
          uni_seq_pid[k] << [name.split("_")[0],name.split("_")[1]]
        end
      end
    end
  end

  dup_pid = []
  uni_seq_pid.values.each do |v|
    next if v.size == 1
    pid_hash = Hash[v]
    list = pid_hash.keys
    list2 = Array.new(list)
    pairs = []

    list.each do |k|
      list2.delete(k)
      list2.each do |k1|
        pairs << [k,k1]
      end
    end

    pairs.each do |p|
      pid1 = p[0]
      pid2 = p[1]
      if pid1.compare_with(pid2) <= 1
        n1 = pid_hash[pid1].to_i
        n2 = pid_hash[pid2].to_i
        if n1 >= cutoff * n2
          dup_pid << pid2
        elsif n2 >= cutoff * n1
          dup_pid << pid1
        end
      end
    end
  end

  new_seq = {}
  seq.each do |name,s|
    pid = name.split("_")[0][1..-1]
    unless dup_pid.include?(pid)
      new_seq[name] = s
    end
  end
  self.sub(new_seq.keys)
end

#gap_strip(option = :nt) ⇒ ViralSeq::SeqHash

gap strip from a sequence alignment, all positions that contains gaps (‘-’) will be removed

Examples:

gap strip for an array of sequences

array = ["AACCGGTT", "A-CCGGTT", "AAC-GGTT", "AACCG-TT", "AACCGGT-"]
array = %w{ AACCGGTT
            A-CCGGTT
            AAC-GGTT
            AACCG-TT
            AACCGGT- }
my_seqhash = ViralSeq::SeqHash.array(array)
puts my_seqhash.gap_strip.dna_hash.values
  ACGT
  ACGT
  ACGT
  ACGT
  ACGT

Parameters:

  • option (Symbol) (defaults to: :nt)

    sequence options for ‘:nt` or `:aa`

Returns:

  • (ViralSeq::SeqHash)

    a new SeqHash object containing nt or aa sequences without gaps



1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
# File 'lib/viral_seq/seq_hash.rb', line 1085

def gap_strip(option = :nt)
  if option == :nt
    sequence_alignment = self.dna_hash
  elsif option == :aa
    sequence_alignment = self.aa_hash
  else
    raise "Option `#{option}` not recognized"
  end

  new_seq = {}
  seq_size = sequence_alignment.values[0].size
  seq_matrix = {}
  (0..(seq_size - 1)).each do |p|
    seq_matrix[p] = []
    sequence_alignment.values.each do |s|
      seq_matrix[p] << s[p]
    end
  end

  seq_matrix.delete_if do |_p, list|
    list.include?("-")
  end

  sequence_alignment.each do |n,s|
    new_s = ""
    seq_matrix.keys.each {|p| new_s += s[p]}
    new_seq[n] = new_s
  end
  new_seq_hash = ViralSeq::SeqHash.new
  if option == :nt
    new_seq_hash.dna_hash = new_seq
    new_seq_hash.aa_hash = self.aa_hash
  elsif option == :aa
    new_seq_hash.dna_hash = self.dna_hash
    new_seq_hash.aa_hash = new_seq
  end
  new_seq_hash.qc_hash = self.qc_hash
  new_seq_hash.title = self.title + "_strip"
  new_seq_hash.file = self.file
  return new_seq_hash
end

#gap_strip_ends(option = :nt) ⇒ ViralSeq::SeqHash

gap strip from a sequence alignment at both ends, only positions at the ends that contains gaps (‘-’) will be removed.

Examples:

gap strip for an array of sequences only at the ends

array = %w{ AACCGGTT
            A-CCGGTT
            AAC-GGTT
            AACCG-TT
            AACCGGT- }
my_seqhash = ViralSeq::SeqHash.array(array)
puts my_seqhash.gap_strip_ends.dna_hash.values
  AACCGGT
  A-CCGGT
  AAC-GGT
  AACCG-T
  AACCGGT

Parameters:

  • option (Symbol) (defaults to: :nt)

    sequence options for ‘:nt` or `:aa`

Returns:

  • (ViralSeq::SeqHash)

    a new SeqHash object containing nt or aa sequences without gaps at the ends



1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
# File 'lib/viral_seq/seq_hash.rb', line 1144

def gap_strip_ends(option = :nt)
  if option == :nt
    sequence_alignment = self.dna_hash
  elsif option == :aa
    sequence_alignment = self.aa_hash
  else
    raise "Option #{option} not recognized"
  end
  new_seq = {}
  seq_size = sequence_alignment.values[0].size
  seq_matrix = {}
  (0..(seq_size - 1)).each do |p|
    seq_matrix[p] = []
    sequence_alignment.values.each do |s|
      seq_matrix[p] << s[p]
    end
  end
  n1 = 0
  n2 = 0
  seq_matrix.each do |_p, list|
    if list.include?("-")
      n1 += 1
    else
      break
    end
  end

  seq_matrix.keys.reverse.each do |p|
    list = seq_matrix[p]
    if list.include?("-")
      n2 += 1
    else
      break
    end
  end

  sequence_alignment.each do |n,s|
    new_s = s[n1..(- n2 - 1)]
    new_seq[n] = new_s
  end
  new_seq_hash = ViralSeq::SeqHash.new
  if option == :nt
    new_seq_hash.dna_hash = new_seq
    new_seq_hash.aa_hash = self.aa_hash
  elsif option == :aa
    new_seq_hash.dna_hash = self.dna_hash
    new_seq_hash.aa_hash = new_seq
  end
  new_seq_hash.qc_hash = self.qc_hash
  new_seq_hash.title = self.title + "_strip"
  new_seq_hash.file = self.file
  return new_seq_hash
end

#hiv_seq_qc(start_nt, end_nt, indel = true, ref_option = :HXB2, path_to_muscle = false) ⇒ ViralSeq::SeqHash

quality check for HIV sequences based on ViralSeq::Sequence#locator, check if sequences are in the target range

Examples:

QC for sequences in a FASTA files

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_seq.fasta')
filtered_seqhash = my_seqhash.hiv_seq_qc([4384,4386], 4750..4752, false, :HXB2)
my_seqhash.dna_hash.size
=> 6
filtered_seqhash.dna_hash.size
=> 4

Parameters:

  • start_nt (Integer, Range, Array)

    start nt position(s) on the refernce genome, can be single number (Integer) or a range of Integers (Range), or an Array

  • end_nt (Integer, Range, Array)

    end nt position(s) on the refernce genome,can be single number (Integer) or a range of Integers (Range), or an Array

  • indel (Boolean) (defaults to: true)

    allow indels or not, ‘ture` or `false`

  • ref_option (Symbol) (defaults to: :HXB2)

    , name of reference genomes, options are ‘:HXB2`, `:NL43`, `:MAC239`

  • path_to_muscle (String) (defaults to: false)

    , path to the muscle executable, if not provided, use MuscleBio to run Muscle

Returns:

  • (ViralSeq::SeqHash)

    a new ViralSeq::SeqHash object with only the sequences that meet the QC criterias



883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
# File 'lib/viral_seq/seq_hash.rb', line 883

def hiv_seq_qc(start_nt, end_nt, indel=true, ref_option = :HXB2, path_to_muscle = false)
  start_nt = start_nt..start_nt if start_nt.is_a?(Integer)
  end_nt = end_nt..end_nt if end_nt.is_a?(Integer)
  seq_hash = self.dna_hash.dup
  seq_hash_unique = seq_hash.values.uniq
  seq_hash_unique_pass = []

  seq_hash_unique.each do |seq|
    next if seq.nil?
    loc = ViralSeq::Sequence.new('', seq).locator(ref_option, path_to_muscle)
    next unless loc # if locator tool fails, skip this seq.
    if start_nt.include?(loc[0]) && end_nt.include?(loc[1])
      if indel
        seq_hash_unique_pass << seq
      elsif loc[3] == false
        seq_hash_unique_pass << seq
      end
    end
  end
  seq_pass = []
  seq_hash_unique_pass.each do |seq|
    seq_hash.each do |seq_name, orginal_seq|
      if orginal_seq == seq
        seq_pass << seq_name
        seq_hash.delete(seq_name)
      end
    end
  end
  self.sub(seq_pass)
end

#mutation(error_rate = 0.01) ⇒ ViralSeq::SeqHash

mutate @dna_hash based on the error_rate

Parameters:

  • error_rate (Float) (defaults to: 0.01)

    error rate used to mutate sequences.

Returns:



1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
# File 'lib/viral_seq/seq_hash.rb', line 1203

def mutation(error_rate = 0.01)
  new_seqhash = ViralSeq::SeqHash.new
  dna = {}
  self.dna_hash.each do |name, seq|
    dna[name + '_mut-' + error_rate.to_s] = seq.mutation(error_rate)
  end
  new_seqhash.dna_hash = dna
  new_seqhash.title = self.title + "_mut-" + error_rate.to_s
  new_seqhash.file = self.file
  return new_seqhash
end

#nt_range(range) ⇒ ViralSeq::SeqHash

return a new SeqHash object with given a range on the nt sequence position

Parameters:

  • range (Range)

    range of positions on the nt sequence

Returns:



215
216
217
218
219
220
221
222
# File 'lib/viral_seq/seq_hash.rb', line 215

def nt_range(range)
  dna_hash = self.dna_hash
  new_hash = {}
  dna_hash.each do |k,v|
    new_hash[k] = v[range]
  end
  ViralSeq::SeqHash.new(new_hash)
end

#nt_variantsHash

analysis for the nt sequence variants.

Returns:

  • (Hash)

    An Hash with information of variant analysis. Key/values of the return object see /docs/variants_structure.pdf



657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
# File 'lib/viral_seq/seq_hash.rb', line 657

def nt_variants
  return_obj = {}
  nt_hash = self.dna_hash
  tcs_number = self.size
  dl = ViralSeq::TcsCore.detection_limit(tcs_number)
  fdr_hash = self.fdr
  pm_cut_off = self.pm
  con = self.consensus
  return_obj[:tcs_number] = tcs_number
  return_obj[:lower_detection_limit] = dl
  return_obj[:pm_cut_off] = pm_cut_off
  return_obj[:positions] = []
  cis = {}

  (0..(con.size - 1)).each do |p|
    position_obj = {}
    position_obj[:position] = p + 1
    position_obj[:tcs_number] = tcs_number
    position_obj[:lower_detection_limit] = dl
    position_obj[:pm_cut_off] = (pm_cut_off == Float::INFINITY ? pm_cut_off.to_s : pm_cut_off)

    nts = []
    dna_hash.each do |n,s|
      nts << s[p]
    end
    freq_hash = nts.count_freq
    [:A, :C, :G, :T, :-].each do |k|
      v = freq_hash[k.to_s]
      position_obj[k] = {}
      position_obj[k][:count] = v
      if v > 0
        if cis[[v, tcs_number]]
          ci = cis[[v, tcs_number]]
        else
          ci = ViralSeq::Math::BinomCI.new(v, tcs_number)
          cis[[v, tcs_number]] = ci
        end
        position_obj[k][:freq] = ci.mean.round(4)
        position_obj[k][:freq_ci_low] = ci.lower.round(4)
        position_obj[k][:freq_ci_high] = ci.upper.round(4)
        position_obj[k][:greater_than_pm] = (v >= pm_cut_off ? true : false)
        position_obj[k][:fdr] = fdr_hash[v]
      else
        position_obj[k][:freq] = 0
        position_obj[k][:freq_ci_low] = 0
        position_obj[k][:freq_ci_high] = 0
        position_obj[k][:greater_than_pm] = false
        position_obj[k][:fdr] = nil
      end
    end

    return_obj[:positions] << position_obj
  end

  return_obj
end

#nucleotide_piFloat Also known as: pi

Function to calculate nucleotide diversity π, for nt sequence only

Examples:

calculate π

sequences = %w{ AAGGCCTT ATGGCCTT AAGGCGTT AAGGCCTT AACGCCTT AAGGCCAT }
my_seqhash = ViralSeq::SeqHash.array(sequences)
my_seqhash.pi
=> 0.16667

Returns:

  • (Float)

    nucleotide diversity π

See Also:



802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
# File 'lib/viral_seq/seq_hash.rb', line 802

def nucleotide_pi
  sequences = self.dna_hash.values
  seq_length = sequences[0].size - 1
  nt_position_hash = {}
  (0..seq_length).each do |n|
    nt_position_hash[n] = []
    sequences.each do |s|
      nt_position_hash[n] << s[n]
    end
  end
  diver = 0
  com = 0
  nt_position_hash.each do |_p,nt|
    nt.delete_if {|n| n =~ /[^A|^C|^G|^T]/}
    next if nt.size == 1
    nt_count = nt.count_freq
    combination = (nt.size)*(nt.size - 1)/2
    com += combination
    a = nt_count["A"]
    c = nt_count["C"]
    t = nt_count["T"]
    g = nt_count["G"]
    div = a*c + a*t + a*g + c*t + c*g + t*g
    diver += div
  end
  pi = (diver/com.to_f).round(5)
  return pi
end

#poisson_minority_cutoff(error_rate = 0.0001, fold_cutoff = 20) ⇒ Integer Also known as: pm

Define Poission cut-off for minority variants.

Examples:

obtain Poisson minority cut-off from the example sequence FASTA file.

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_sequence_for_poisson.fasta')
my_seqhash.pm
=> 2 # means that mutations appear at least 2 times are very likely to be a true mutation instead of random methods errors.

Parameters:

  • error_rate (Float) (defaults to: 0.0001)

    estimated sequencing error rate

  • fold_cutoff (Integer) (defaults to: 20)

    a fold cut-off to determine poisson minority cut-off. default = 20. i.e. <5% mutations from random methods error.

Returns:

  • (Integer)

    a cut-off for minority variants (>=).

See Also:



596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
# File 'lib/viral_seq/seq_hash.rb', line 596

def poisson_minority_cutoff(error_rate = 0.0001, fold_cutoff = 20)
  sequences = self.dna_hash.values
  if sequences.size == 0
    return 0
  else
    cut_off = Float::INFINITY
    l = sequences[0].size
    rate = sequences.size * error_rate
    count_mut = variant_for_poisson(sequences)
    max_count = count_mut.keys.max
    poisson_hash = ViralSeq::Math::PoissonDist.new(rate, max_count).poisson_hash

    poisson_hash.each do |k,v|
      cal = l * v
      obs = count_mut[k] ? count_mut[k] : 1
      if obs >= fold_cutoff * cal
        cut_off = k
        break
      end
    end
    return cut_off
  end
end

#qc_indelHash

QC for each nucleotide sequence comparing with sample consensus for indels

Returns:

  • (Hash)

    object containing two SeqHash seq_hash, has_indel: seq_hash



1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
# File 'lib/viral_seq/seq_hash.rb', line 1313

def qc_indel
  con = self.consensus
  dna_hash = self.dna_hash
  names_passed = []
  names_indel = []
  dna_hash.uniq_hash.each do |seq, names|
    if seq.compare_with(con) < 4
      names_passed += names
    elsif ViralSeq::Muscle.align(con, seq)[0]["-"]
      names_indel += names
    else
      names_passed += names
    end
  end
  return {no_indel: self.sub(names_passed),
    has_indel: self.sub(names_indel)}
end

#random_select(n = 100) ⇒ ViralSeq::SeqHash

randomly select n number of sequences from the orginal SeqHash object

Parameters:

  • n (Integer) (defaults to: 100)

    number of sequences to randomly select

Returns:



1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
# File 'lib/viral_seq/seq_hash.rb', line 1292

def random_select(n = 100)
  new_sh = ViralSeq::SeqHash.new
  dna_hash = self.dna_hash
  aa_hash = self.aa_hash
  qc_hash = self.qc_hash

  keys = dna_hash.keys.sample(n)

  keys.each do |k|
    new_sh.dna_hash[k] = dna_hash[k]
    new_sh.aa_hash[k] = aa_hash[k]
    new_sh.qc_hash[k] = qc_hash[k]
  end
  new_sh.file = self.file
  new_sh.title = self.title + "_" + n.to_s
  return new_sh
end

#sample(n = 1) ⇒ ViralSeq::SeqHash

sample a certain number of sequences from a SeqHash object

Parameters:

  • n (Integer) (defaults to: 1)

    number of sequences to sample

Returns:



196
197
198
199
200
201
202
203
204
205
206
207
208
209
# File 'lib/viral_seq/seq_hash.rb', line 196

def sample(n = 1)
  keys = self.dna_hash.keys
  sampled_keys = keys.sample(n)
  sampled_nt = {}
  sampled_aa = {}
  sampled_qc = {}
  sampled_title = self.title + "_sampled_" + n.to_s
  sampled_keys.each do |k|
    sampled_nt[k] = self.dna_hash[k]
    sampled_aa[k] = self.aa_hash[k]
    sampled_qc[k] = self.qc_hash[k]
  end
  return ViralSeq::SeqHash.new(sampled_nt, sampled_aa, sampled_qc, sampled_title, self.file)
end

#sdrm_hiv_in(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ Array

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV IN region.

works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
IN codon 53-174

Parameters:

  • cutoff (Integer) (defaults to: 0)

    cut-off for minimal abundance of a mutation to be called as valid mutation, can be obtained using ViralSeq::SeqHash#poisson_minority_cutoff function

  • fdr (Hash)

    hash of events => (false detecton rate) can be obtained using ViralSeq::SeqHash#fdr

Returns:

  • (Array)

    three elements ‘[point_mutation_list, linkage_list, report_list]`

    # point_mutation_list: two demensional array for the following information,

    # [region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label]
    

    # linkage_list: two demensional array for the following information,

    # [region,tcs_number,linkage,count,%,CI_low,CI_high,label]
    

    # report_list: two demensional array for the following information,

    # [position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*]
    


375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
# File 'lib/viral_seq/hivdr.rb', line 375

def sdrm_hiv_in(cutoff = 0, fdr_hash = Hash.new(0))
  sequences = self.dna_hash
  region = "IN"
  rf_label = 2
  start_codon_number = 53
  n_seq = sequences.size
  mut = {}
  mut_com = []
  aa = {}
  point_mutation_list = []
  sequences.each do |name,seq|
    s = ViralSeq::Sequence.new(name,seq)
    s.translate(rf_label)
    aa[name] = s.aa_string
    record = s.sdrm(:INSTI, start_codon_number)
    mut_com << record
    record.each do |position,mutation|
      if mut[position]
        mut[position][1] << mutation[1]
      else
        mut[position] = [mutation[0],[]]
        mut[position][1] << mutation[1]
      end
    end
  end

  mut.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      fdr = fdr_hash[number].round(5)
      label = number < cutoff ? "*" : ""
      point_mutation_list << [region, n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
    end
  end
  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = v < cutoff ? "*" : ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  aa_start = start_codon_number

  aa_size = aa.values[0].size - 1

  (0..aa_size).to_a.each do |p|
    aas = []
    aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]
end

#sdrm_hiv_pr(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ Array

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV PR region.

works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
PR codon 1-99
RT codon 34-122 (HXB2 2649-2914) and 152-236(3001-3257)
IN codon 53-174 (HXB2 4384-4751)

Examples:

identify SDRMs from a FASTA sequence file of HIV PR sequences obtained after MPID-DR sequencing

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_dr_sequences/pr.fasta')
p_cut_off = my_seqhash.pm
fdr_hash = my_seqhash.fdr
pr_sdrm = my_seqhash.sdrm_hiv_pr(p_cut_off, fdr_hash)
puts "region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label"; pr_sdrm[0].each {|n| puts n.join(',')}
=> region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label
=> PR,396,30,D,N,247,0.62374,0.57398,0.67163,0,
=> PR,396,50,I,V,1,0.00253,6.0e-05,0.01399,0.18905,*
=> PR,396,88,N,D,246,0.62121,0.57141,0.66919,0,

puts "region,tcs_number,linkage,count,%,CI_low,CI_high,label"; pr_sdrm[1].each {|n| puts n.join(',')}
=> region,tcs_number,linkage,count,%,CI_low,CI_high,label
=> PR,396,D30N+N88D,245,0.61869,0.56884,0.66674,
=> PR,396,WT,149,0.37626,0.32837,0.42602,
=> PR,396,D30N,1,0.00253,6.0e-05,0.01399,*
=> PR,396,D30N+I50V+N88D,1,0.00253,6.0e-05,0.01399,*

puts "position,codon,tcs_number," + ViralSeq::AMINO_ACID_LIST.join(","); pr_sdrm[2].each {|n|puts n.join(",")}
=> position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*
=> PR,1,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,2,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,3,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,4,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,5,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,6,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0
=> PR,7,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,8,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,9,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,10,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,11,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,12,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.8788,62.1212,0.0,0.0,0.0,0.0
=> PR,13,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38.1313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,61.8687,0.0,0.0,0.0
=> PR,14,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,15,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.3737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.6263,0.0,0.0,0.0
=> PR,16,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,17,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,18,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.4949,0.5051,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,19,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,20,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,21,396,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,22,396,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,23,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,24,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,25,396,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,26,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,27,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,28,396,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,29,396,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,30,396,0.0,0.0,37.6263,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.3737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,31,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,32,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,33,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0
=> PR,34,396,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,35,396,0.0,0.0,62.1212,37.6263,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,36,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0
=> PR,37,396,0.0,0.0,37.8788,61.8687,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,38,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,39,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.4949,0.0,0.0,0.5051,0.0,0.0,0.0,0.0,0.0
=> PR,40,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,41,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,62.1212,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,42,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0
=> PR,43,396,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,44,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,45,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,46,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,47,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,48,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,49,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,50,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0
=> PR,51,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,52,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,53,396,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,54,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,55,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,56,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,57,396,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,99.4949,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,58,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,59,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0
=> PR,60,396,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,61,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,62,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,63,396,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,37.8788,0.0,0.0,61.8687,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,64,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.1212,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,65,396,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,66,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,67,396,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,68,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,69,396,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,70,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,71,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.1212,37.8788,0.0,0.0,0.0
=> PR,72,396,0.0,0.0,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.1212,0.0,0.0,0.0,0.0
=> PR,73,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,74,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,75,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,76,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,77,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,78,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,79,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,80,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,81,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,82,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,83,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.4949,0.0,0.0,0.0,0.5051,0.0,0.0,0.0,0.0,0.0
=> PR,84,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,85,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,86,396,0.0,0.0,0.0,0.5051,0.0,99.4949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,87,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,88,396,0.0,0.0,62.1212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,89,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,90,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,91,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,92,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,93,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,94,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,95,396,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,96,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,97,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,98,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0
=> PR,99,396,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

Parameters:

  • cutoff (Integer) (defaults to: 0)

    cut-off for minimal abundance of a mutation to be called as valid mutation, can be obtained using ViralSeq::SeqHash#poisson_minority_cutoff function

  • fdr (Hash)

    hash of events => (false detecton rate) can be obtained using ViralSeq::SeqHash#fdr

Returns:

  • (Array)

    three elements ‘[point_mutation_list, linkage_list, report_list]`

    # point_mutation_list: two demensional array for the following information,

    # [region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label]
    

    # linkage_list: two demensional array for the following information,

    # [region,tcs_number,linkage,count,%,CI_low,CI_high,label]
    

    # report_list: two demensional array for the following information,

    # [position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*]
    


143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
# File 'lib/viral_seq/hivdr.rb', line 143

def sdrm_hiv_pr(cutoff = 0, fdr_hash = Hash.new(0))
  sequences = self.dna_hash
  region = "PR"
  rf_label = 0
  start_codon_number = 1
  n_seq = sequences.size
  mut = {}
  mut_com = []
  aa = {}
  point_mutation_list = []
  sequences.each do |name,seq|
    s = ViralSeq::Sequence.new(name,seq)
    s.translate(rf_label)
    aa[name] = s.aa_string
    record = s.sdrm(:PI)
    mut_com << record
    record.each do |position,mutation|
      if mut[position]
        mut[position][1] << mutation[1]
      else
        mut[position] = [mutation[0],[]]
        mut[position][1] << mutation[1]
      end
    end
  end
  mut.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      fdr = fdr_hash[number].round(5)
      label = number < cutoff ? "*" : ""
      point_mutation_list << [region, n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
    end
  end
  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = v < cutoff ? "*" : ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  aa_start = start_codon_number

  aa_size = aa.values[0].size - 1

  (0..aa_size).to_a.each do |p|
    aas = []
    aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]
end

#sdrm_hiv_rt(cutoff = 0, fdr_hash = Hash.new(0)) ⇒ Array

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV RT region.

works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
RT codon 34-122, 152-236, two regions are linked

Parameters:

  • cutoff (Integer) (defaults to: 0)

    cut-off for minimal abundance of a mutation to be called as valid mutation, can be obtained using ViralSeq::SeqHash#poisson_minority_cutoff function

  • fdr (Hash)

    hash of events => (false detecton rate) can be obtained using ViralSeq::SeqHash#fdr

Returns:

  • (Array)

    three elements ‘[point_mutation_list, linkage_list, report_list]`

    # point_mutation_list: two demensional array for the following information,

    # [region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,fdr,label]
    

    # linkage_list: two demensional array for the following information,

    # [region,tcs_number,linkage,count,%,CI_low,CI_high,label]
    

    # report_list: two demensional array for the following information,

    # [position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*]
    


237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
# File 'lib/viral_seq/hivdr.rb', line 237

def sdrm_hiv_rt(cutoff = 0, fdr_hash = Hash.new(0))
  sequences = self.dna_hash
  region = "RT"
  rf_label = 1
  start_codon_number = 34
  gap = "AGACTTCAGGAAGTATACTGCATTTACCATACCTAGTATAAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGCTTCCAC"

  n_seq = sequences.size
  mut_nrti = {}
  mut_nnrti = {}
  mut_com = []
  r1_aa = {}
  r2_aa = {}
  point_mutation_list = []
  sequences.each do |name,seq|
    r1 = seq[0,267]
    r2 = seq[267..-1]
    seq = r1 + gap + r2
    s = ViralSeq::Sequence.new(name,seq)
    s.translate(rf_label)

    r1_aa[name] = s.aa_string[0,89]
    r2_aa[name] = s.aa_string[-85..-1]
    nrti = s.sdrm(:nrti, start_codon_number)
    nnrti = s.sdrm(:nnrti, start_codon_number)
    mut_com << (nrti.merge(nnrti))

    nrti.each do |position,mutation|
      if mut_nrti[position]
        mut_nrti[position][1] << mutation[1]
      else
        mut_nrti[position] = [mutation[0],[]]
        mut_nrti[position][1] << mutation[1]
      end
    end
    nnrti.each do |position,mutation|
      if mut_nnrti[position]
        mut_nnrti[position][1] << mutation[1]
      else
        mut_nnrti[position] = [mutation[0],[]]
        mut_nnrti[position][1] << mutation[1]
      end
    end
  end

  mut_nrti.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      fdr = fdr_hash[number].round(5)
      label = number < cutoff ? "*" : ""
      point_mutation_list << ["NRTI", n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
    end
  end

  mut_nnrti.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      fdr = fdr_hash[number].round(5)
      label = number < cutoff ? "*" : ""
      point_mutation_list << ["NNRTI", n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), fdr, label]
    end
  end

  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = v < cutoff ? "*" : ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  r1_aa_start = 34
  r2_aa_start = 152

  r1_aa_size = r1_aa.values[0].size - 1
  r2_aa_size = r2_aa.values[0].size - 1

  (0..r1_aa_size).to_a.each do |p|
    aas = []
    r1_aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[r1_aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    r1_aa_start += 1
  end

  (0..r2_aa_size).to_a.each do |p|
    aas = []
    r2_aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[r2_aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    r2_aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]
end

#sequence_locator(ref_option = :HXB2) ⇒ Array Also known as: loc

sequence locator for SeqHash object, resembling HIV Sequence Locator from LANL

Parameters:

  • ref_option (Symbol) (defaults to: :HXB2)

    , name of reference genomes, options are ‘:HXB2`, `:NL43`, `:MAC239`

Returns:

  • (Array)

    two dimensional array ‘[[],[],[],…]` for each sequence, including the following information:

    title of the SeqHash object (String)

    sequence taxa (String)

    start_location (Integer)

    end_location (Integer)

    percentage_of_similarity_to_reference_sequence (Float)

    containing_indel? (Boolean)

    direction (‘forward’ or ‘reverse’)

    aligned_input_sequence (String)

    aligned_reference_sequence (String)

See Also:



936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
# File 'lib/viral_seq/seq_hash.rb', line 936

def sequence_locator(ref_option = :HXB2)
  out_array = []
  dna_seq = self.dna_hash
  title = self.title

  uniq_dna = dna_seq.uniq_hash

  uniq_dna.each do |seq,names|
    s = ViralSeq::Sequence.new('',seq)
    loc1 = s.locator(ref_option)
    s.rc!
    loc2 = s.locator(ref_option)
    loc1[2] >= loc2[2] ? (direction = :+; loc = loc1): (direction = :-; loc = loc2)

    names.each do |name|
      out_array << ([title, name, ref_option.to_s, direction.to_s] + loc)
    end
  end
  return out_array
end

#shannons_entropy(option = :nt) ⇒ Hash

calculate Shannon’s entropy, Euler’s number as the base of logarithm

Examples:

caculate entropy from the example file

sequence_file = 'spec/sample_files/sample_sequence_alignment_for_entropy.fasta'
sequence_hash = ViralSeq::SeqHash.aa_fa(sequence_file)
entropy_hash = sequence_hash.shannons_entropy(:aa)
entropy_hash[3]
=> 0.0
entropy_hash[14].round(3)
=> 0.639
# This example is the sample input of LANL Entropy-One
# https://www.hiv.lanl.gov/content/sequence/ENTROPY/entropy_one.html?sample_input=1

Parameters:

  • option (Symbol) (defaults to: :nt)

    the sequence type ‘:nt` or `:aa`

Returns:

  • (Hash)

    entropy score at each position in the alignment :position => :entropy , # position starts at 1.

See Also:



768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
# File 'lib/viral_seq/seq_hash.rb', line 768

def shannons_entropy(option = :nt)
  sequences = if option == :aa
                self.aa_hash.values
              else
                self.dna_hash.values
              end
  entropy_hash = {}
  seq_l = sequences[0].size
  (0..(seq_l - 1)).each do |position|
    element = []
    sequences.each do |seq|
      element << seq[position]
    end
    entropy = 0
    element.delete('*')
    element_size = element.size
    element.count_freq.each do |_k,v|
      p = v/element_size.to_f
      entropy += (-p * ::Math.log(p))
    end
    entropy_hash[(position + 1)] = entropy
  end
  return entropy_hash
end

#sizeInteger

the size of nt sequence hash of the SeqHash object

Returns:

  • (Integer)

    size of nt sequence hash of the SeqHash object



174
175
176
# File 'lib/viral_seq/seq_hash.rb', line 174

def size
  self.dna_hash.size
end

#stop_codon(codon_position = 0) ⇒ Hash

screen for sequences with stop codons.

Examples:

given a hash of sequences, return a sub-hash with sequences only contains stop codons

my_seqhash = ViralSeq::SeqHash.fa('my_fasta_file.fasta')
my_seqhash.dna_hash
=> {">seq1"=>"ATAAGAACG", ">seq2"=>"ATATGAACG", ">seq3"=>"ATGAGAACG", ">seq4"=>"TATTAGACG", ">seq5"=>"CGCTGAACG"}
stop_codon_seqhash = my_seqhash.stop_codon[:with_stop_codon]
stop_codon_seqhash.dna_hash
=> {">seq2"=>"ATATGAACG", ">seq4"=>"TATTAGACG", ">seq5"=>"CGCTGAACG"}
stop_codon_seqhash.aa_hash
=> {">seq2"=>"I*T", ">seq4"=>"Y*T", ">seq5"=>"R*T"}
stop_codon_seqhash.title
=> "my_fasta_file_stop"
filtered_seqhash = my_seqhash.stop_codon[:without_stop_codon]
filtered_seqhash.aa_hash
{">seq1"=>"IRT", ">seq3"=>"MRT"}

Parameters:

  • codon_position (Integer) (defaults to: 0)

    option ‘0`, `1` or `2`, indicating 1st, 2nd, 3rd reading frames

Returns:

  • (Hash)

    of two SeqHash objects seqHash, without_stop_codon: seqHash,

    # :with_stop_codon : ViralSeq::SeqHash object with stop codons # :without_stop_codon: ViralSeq::SeqHash object without stop codons



381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
# File 'lib/viral_seq/seq_hash.rb', line 381

def stop_codon(codon_position = 0)
  self.translate(codon_position)
  keys = []
  aa_seqs = self.aa_hash
  aa_seqs.uniq_hash.each do |seq,array_of_name|
    keys += array_of_name if seq.include?('*')
  end
  seqhash1 = self.sub(keys)
  seqhash1.title = self.title + "_stop"
  keys2 = aa_seqs.keys - keys
  seqhash2 = self.sub(keys2)
  return {
    with_stop_codon: seqhash1,
    without_stop_codon: seqhash2
  }
end

#sub(keys) ⇒ SeqHash

given an Array of sequence tags, return a sub ViralSeq::SeqHash object with the sequence tags

Parameters:

  • keys (Array)

    array of sequence tags

Returns:

  • (SeqHash)

    new SeqHash object with sequences of the input keys



341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
# File 'lib/viral_seq/seq_hash.rb', line 341

def sub(keys)
  h1 = {}
  h2 = {}
  h3 = {}

  keys.each do |k|
    dna = self.dna_hash[k]
    next unless dna
    h1[k] = dna
    aa = self.aa_hash[k]
    h2[k] = aa
    qc = self.qc_hash[k]
    h3[k] = qc
  end
  title = self.title
  file = self.file
  ViralSeq::SeqHash.new(h1,h2,h3,title,file)
end

#tn93Hash

TN93 distance functionl, tabulate pairwise comparison of sequence pairs in a sequence alignment, nt sequence only

Examples:

calculate TN93 distribution

sequences = %w{ AAGGCCTT ATGGCCTT AAGGCGTT AAGGCCTT AACGCCTT AAGGCCAT }
my_seqhash = ViralSeq::SeqHash.array(sequences)
my_seqhash.tn93
=> {0=>1, 1=>8, 2=>6}

Returns:

  • (Hash)

    pairwise distance table in Hash object {:diff => :freq, … } # Note: :diff in different positions (Integer), not percentage.



843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
# File 'lib/viral_seq/seq_hash.rb', line 843

def tn93
  sequences = self.dna_hash.values
  diff = []
  seq_hash = sequences.count_freq
  seq_hash.values.each do |v|
    comb = v * (v - 1) / 2
    comb.times {diff << 0}
  end

  seq_hash.keys.combination(2).to_a.each do |pair|
    s1 = pair[0]
    s2 = pair[1]
    diff_temp = s1.compare_with(s2)
    comb = seq_hash[s1] * seq_hash[s2]
    comb.times {diff << diff_temp}
  end

  count_diff = diff.count_freq
  out_hash = Hash.new(0)
  Hash[count_diff.sort_by{|k,_v|k}].each do |k,v|
    out_hash[k] = v
  end
  return out_hash
end

#to_rsphylipString

generate sequences in relaxed sequencial phylip format from a ViralSeq::SeqHash object

Examples:

convert fasta format to relaxed sequencial phylip format

# my_fasta_file.fasta
#   >seq1
#   ATAAGAACG
#   >seq2
#   ATATGAACG
#   >seq3
#   ATGAGAACG
my_seqhash = ViralSeq::SeqHash.fa(my_fasta_file.fasta)
puts my_seqhash.to_rsphylip
  #  3 9
  # seq1       ATAAGAACG
  # seq2       ATATGAACG
  # seq3       ATGAGAACG

Returns:

  • (String)

    relaxed sequencial phylip format in a String object



266
267
268
269
270
271
272
273
274
275
276
277
# File 'lib/viral_seq/seq_hash.rb', line 266

def to_rsphylip
  seqs = self.dna_hash
  outline = "\s" + seqs.size.to_s + "\s" + seqs.values[0].size.to_s + "\n"
  names = seqs.keys
  names.collect!{|n| n.tr(">", "")}
  max_name_l = names.max.size
  max_name_l > 10 ? name_block_l = max_name_l : name_block_l = 10
  seqs.each do |k,v|
    outline += k + "\s" * (name_block_l - k.size + 2) + v.scan(/.{1,10}/).join("\s") + "\n"
  end
  return outline
end

#translate(codon_position = 0) ⇒ NilClass

translate the DNA sequences in @dna_hash to amino acid sequences. generate value for @aa_hash

Examples:

translate dna sequences from a FASTA format sequence file

# my_fasta_file.fasta
#   >seq1
#   ATAAGAACG
#   >seq2
#   ATATGAACG
#   >seq3
#   ATGAGAACG
my_seqhash = ViralSeq::SeqHash.fa(my_fasta_file.fasta)
my_seqhash.translate
my_seqhash.aa_sequence
=> {">seq1"=>"IRT", ">seq2"=>"I*T", ">seq3"=>"MRT"}

Parameters:

  • codon_position (Integer) (defaults to: 0)

    option ‘0`, `1` or `2`, indicating 1st, 2nd, 3rd reading frames

Returns:

  • (NilClass)


295
296
297
298
299
300
301
302
303
304
305
306
# File 'lib/viral_seq/seq_hash.rb', line 295

def translate(codon_position = 0)
  seqs = self.dna_hash
  @aa_hash = {}
  seqs.uniq_hash.each do |seq, array_of_name|
    s = ViralSeq::Sequence.new('name', seq)
    s.translate(codon_position)
    array_of_name.each do |name|
      @aa_hash[name] = s.aa_string
    end
  end
  return nil
end

#trim(start_nt, end_nt, ref_option = :HXB2, path_to_muscle = false) ⇒ ViralSeq::SeqHash

trim dna sequences based on the provided reference coordinates.

Parameters:

  • start_nt (Integer, Range, Array)

    start nt position(s) on the refernce genome, can be single number (Integer) or a range of Integers (Range), or an Array

  • end_nt (Integer, Range, Array)

    end nt position(s) on the refernce genome,can be single number (Integer) or a range of Integers (Range), or an Array

  • ref_option (Symbol) (defaults to: :HXB2)

    , name of reference genomes, options are ‘:HXB2`, `:NL43`, `:MAC239`

  • path_to_muscle (String) (defaults to: false)

    , path to the muscle executable, if not provided, use MuscleBio to run Muscle

Returns:



1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
# File 'lib/viral_seq/seq_hash.rb', line 1339

def trim(start_nt, end_nt, ref_option = :HXB2, path_to_muscle = false)
  seq_hash = self.dna_hash.dup
  seq_hash_unique = seq_hash.uniq_hash
  trimmed_seq_hash = {}
  seq_hash_unique.each do |seq, names|
    trimmed_seq = ViralSeq::Sequence.new('', seq).sequence_clip(start_nt, end_nt, ref_option, path_to_muscle).dna
    names.each do |name|
      trimmed_seq_hash[name] = trimmed_seq
    end
  end
  return_seq_hash = self.dup
  return_seq_hash.dna_hash = trimmed_seq_hash
  return return_seq_hash
end

#uniq_dna_hash(tag = "sequence") ⇒ ViralSeq::SeqHash Also known as: uniq

collapse @dna_hash to unique sequence hash. sequences will be named as (tag + “_” + order(Integer) + “_” + counts(Integer))

Examples:

dna_hash = {'>seq1' => 'AAAA','>seq2' => 'AAAA', '>seq3' => 'AAAA', '>seq4' => 'CCCC', '>seq5' => 'CCCC', '>seq6' => 'TTTT'} }
a_seq_hash = ViralSeq::SeqHash.new
a_seq_hash.dna_hash = dna_hash
uniq_sequence = a_seq_hash.uniq_dna_hash('master')
=> {">master_1_3"=>"AAAA", ">master_2_2"=>"CCCC", ">master_3_1"=>"TTTT"}

Parameters:

  • tag (defaults to: "sequence")

    # the master tag for unique sequences,

Returns:



319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
# File 'lib/viral_seq/seq_hash.rb', line 319

def uniq_dna_hash(tag = "sequence")
  seqs = self.dna_hash
  uni = seqs.values.count_freq
  new_seq = {}
  n = 1
  uni.each do |s,c|
    name = ">" + tag + "_" + n.to_s + "_" + c.to_s
    new_seq[name] = s
    n += 1
  end
  seq_hash = ViralSeq::SeqHash.new(new_seq)
  seq_hash.title = self.title + "_uniq"
  seq_hash.file = self.file
  return seq_hash
end

#write_nt_fa(file) ⇒ NilClass

write the nt sequences to a FASTA format file

Parameters:

  • file (String)

    path to the FASTA output file

Returns:

  • (NilClass)


240
241
242
243
244
245
246
247
# File 'lib/viral_seq/seq_hash.rb', line 240

def write_nt_fa(file)
  File.open(file, 'w') do |f|
    self.dna_hash.each do |k,v|
      f.puts k
      f.puts v
    end
  end
end