Class: Bio::UniProtKB

Inherits:
EMBLDB
Includes:
EMBLDB::Common
Defined in:
lib/bio/db/embl/uniprotkb.rb

Overview

Description

Parser class for UniProtKB/SwissProt and TrEMBL database entries.

See the UniProtKB document files and manuals.

Examples

str = File.read("p53_human.swiss")
obj = Bio::UniProtKB.new(str)
obj.entry_id #=> "P53_HUMAN"

References

Direct Known Subclasses

TrEMBL

Constant Summary

@@entry_regrexp =
/[A-Z0-9]{1,4}_[A-Z0-9]{1,5}/
@@data_class =
["STANDARD", "PRELIMINARY"]
@@ac_regrexp =

Bio::EMBLDB::Common#ac -> ary

#accessions  -> ary
#accession  -> String (accessions.first)
/[OPQ][0-9][A-Z0-9]{3}[0-9]/
@@cc_topics =
['PHARMACEUTICAL',
'BIOTECHNOLOGY',
'TOXIC DOSE', 
'ALLERGEN',   
'RNA EDITING',
'POLYMORPHISM',
'BIOPHYSICOCHEMICAL PROPERTIES',
'MASS SPECTROMETRY',
'WEB RESOURCE', 
'ENZYME REGULATION',
'DISEASE',
'INTERACTION',
'DEVELOPMENTAL STAGE',
'INDUCTION',
'CAUTION',
'ALTERNATIVE PRODUCTS',
'DOMAIN',
'PTM',
'MISCELLANEOUS',
'TISSUE SPECIFICITY',
'COFACTOR',
'PATHWAY',
'SUBUNIT',
'CATALYTIC ACTIVITY',
'SUBCELLULAR LOCATION',
'FUNCTION',
'SIMILARITY']
@@dr_database_identifier =

returns database cross-references in the DR lines.

  • Bio::UniProtKB#dr -> Hash w/in Array

DR Line; database cross-reference (>=0)

  DR  database_identifier; primary_identifier; secondary_identifier.
one cross-reference per line
['EMBL','CARBBANK','DICTYDB','ECO2DBASE',
'ECOGENE',
'FLYBASE','GCRDB','HIV','HSC-2DPAGE','HSSP','INTERPRO','MAIZEDB',
'MAIZE-2DPAGE','MENDEL','MGD','MIM','PDB','PFAM','PIR','PRINTS',
'PROSITE','REBASE','AARHUS/GHENT-2DPAGE','SGD','STYGENE','SUBTILIST',
'SWISS-2DPAGE','TIGR','TRANSFAC','TUBERCULIST','WORMPEP','YEPD','ZFIN']
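
As a rough illustration of what the two pattern constants accept, the sketch below copies the patterns listed above into local constants. The names `ENTRY_NAME_PATTERN` and `ACCESSION_PATTERN`, the helper methods, and the `\A...\z` anchoring are assumptions made for this example and are not part of the class.

```ruby
# Patterns copied from @@entry_regrexp and @@ac_regrexp above.
# Constant names and the \A...\z anchoring are assumptions of this sketch.
ENTRY_NAME_PATTERN = /\A[A-Z0-9]{1,4}_[A-Z0-9]{1,5}\z/
ACCESSION_PATTERN  = /\A[OPQ][0-9][A-Z0-9]{3}[0-9]\z/

# True when the whole string looks like an entry name.
def entry_name?(str)
  ENTRY_NAME_PATTERN.match?(str)
end

# True when the whole string looks like an accession number.
def accession?(str)
  ACCESSION_PATTERN.match?(str)
end

puts entry_name?('P53_HUMAN')
puts accession?('P04637')
puts accession?('X12345')
```

Without the anchors, the patterns would also match anywhere inside a longer string, which may or may not be what a caller wants.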

Constants included from EMBLDB::Common

EMBLDB::Common::DELIMITER, EMBLDB::Common::RS, EMBLDB::Common::TAGSIZE

Instance Method Summary

Methods included from EMBLDB::Common

#ac, #accession, #initialize, #kw, #oc, #og

Methods inherited from EMBLDB

#initialize

Methods inherited from DB

#exists?, #fetch, #get, open, #tags

Instance Method Details

#cc(topic = nil) ⇒ Object

returns contents in the CC lines.

  • Bio::UniProtKB#cc -> Hash

returns an object of contents in the TOPIC.

  • Bio::UniProtKB#cc(TOPIC) -> Array w/in Hash, Hash

returns contents of the “ALTERNATIVE PRODUCTS” topic.

  • Bio::UniProtKB#cc('ALTERNATIVE PRODUCTS') -> Hash

    {'Event' => str, 
     'Named isoforms' => int,  
     'Comment' => str,
     'Variants'=>[{'Name' => str, 'Synonyms' => str, 'IsoId' => str, 'Sequence' => []}]}
    
    CC   -!- ALTERNATIVE PRODUCTS:
    CC       Event=Alternative splicing; Named isoforms=15;
    ...
    CC         placentae isoforms. All tissues differentially splice exon 13;
    CC       Name=A; Synonyms=no del;
    CC         IsoId=P15529-1; Sequence=Displayed;
    

returns contents of the “DATABASE” topic.

  • Bio::UniProtKB#cc('DATABASE') -> Array

    [{'NAME'=>str,'NOTE'=>str, 'WWW'=>URI,'FTP'=>URI}, ...]
    
    CC   -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].
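
The DATABASE layout above can be parsed roughly as follows. This is a standalone sketch modelled on the `when 'DATABASE'` branch of the source listing; `parse_database_comment` and the sample comment are hypothetical, and the sketch strips only a trailing period, whereas the listing removes the last character whatever it is.

```ruby
# Hypothetical DATABASE comment body (topic name already removed).
SAMPLE_DATABASE_COMMENT =
  'NAME=ExampleDB; NOTE=Hypothetical resource; WWW="http://example.org/db".'

# Turns one DATABASE comment into a Hash with the four documented keys.
def parse_database_comment(comment)
  record = { 'NAME' => nil, 'NOTE' => nil, 'WWW' => nil, 'FTP' => nil }
  comment.sub(/\.\z/, '').split(/;/).each do |field|
    case field
    when /NAME=(.+)/  then record['NAME'] = $1
    when /NOTE=(.+)/  then record['NOTE'] = $1
    when /WWW="(.+)"/ then record['WWW']  = $1
    when /FTP="(.+)"/ then record['FTP']  = $1
    end
  end
  record
end

record = parse_database_comment(SAMPLE_DATABASE_COMMENT)
puts record['NAME']
puts record['WWW']
```

Fields that are absent from the comment, such as FTP here, stay `nil`.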
    

returns contents of the “MASS SPECTROMETRY” topic.

  • Bio::UniProtKB#cc('MASS SPECTROMETRY') -> Array

    [{'MW'=>float,'MW_ERR'=>float, 'METHOD'=>str,'RANGE'=>str}, ...]
    
    CC   -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX][; METHOD=XX][;RANGE=XX-XX].
    

CC lines (>=0, optional)

CC   -!- TISSUE SPECIFICITY: HIGHEST LEVELS FOUND IN TESTIS. ALSO PRESENT
CC       IN LIVER, KIDNEY, LUNG AND BRAIN.

CC   -!- TOPIC: FIRST LINE OF A COMMENT BLOCK;
CC       SECOND AND SUBSEQUENT LINES OF A COMMENT BLOCK.

See also www.expasy.org/sprot/userman.html#CC_line
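
The way a comment block is divided into topics can be sketched without the library. The example assumes the CC text has already been reduced to a single string with the `CC` tags removed, which is what `fetch('CC')` appears to provide in the source listing. `split_cc_topics` and `CC_TEXT` are hypothetical names and the sample text is made up.

```ruby
# Hypothetical stand-in for the string returned by fetch('CC'); the real
# input and its whitespace handling are assumptions of this sketch.
CC_TEXT = '-!- FUNCTION: Acts as a tumor suppressor. ' \
          '-!- SUBUNIT: Binds DNA as a homotetramer. ' \
          '-!- FUNCTION: Induces growth arrest.'

# Groups comment bodies by topic, keeping repeated topics in order.
def split_cc_topics(cc_text)
  topics = {}
  # Drop the first delimiter, then cut the text at every remaining one.
  cc_text.sub(/-!- /, '').split(/-!- /).each do |block|
    block = block.strip
    unless /\A([A-Z ]+[A-Z]): (.+)/m =~ block
      raise ArgumentError, "unparsable CC block: #{block.inspect}"
    end
    key, body = $1, $2.strip
    (topics[key] ||= []) << body
  end
  topics
end

topics = split_cc_topics(CC_TEXT)
puts topics.keys.inspect
puts topics['FUNCTION'].length
```

Each topic maps to an Array because a topic may occur more than once in an entry.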



# File 'lib/bio/db/embl/uniprotkb.rb', line 806

def cc(topic = nil)
  unless @data['CC']
    cc  = Hash.new
    comment_border= '-' * (77 - 4 + 1)
    dlm = /-!- /

    # 12KD_MYCSM has no CC lines.
    return cc if get('CC').size == 0
    
    cc_raw = fetch('CC')

    # Removing the copyright statement.
    cc_raw.sub!(/ *---.+---/m, '')

    # No CC lines remain once the copyright statement is removed.
    return cc if cc_raw == ''

    begin
      cc_raw, copyright = cc_raw.split(/#{comment_border}/)[0]
      _ = copyright #dummy for suppress "assigned but unused variable"
      cc_raw = cc_raw.sub(dlm,'')
      cc_raw.split(dlm).each do |tmp|
        tmp = tmp.strip

        if /(^[A-Z ]+[A-Z]): (.+)/ =~ tmp
          key  = $1
          body = $2
          body.gsub!(/- (?!AND)/,'-')
          body.strip!
          unless cc[key]
            cc[key] = [body]
          else
            cc[key].push(body)
          end
        else
          raise ["Error: [#{entry_id}]: CC Lines", '"', tmp, '"',
                 '', get('CC'),''].join("\n")
        end
      end
    rescue NameError
      if fetch('CC') == ''
        return {}
      else
        raise ["Error: Invalid CC Lines: [#{entry_id}]: ",
               "\n'#{self.get('CC')}'\n", "(#{$!})"].join
      end
    rescue NoMethodError
    end
    
    @data['CC'] = cc
  end


  case topic
  when 'ALLERGEN'
    return @data['CC'][topic]
  when 'ALTERNATIVE PRODUCTS'
    return cc_alternative_products(@data['CC'][topic])
  when 'BIOPHYSICOCHEMICAL PROPERTIES'
    return cc_biophysiochemical_properties(@data['CC'][topic])
  when 'BIOTECHNOLOGY'
    return @data['CC'][topic]
  when 'CATALYTIC ACTIVITY'
    return cc_catalytic_activity(@data['CC'][topic])
  when 'CAUTION'
    return cc_caution(@data['CC'][topic])
  when 'COFACTOR'
    return @data['CC'][topic]
  when 'DEVELOPMENTAL STAGE'
    return @data['CC'][topic].join('')
  when 'DISEASE'
    return @data['CC'][topic].join('')
  when 'DOMAIN'
    return @data['CC'][topic]
  when 'ENZYME REGULATION'
    return @data['CC'][topic].join('')
  when 'FUNCTION'
    return @data['CC'][topic].join('')
  when 'INDUCTION'
    return @data['CC'][topic].join('')
  when 'INTERACTION'
    return cc_interaction(@data['CC'][topic])
  when 'MASS SPECTROMETRY'
    return cc_mass_spectrometry(@data['CC'][topic])
  when 'MISCELLANEOUS'
    return @data['CC'][topic]
  when 'PATHWAY'
    return cc_pathway(@data['CC'][topic])
  when 'PHARMACEUTICAL'
    return @data['CC'][topic]
  when 'POLYMORPHISM'
    return @data['CC'][topic]
  when 'PTM'
    return @data['CC'][topic]
  when 'RNA EDITING'
    return cc_rna_editing(@data['CC'][topic])
  when 'SIMILARITY'
    return @data['CC'][topic]
  when 'SUBCELLULAR LOCATION'
    return cc_subcellular_location(@data['CC'][topic])
  when 'SUBUNIT'
    return @data['CC'][topic]
  when 'TISSUE SPECIFICITY'
    return @data['CC'][topic]
  when 'TOXIC DOSE'
    return @data['CC'][topic]
  when 'WEB RESOURCE'
    return cc_web_resource(@data['CC'][topic])
  when 'DATABASE'
    # DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].
    tmp = Array.new
    db = @data['CC']['DATABASE']
    return db unless db

    db.each do |e|
      db = {'NAME' => nil, 'NOTE' => nil, 'WWW' => nil, 'FTP' => nil}
      e.sub(/.$/,'').split(/;/).each do |line|
        case line
        when /NAME=(.+)/
          db['NAME'] = $1
        when /NOTE=(.+)/
          db['NOTE'] = $1
        when /WWW="(.+)"/
          db['WWW'] = $1
        when /FTP="(.+)"/
          db['FTP'] = $1
        end 
      end
      tmp.push(db)
    end
    return tmp
  when nil
    return @data['CC']
  else
    return @data['CC'][topic]
  end
end

#de ⇒ Object

Returns the DE line as an Array (new format, since release 14) or a String (old format, before release 14).



# File 'lib/bio/db/embl/uniprotkb.rb', line 333

def de
  return @data['DE'] if @data['DE']
  parsed_de_line = parse_DE_line_rel14(get('DE'))
  case parsed_de_line
  when Array # new format since rel14
    @data['DE'] ||= parsed_de_line
  else
    super
  end
  @data['DE']
end

#dr(key = nil) ⇒ Object

Bio::UniProtKB#dr



# File 'lib/bio/db/embl/uniprotkb.rb', line 1171

def dr(key = nil)
  unless key
    embl_dr
  else
    (embl_dr[key] or []).map {|x|
      {'Accession' => x[0],
       'Version' => x[1],
       ' ' => x[2],
       'Molecular Type' => x[3]}
    }
  end
end

#dt(key = nil) ⇒ Object

returns a Hash of information in the DT lines.

hash keys: 
  ['created', 'sequence', 'annotation']

also Symbols acceptable (ASAP):
  [:created, :sequence, :annotation]

Since UniProtKB release 7.0 of 07-Feb-2006, the DT line format has changed, and the word “annotation” is no longer used in DT lines. Despite the change, the key “annotation” is still used to keep compatibility.

returns a String of information in the DT lines by a given key.

DT Line; date (3/entry)

DT DD-MMM-YYYY (integrated into UniProtKB/XXXXX.)
DT DD-MMM-YYYY (sequence version NN)
DT DD-MMM-YYYY (entry version NN)

The format was changed in UniProtKB release 7.0 of 07-Feb-2006. Below is the older format.

Old format of DT Line; date (3/entry)

DT DD-MMM-YYYY (rel. NN, Created)
DT DD-MMM-YYYY (rel. NN, Last sequence update)
DT DD-MMM-YYYY (rel. NN, Last annotation update)
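
Because the three DT lines are mapped to keys by position, the same logic serves both formats. The sketch below mirrors that idea outside the library; `parse_dt` is a hypothetical helper and the dates in `DT_BLOCK` are made up.

```ruby
# Hypothetical DT block (three lines per entry); the dates are invented.
DT_BLOCK = <<~DT
  DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
  DT   24-NOV-2009, sequence version 4.
  DT   02-OCT-2024, entry version 300.
DT

# Maps the first, second and third DT line to the documented keys.
def parse_dt(dt_block)
  created, sequence, annotation =
    dt_block.split(/\n/).map { |line| line.sub(/\A\w{2}   /, '').strip }
  { 'created' => created, 'sequence' => sequence, 'annotation' => annotation }
end

dates = parse_dt(DT_BLOCK)
puts dates['created']
puts dates['annotation']
```

The third line ends up under 'annotation' even though the current format calls it the entry version, which matches the compatibility note above.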


# File 'lib/bio/db/embl/uniprotkb.rb', line 157

def dt(key = nil)
  return dt[key] if key
  return @data['DT'] if @data['DT']

  part = self.get('DT').split(/\n/)
  @data['DT'] = {
    'created'    => part[0].sub(/\w{2}   /,'').strip,
    'sequence'   => part[1].sub(/\w{2}   /,'').strip,
    'annotation' => part[2].sub(/\w{2}   /,'').strip
  }
end

#embl_dr ⇒ Object

Backup of Bio::EMBLDB#dr as embl_dr



# File 'lib/bio/db/embl/uniprotkb.rb', line 1168

alias :embl_dr :dr

#entry_id ⇒ Object Also known as: entry_name, entry

returns the ENTRY_NAME in the ID line.



# File 'lib/bio/db/embl/uniprotkb.rb', line 98

def entry_id
  id_line('ENTRY_NAME')
end

#ft(feature_key = nil) ⇒ Object

returns contents in the feature table.

Examples

sp = Bio::UniProtKB.new(entry)
ft = sp.ft
ft.class #=> Hash
ft.keys.each do |feature_key|
  ft[feature_key].each do |feature|
    feature['From'] #=> '1'
    feature['To']   #=> '21'
    feature['Description'] #=> ''
    feature['FTId'] #=> ''
    feature['diff'] #=> []
    feature['original'] #=> [feature_key, '1', '21', '', '']
  end
end
  • Bio::UniProtKB#ft -> Hash

    {FEATURE_KEY => [{'From' => int, 'To' => int, 
                      'Description' => aStr, 'FTId' => aStr,
                      'diff' => [original_residues, changed_residues],
                      'original' => aAry }],...}
    

returns an Array of the information about the feature_name in the feature table.

  • Bio::UniProtKB#ft(feature_name) -> Array of Hash

    [{'From' => str, 'To' => str, 'Description' => str, 'FTId' => str},...]
    

FT Line; feature table data (>=0, optional)

Col     Data item
-----   -----------------
 1- 2   FT
 6-13   Feature name 
15-20   `FROM' endpoint
22-27   `TO' endpoint
35-75   Description (>=0 per key)
-----   -----------------

Note: ‘FROM’ and ‘TO’ endpoints may contain non-numerical characters such as ‘<’, ‘>’ or ‘?’ (e.g. ‘<1’, ‘?42’).

See also www.expasy.org/sprot/userman.html#FT_line
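
The column table above can be read as fixed-width slicing. The following sketch only illustrates that layout and is not necessarily how `ft_legacy_parser` works internally; `slice_legacy_ft_line`, the 'Feature' key and the sample line are hypothetical.

```ruby
# Builds a sample line that follows the documented columns:
# 1-2 "FT", 6-13 feature name, 15-20 FROM, 22-27 TO, 35-75 description.
SAMPLE_FT_LINE =
  format('FT   %-8s %6s %6s       %s', 'SIGNAL', '1', '21', 'Potential.')

# Cuts one legacy FT line into its documented columns (0-based offsets).
def slice_legacy_ft_line(line)
  {
    'Feature'     => line[5, 8].to_s.strip,
    'From'        => line[14, 6].to_s.strip,
    'To'          => line[21, 6].to_s.strip,
    'Description' => line[34, 41].to_s.strip
  }
end

feature = slice_legacy_ft_line(SAMPLE_FT_LINE)
puts feature.inspect
```

Endpoints are kept as Strings so that values such as '<1' or '?42' survive unchanged.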



# File 'lib/bio/db/embl/uniprotkb.rb', line 1236

def ft(feature_key = nil)
  return ft[feature_key] if feature_key
  return @data['FT'] if @data['FT']

  ftstr = get('FT')
  ftlines = ftstr.split("\n")
  for i in 0..10 do
    if /^FT +([^\s]+) +(([^\s]+)\:)?([\<\?]?[0-9]+|\?)(?:\.\.([\>\?]?[0-9]+|\?))?\s*$/ =~ ftlines[i] &&
       /^FT +\/([^\s\=]+)(?:\=(\")?(.+)(\")?)?\s*$/ =~ ftlines[i+1] then
      fmt_2019_11 = true
      break #for i
    end
  end #for i

  hash = if fmt_2019_11 then
           ft_2019_11_parser(ftlines)
         else
           ft_legacy_parser(ftlines)
         end
  @data['FT'] = hash
end

#gene_name ⇒ Object

returns a String of the first gene name in the GN line.



# File 'lib/bio/db/embl/uniprotkb.rb', line 448

def gene_name
  (x = self.gene_names) ? x.first : nil
end

#gene_names ⇒ Object

returns an Array of gene names in the GN line.



# File 'lib/bio/db/embl/uniprotkb.rb', line 437

def gene_names
  gn # set @data['GN'] if it hasn't been already done
  if @data['GN'].first.class == Hash then
    @data['GN'].collect { |element| element[:name] }
  else
    @data['GN'].first
  end
end

#gn ⇒ Object

returns gene names in the GN line.

New UniProt/SwissProt format:

  • Bio::UniProtKB#gn -> [ <gene record>* ]

where <gene record> is:

{ :name => '...', 
  :synonyms => [ 's1', 's2', ... ],
  :loci   => [ 'l1', 'l2', ... ],
  :orfs     => [ 'o1', 'o2', ... ] 
}

Old format:

GN Line: Gene name(s) (>=0, optional)



# File 'lib/bio/db/embl/uniprotkb.rb', line 361

def gn
  unless @data['GN']
    case fetch('GN')
    when /Name=/,/ORFNames=/,/OrderedLocusNames=/,/Synonyms=/
      @data['GN'] = gn_uniprot_parser
    else
      @data['GN'] = gn_old_parser
    end
  end
  @data['GN']
end

#hi ⇒ Object

The HI line

Bio::UniProtKB#hi #=> Array of Hash
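
A standalone sketch of how one HI text could be split into the 'Category', 'Keywords' and 'Keyword' fields is given below. It follows the steps of the source listing, except that it limits the category split to the first ': '. `parse_hi` and the sample text are hypothetical.

```ruby
# Hypothetical HI text with two keyword hierarchies.
HI_TEXT = 'Molecular function: Ionic channel; Calcium channel. ' \
          'Biological process: Transport; Ion transport; Calcium transport.'

# Splits each hierarchy into its category, parent keywords and final keyword.
def parse_hi(hi_text)
  hi_text.split(/\. /).map do |hierarchy|
    category, keyword_path = hierarchy.split(': ', 2)
    keywords = keyword_path.split('; ')
    keyword  = keywords.pop.sub(/\.\z/, '')
    { 'Category' => category, 'Keywords' => keywords, 'Keyword' => keyword }
  end
end

parse_hi(HI_TEXT).each { |hierarchy| puts hierarchy.inspect }
```

The last element of each hierarchy becomes 'Keyword', and the remaining ones stay in 'Keywords' as its parents.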



# File 'lib/bio/db/embl/uniprotkb.rb', line 722

def hi
  unless @data['HI']
    @data['HI'] = []
    fetch('HI').split(/\. /).each do |hlist|
      hash = {'Category' => '',  'Keywords' => [], 'Keyword' => ''}
      hash['Category'], hash['Keywords'] = hlist.split(': ')
      hash['Keywords'] = hash['Keywords'].split('; ')
      hash['Keyword'] = hash['Keywords'].pop
      hash['Keyword'].sub!(/\.$/, '')
      @data['HI'] << hash
    end
  end
  @data['HI']
end

#id_line(key = nil) ⇒ Object

returns a Hash of the ID line.

returns the content (Integer or String) of the ID line for a given key. Hash keys: ['ENTRY_NAME', 'DATA_CLASS', 'MOLECULE_TYPE', 'SEQUENCE_LENGTH']

ID Line (since UniProtKB release 9.0 of 31-Oct-2006)

ID   P53_HUMAN               Reviewed;         393 AA.
#"ID  #{ENTRY_NAME} #{DATA_CLASS}; #{SEQUENCE_LENGTH}."

Examples

obj.id_line  #=> {"ENTRY_NAME"=>"P53_HUMAN", "DATA_CLASS"=>"Reviewed", 
                  "SEQUENCE_LENGTH"=>393, "MOLECULE_TYPE"=>nil}

obj.id_line('ENTRY_NAME') #=> "P53_HUMAN"

ID Line (older style)

ID   P53_HUMAN      STANDARD;      PRT;   393 AA.
#"ID  #{ENTRY_NAME} #{DATA_CLASS}; #{MOLECULE_TYPE}; #{SEQUENCE_LENGTH}."

Examples

obj.id_line  #=> {"ENTRY_NAME"=>"P53_HUMAN", "DATA_CLASS"=>"STANDARD", 
                  "SEQUENCE_LENGTH"=>393, "MOLECULE_TYPE"=>"PRT"}

obj.id_line('ENTRY_NAME') #=> "P53_HUMAN"
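
The difference between the two ID line styles comes down to whether the fifth whitespace-separated field is 'AA.'. The sketch below reproduces that check on plain strings, using the two sample lines shown above; `parse_id_line` is a hypothetical standalone helper, not a library method.

```ruby
NEW_ID_LINE = 'ID   P53_HUMAN               Reviewed;         393 AA.'
OLD_ID_LINE = 'ID   P53_HUMAN      STANDARD;      PRT;   393 AA.'

# Parses one ID line in either style into the four documented keys.
def parse_id_line(line)
  part = line.split(/ +/)
  if part[4].to_s.chomp == 'AA.'
    # Current style: no molecule type, the length is the fourth field.
    molecule_type   = nil
    sequence_length = part[3].to_i
  else
    # Older style: the molecule type precedes the length.
    molecule_type   = part[3].sub(/;/, '')
    sequence_length = part[4].to_i
  end
  { 'ENTRY_NAME'      => part[1],
    'DATA_CLASS'      => part[2].sub(/;/, ''),
    'MOLECULE_TYPE'   => molecule_type,
    'SEQUENCE_LENGTH' => sequence_length }
end

puts parse_id_line(NEW_ID_LINE).inspect
puts parse_id_line(OLD_ID_LINE).inspect
```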


# File 'lib/bio/db/embl/uniprotkb.rb', line 73

def id_line(key = nil)
  return id_line[key] if key
  return @data['ID'] if @data['ID']

  part = @orig['ID'].split(/ +/)         
  if part[4].to_s.chomp == 'AA.' then
    # after UniProtKB release 9.0 of 31-Oct-2006
    # (http://www.uniprot.org/docs/sp_news.htm)
    molecule_type   = nil
    sequence_length = part[3].to_i
  else
    molecule_type   = part[3].sub(/;/,'')
    sequence_length = part[4].to_i
  end
  @data['ID'] = {
    'ENTRY_NAME'      => part[1],
    'DATA_CLASS'      => part[2].sub(/;/,''),
    'MOLECULE_TYPE'   => molecule_type,
    'SEQUENCE_LENGTH' => sequence_length
  }
end

#molecule ⇒ Object Also known as: molecule_type

returns the MOLECULE_TYPE in the ID line.

A short-cut for Bio::UniProtKB#id_line('MOLECULE_TYPE').



# File 'lib/bio/db/embl/uniprotkb.rb', line 108

def molecule
  id_line('MOLECULE_TYPE')
end

#oh ⇒ Object

The OH Line;

OH   NCBI_TaxID=TaxID; HostName.

See also br.expasy.org/sprot/userman.html#OH_line
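
For well-formed input, the OH layout can be captured with a single regular expression. This is a simplified sketch rather than the method's implementation: the listing walks the tokens one by one and raises ArgumentError on malformed input, while this version silently skips anything that does not match. `OH_RECORD`, `parse_oh` and the sample text are hypothetical.

```ruby
# One record: a taxonomy id, then an optional host name up to the next record.
OH_RECORD = /NCBI_TaxID=(\d+);\s*([^;]*?)\.?\s*(?=NCBI_TaxID=|\z)/

# Hypothetical OH text with the line tags already removed.
OH_TEXT = 'NCBI_TaxID=9606; Homo sapiens (Human). ' \
          'NCBI_TaxID=9031; Gallus gallus (Chicken).'

# Returns one Hash per host, in the order the hosts appear.
def parse_oh(oh_text)
  oh_text.strip.scan(OH_RECORD).map do |taxid, host_name|
    { 'NCBI_TaxID' => taxid,
      'HostName'   => host_name.empty? ? nil : host_name }
  end
end

parse_oh(OH_TEXT).each { |host| puts host.inspect }
```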



# File 'lib/bio/db/embl/uniprotkb.rb', line 531

def oh
  unless @data['OH']
    oh = []
    a = fetch('OH').split(/(NCBI\_TaxID\=)(\d+)(\;)/)
    t = catch :error do
      taxid = nil
      host_name = nil
      while x = a.shift
        x = x.to_s.strip
        case x
        when ''
          next
        when 'NCBI_TaxID='
          if taxid then
            oh.push({'NCBI_TaxID' => taxid, 'HostName' => host_name})
            taxid = nil
            host_name = nil
          end
          taxid = a.shift
          throw :error, :missing_semicolon if a.shift != ';'
        else
          throw :error, :missing_taxid if host_name
          host_name = x
          host_name.sub!(/\.\z/, '')
        end
      end #while x...
      if taxid then
        oh.push({'NCBI_TaxID' => taxid, 'HostName' => host_name})
      elsif host_name then
        throw :error, :missing_taxid_last
      end
      nil
    end #t = catch...
    if t then
      raise ArgumentError,
            ["Error: Invalid OH line format (#{self.entry_id}):",
             $!, "\n", get('OH'), "\n"].join
    end
    @data['OH'] = oh
  end
  @data['OH']
end

#os(num = nil) ⇒ Object

returns an Array of Hashes, or a String of the OS line when an index is given.

  • Bio::UniProtKB#os -> Array

[{'name' => '(Human)', 'os' => 'Homo sapiens'}, 
 {'name' => '(Rat)', 'os' => 'Rattus norvegicus'}]
  • Bio::UniProtKB#os[0] -> Hash

{'name' => "(Human)", 'os' => 'Homo sapiens'}
  • Bio::UniProtKB#os[0]['name'] -> "(Human)"

  • Bio::UniProtKB#os(0) -> "Homo sapiens (Human)"

OS Line; organism species (>=1)

OS   Genus species (name).
OS   Genus species (name0) (name1).
OS   Genus species (name0) (name1).
OS   Genus species (name0), G s0 (name0), and G s (name0) (name1).
OS   Homo sapiens (Human), and Rattus norvegicus (Rat)
OS   Hippotis sp. Clark and Watts 825.
OS   unknown cyperaceous sp.
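
The splitting of an OS line into organisms can be tried on a plain string. The sketch reuses the two regular expressions from the source listing; `parse_os` and the sample text are hypothetical, and the input is assumed to have its `OS` tag already removed.

```ruby
# Hypothetical OS text with two organisms.
OS_TEXT = 'Homo sapiens (Human), and Rattus norvegicus (Rat).'

# Returns one Hash per organism with its scientific and common name.
def parse_os(os_text)
  os_text.split(/, and|, /).map do |part|
    unless part =~ /(\w+ *[\w \:\'\+\-\.]+[\w\.])/
      raise ArgumentError, "unparsable OS part: #{part.inspect}"
    end
    organism    = $1
    common_name = part[/\(.+\)/] # nil when no parenthesised name is given
    { 'name' => common_name, 'os' => organism }
  end
end

organisms = parse_os(OS_TEXT)
puts organisms.inspect
# Equivalent of os(0): scientific name followed by the common name.
puts "#{organisms[0]['os']} #{organisms[0]['name']}"
```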


# File 'lib/bio/db/embl/uniprotkb.rb', line 470

def os(num = nil)
  unless @data['OS']
    os = Array.new
    fetch('OS').split(/, and|, /).each do |tmp|
      if tmp =~ /(\w+ *[\w \:\'\+\-\.]+[\w\.])/
        org = $1
        tmp =~ /(\(.+\))/ 
        os.push({'name' => $1, 'os' => org})
      else
        raise "Error: OS Line. #{$!}\n#{fetch('OS')}\n"
      end
    end
    @data['OS'] = os
  end

  if num
    # EX. "Trifolium repens (white clover)"
    return "#{@data['OS'][num]['os']} #{@data['OS'][num]['name']}"
  else
    return @data['OS']
  end
end

#ox ⇒ Object

returns a Hash of organism taxonomy cross-references.

  • Bio::UniProtKB#ox -> Hash

    {'NCBI_TaxID' => ['1234','2345','3456','4567'], ...}
    

OX Line; organism taxonomy cross-reference (>=1 per entry)

OX   NCBI_TaxID=1234;
OX   NCBI_TaxID=1234, 2345, 3456, 4567;
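
A standalone sketch of the OX parsing is shown below, assuming the `OX` tag has already been stripped. `parse_ox` is a hypothetical helper; unlike the listing, it skips empty fragments left by the trailing semicolon explicitly.

```ruby
# Maps each database name to the list of identifiers given for it.
def parse_ox(ox_text)
  fields = ox_text.sub(/\.\z/, '').split(/;/).map(&:strip).reject(&:empty?)
  fields.each_with_object({}) do |field, refs|
    db, ids = field.split(/=/, 2)
    refs[db] = ids.split(/, */)
  end
end

puts parse_ox('NCBI_TaxID=1234;').inspect
puts parse_ox('NCBI_TaxID=1234, 2345, 3456, 4567;').inspect
```

The identifiers are kept as Strings, as in the documented return value.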


# File 'lib/bio/db/embl/uniprotkb.rb', line 514

def ox
  unless @data['OX']
    tmp = fetch('OX').sub(/\.$/,'').split(/;/).map { |e| e.strip }
    hsh = Hash.new
    tmp.each do |e|
      db,refs = e.split(/=/)
      hsh[db] = refs.split(/, */)
    end
    @data['OX'] = hsh
  end
  return @data['OX']
end

#protein_name ⇒ Object

returns the proposed official name of the protein. Returns a String.

Since UniProtKB release 14.0 of 22-Jul-2008, the DE line format has been changed. The method returns the full name, which is taken from the “RecName: Full=” or “SubName: Full=” line, normally at the beginning of the DE lines. Unlike the parser for the old format, it applies no special treatment to fragments or precursors.

For the old format, the method parses the DE lines and returns the protein name as a String.

DE Line; description (>=1)

"DE #{OFFICIAL_NAME} (#{SYNONYM})"
"DE #{OFFICIAL_NAME} (#{SYNONYM}) [CONTAINS: #1; #2]."
OFFICIAL_NAME  1/entry
SYNONYM        >=0
CONTAINS       >=0
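
For the old DE layout, the official name and the synonyms can be separated with a few string operations. The sketch below follows the old-format branches of the `protein_name` and `synonyms` listings; `old_de_protein_name`, `old_de_synonyms` and the sample lines are hypothetical.

```ruby
# Hypothetical old-format DE texts (tags removed).
OLD_DE_LINE      = 'Cellular tumor antigen p53 (Tumor suppressor p53) (Phosphoprotein p53).'
OLD_DE_FRAGMENT  = 'Example protein (Fragment).'

# Official name: the text before the first "(" and before any "[...]" part.
def old_de_protein_name(de_line)
  head = de_line[/\A[^\[]*/]
  name = head[/\A[^(]*/].strip
  name += ' (Fragment)' if head =~ /fragment/i
  name
end

# Synonyms: each parenthesised name, ignoring "[...]" and "(Fragment)".
def old_de_synonyms(de_line)
  de_line.sub(/\[.*\]/, '').scan(/\(([^)]+)/).flatten
         .map(&:strip).reject { |synonym| synonym =~ /fragment/i }
end

puts old_de_protein_name(OLD_DE_LINE)
puts old_de_synonyms(OLD_DE_LINE).inspect
puts old_de_protein_name(OLD_DE_FRAGMENT)
```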


# File 'lib/bio/db/embl/uniprotkb.rb', line 250

def protein_name
  parsed_de_line = self.de
  if parsed_de_line.kind_of?(Array) then
    # since UniProtKB release 14.0 of 22-Jul-2008
    name = nil
    parsed_de_line.each do |a|
      case a[0]
      when 'RecName', 'SubName'
        if name_pair = a[1..-1].find { |b| b[0] == 'Full' } then
          name = name_pair[1]
          break
        end
      end
    end
    name = name.to_s
  else
    # old format (before Rel. 13.x)
    name = ""
    if de_line = fetch('DE') then
      str = de_line[/^[^\[]*/] # everything preceding the first [ (the "contains" part)
      name = str[/^[^(]*/].strip
      name << ' (Fragment)' if str =~ /fragment/i
    end
  end
  return name
end

#ref ⇒ Object

returns contents in the R lines.

  • Bio::EMBLDB::Common#ref -> [ <reference information Hash>* ]

where <reference information Hash> is:

{'RN' => '', 'RC' => '', 'RP' => '', 'RX' => '', 
 'RA' => '', 'RT' => '', 'RL' => '', 'RG' => ''}

R Lines

  • RN RC RP RX RA RT RL RG



# File 'lib/bio/db/embl/uniprotkb.rb', line 588

def ref
  unless @data['R']
    @data['R'] = [get('R').split(/\nRN   /)].flatten.map { |str|
      hash = {'RN' => '', 'RC' => '', 'RP' => '', 'RX' => '', 
             'RA' => '', 'RT' => '', 'RL' => '', 'RG' => ''}
      str = 'RN   ' + str unless /^RN   / =~ str

      str.split("\n").each do |line|
        if /^(R[NPXARLCTG])   (.+)/ =~ line
          hash[$1] += $2 + ' '
        else
          raise "Invalid format in R lines, \n[#{line}]\n"
        end
      end

      hash['RN'] = set_RN(hash['RN'])
      hash['RC'] = set_RC(hash['RC'])
      hash['RP'] = set_RP(hash['RP'])
      hash['RX'] = set_RX(hash['RX'])
      hash['RA'] = set_RA(hash['RA'])
      hash['RT'] = set_RT(hash['RT'])
      hash['RL'] = set_RL(hash['RL'])
      hash['RG'] = set_RG(hash['RG'])

      hash
    }

  end
  @data['R']
end

#references ⇒ Object

returns a Bio::References object built from Bio::EMBLDB::Common#ref.

  • Bio::UniProtKB#references -> Bio::References
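
The RL handling in the source listing depends on one regular expression; a value that does not match it is kept whole as the journal. The sketch isolates that step. `RL_PATTERN` is copied from the listing, while `parse_rl` and both sample strings are made up for illustration.

```ruby
# Pattern copied from the 'RL' branch of the source listing.
RL_PATTERN = /(.*) (\d+) \((\d+)\), (\d+-\d+) \((\d+)\)$/

# Splits an RL value into citation fields, or keeps it whole as the journal.
def parse_rl(value)
  if (m = RL_PATTERN.match(value))
    { 'journal' => m[1], 'volume' => m[2], 'issue' => m[3],
      'pages' => m[4], 'year' => m[5] }
  else
    { 'journal' => value }
  end
end

puts parse_rl('J. Example Biol. 12 (3), 100-110 (1999)').inspect
puts parse_rl('Submitted (JAN-2000) to an example database').inspect
```

Only values in the "journal volume (issue), pages (year)" layout are split, so other RL layouts end up with just the 'journal' field set.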



# File 'lib/bio/db/embl/uniprotkb.rb', line 682

def references
  unless @data['references']
    ary = self.ref.map {|ent|
      hash = Hash.new('')
      ent.each {|key, value|
        case key
        when 'RA'
          hash['authors'] = value.split(/, /)
        when 'RT'
          hash['title'] = value
        when 'RL'
          if value =~ /(.*) (\d+) \((\d+)\), (\d+-\d+) \((\d+)\)$/
            hash['journal'] = $1
            hash['volume']  = $2
            hash['issue']   = $3
            hash['pages']   = $4
            hash['year']    = $5
          else
            hash['journal'] = value
          end
        when 'RX'  # PUBMED, MEDLINE, DOI
          value.each do |tag, xref|
            hash[ tag.downcase ]  = xref
          end
        end
      }
      Reference.new(hash)
    }
    @data['references'] = References.new(ary)
  end
  @data['references']
end

#seq ⇒ Object Also known as: aaseq

returns a Bio::Sequence::AA of the amino acid sequence.

  • Bio::UniProtKB#seq -> Bio::Sequence::AA

blank Line; sequence data (>=1)



# File 'lib/bio/db/embl/uniprotkb.rb', line 1464

def seq
  unless @data['']
    @data[''] = Sequence::AA.new( fetch('').gsub(/ |\d+/,'') )
  end
  return @data['']
end

#sequence_length ⇒ Object Also known as: aalen

returns the SEQUENCE_LENGTH in the ID line.

A short-cut for Bio::UniProtKB#id_line('SEQUENCE_LENGTH').



# File 'lib/bio/db/embl/uniprotkb.rb', line 117

def sequence_length
  id_line('SEQUENCE_LENGTH')
end

#set_RN(data) ⇒ Object



# File 'lib/bio/db/embl/uniprotkb.rb', line 619

def set_RN(data)
  data.strip
end

#sq(key = nil) ⇒ Object

returns a Hash of the contents of the SQ lines.

  • Bio::UniProtKB#sq -> hsh

returns the value for a given key in the SQ lines.

  • Bio::UniProtKB#sq(key) -> int or str

  • Keys: ['MW', 'mw', 'molecular', 'weight', 'aalen', 'len', 'length', 'CRC64']
    

SQ Line; sequence header (1/entry)

SQ   SEQUENCE   233 AA;  25630 MW;  146A1B48A1475C86 CRC64;
SQ   SEQUENCE  \d+ AA; \d+ MW;  [0-9A-Z]+ CRC64;

MW, Dalton unit. CRC64 (64-bit Cyclic Redundancy Check, ISO 3309).
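
The SQ header can be picked apart with one regular expression. The sketch below works on the raw sample line shown above, so it allows several spaces between fields, whereas the pattern in the source listing expects single spaces. `parse_sq` is a hypothetical standalone helper.

```ruby
SQ_LINE = 'SQ   SEQUENCE   233 AA;  25630 MW;  146A1B48A1475C86 CRC64;'

# Extracts the length, molecular weight and checksum from an SQ header.
def parse_sq(sq_text)
  unless sq_text =~ /(\d+) AA; +(\d+) MW; +(.+) CRC64;/
    raise ArgumentError, "Invalid SQ line: #{sq_text.inspect}"
  end
  { 'aalen' => $1.to_i, 'MW' => $2.to_i, 'CRC64' => $3 }
end

header = parse_sq(SQ_LINE)
puts header.inspect
```

The length and the molecular weight are converted to Integers, while the CRC64 checksum stays a String.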



# File 'lib/bio/db/embl/uniprotkb.rb', line 1436

def sq(key = nil)
  unless @data['SQ']
    if fetch('SQ') =~ /(\d+) AA\; (\d+) MW; (.+) CRC64;/
      @data['SQ'] = { 'aalen' => $1.to_i, 'MW' => $2.to_i, 'CRC64' => $3 }
    else
      raise "Invalid SQ Line: \n'#{fetch('SQ')}'"
    end
  end

  if key
    case key
    when /mw/, /molecular/, /weight/
      @data['SQ']['MW']
    when /len/, /length/, /AA/
      @data['SQ']['aalen']
    else
      @data['SQ'][key]
    end
  else 
    @data['SQ']
  end
end

#synonyms ⇒ Object

returns synonyms (unofficial and/or alternative names). Returns an Array containing String objects.

Since UniProtKB release 14.0 of 22-Jul-2008, the DE line format has been changed. The method returns the full or short names taken from “RecName: Short=”, “RecName: EC=”, and AltName lines, except those after “Contains:” or “Includes:”. For compatibility with the old format parser, “RecName: EC=N.N.N.N” is reported as “EC N.N.N.N”. In addition, to prevent confusion, “Allergen=” and “CD_antigen=” prefixes are added for the corresponding fields.

For the old format, the method parses the DE lines and returns synonyms. Synonyms are each placed in () following the official name on the DE line.



# File 'lib/bio/db/embl/uniprotkb.rb', line 291

def synonyms
  ary = Array.new
  parsed_de_line = self.de
  if parsed_de_line.kind_of?(Array) then
    # since UniProtKB release 14.0 of 22-Jul-2008
    parsed_de_line.each do |a|
      case a[0]
      when 'Includes', 'Contains'
        break #the each loop
      when 'RecName', 'SubName', 'AltName'
        a[1..-1].each do |b|
          if name = b[1] and b[1] != self.protein_name then
            case b[0]
            when 'EC'
              name = "EC " + b[1]
            when 'Allergen', 'CD_antigen'
              name = b[0] + '=' + b[1]
            else
              name = b[1]
            end
            ary.push name
          end
        end
      end #case a[0]
    end #parsed_de_line.each
  else
    # old format (before Rel. 13.x)
    if de_line = fetch('DE') then
      line = de_line.sub(/\[.*\]/,'') # ignore stuff between [ and ].  That's the "contains" part
      line.scan(/\([^)]+/) do |synonym|
        unless synonym =~ /fragment/i then
          ary << synonym[1..-1].strip # index to remove the leading (
        end
      end
    end
  end
  return ary
end