Class: Bio::UniProtKB
- Includes:
- EMBLDB::Common
- Defined in:
- lib/bio/db/embl/uniprotkb.rb
Overview
Description
Parser class for UniProtKB/SwissProt and TrEMBL database entries.
See the UniProtKB document files and manuals.
Examples
str = File.read("p53_human.swiss")
obj = Bio::UniProtKB.new(str)
obj.entry_id #=> "P53_HUMAN"
References
-
The UniProt Knowledgebase (UniProtKB) www.uniprot.org/help/uniprotkb
-
The Universal Protein Resource (UniProt) uniprot.org/
-
The UniProtKB/SwissProt/TrEMBL User Manual www.uniprot.org/docs/userman.html
Direct Known Subclasses
Constant Summary
- @@entry_regrexp =
/[A-Z0-9]{1,4}_[A-Z0-9]{1,5}/
- @@data_class =
["STANDARD", "PRELIMINARY"]
- @@ac_regrexp =
Bio::EMBLDB::Common#ac -> ary
#accessions -> ary #accession -> String (accessions.first)
/[OPQ][0-9][A-Z0-9]{3}[0-9]/
- @@cc_topics =
['PHARMACEUTICAL', 'BIOTECHNOLOGY', 'TOXIC DOSE', 'ALLERGEN', 'RNA EDITING', 'POLYMORPHISM', 'BIOPHYSICOCHEMICAL PROPERTIES', 'MASS SPECTROMETRY', 'WEB RESOURCE', 'ENZYME REGULATION', 'DISEASE', 'INTERACTION', 'DEVELOPMENTAL STAGE', 'INDUCTION', 'CAUTION', 'ALTERNATIVE PRODUCTS', 'DOMAIN', 'PTM', 'MISCELLANEOUS', 'TISSUE SPECIFICITY', 'COFACTOR', 'PATHWAY', 'SUBUNIT', 'CATALYTIC ACTIVITY', 'SUBCELLULAR LOCATION', 'FUNCTION', 'SIMILARITY']
- @@dr_database_identifier =
returns database cross-references in the DR lines.
-
Bio::UniProtKB#dr -> Hash w/in Array
DR Line; database cross-reference (>=0)
DR database_identifier; primary_identifier; secondary_identifier. One cross-reference per line.
-
['EMBL','CARBBANK','DICTYDB','ECO2DBASE', 'ECOGENE', 'FLYBASE','GCRDB','HIV','HSC-2DPAGE','HSSP','INTERPRO','MAIZEDB', 'MAIZE-2DPAGE','MENDEL','MGD''MIM','PDB','PFAM','PIR','PRINTS', 'PROSITE','REBASE','AARHUS/GHENT-2DPAGE','SGD','STYGENE','SUBTILIST', 'SWISS-2DPAGE','TIGR','TRANSFAC','TUBERCULIST','WORMPEP','YEPD','ZFIN']
Constants included from EMBLDB::Common
EMBLDB::Common::DELIMITER, EMBLDB::Common::RS, EMBLDB::Common::TAGSIZE
Instance Method Summary
-
#cc(topic = nil) ⇒ Object
returns contents in the CC lines.
-
#de ⇒ Object
Returns an Array (for new format since rel 14) or a String (for old format before rel 14) for the DE line.
-
#dr(key = nil) ⇒ Object
Bio::UniProtKB#dr.
-
#dt(key = nil) ⇒ Object
returns a Hash of information in the DT lines.
-
#embl_dr ⇒ Object
Backup Bio::EMBLDB#dr as embl_dr.
-
#entry_id ⇒ Object
(also: #entry_name, #entry)
returns the ENTRY_NAME in the ID line.
-
#ft(feature_key = nil) ⇒ Object
returns contents in the feature table.
-
#gene_name ⇒ Object
returns a String of the first gene name in the GN line.
-
#gene_names ⇒ Object
returns an Array of gene names in the GN line.
-
#gn ⇒ Object
returns gene names in the GN line.
-
#hi ⇒ Object
The HI line Bio::UniProtKB#hi #=> hash.
-
#id_line(key = nil) ⇒ Object
returns a Hash of the ID line.
-
#molecule ⇒ Object
(also: #molecule_type)
returns the MOLECULE_TYPE in the ID line.
-
#oh ⇒ Object
The OH line.
-
#os(num = nil) ⇒ Object
returns an Array of Hashes, or a String of the OS line when an index is given.
-
#ox ⇒ Object
returns a Hash of organism taxonomy cross-references.
-
#protein_name ⇒ Object
returns the proposed official name of the protein.
-
#ref ⇒ Object
returns contents in the R lines.
-
#references ⇒ Object
returns a Bio::References object built from Bio::EMBLDB::Common#ref.
-
#seq ⇒ Object
(also: #aaseq)
returns a Bio::Sequence::AA of the amino acid sequence.
-
#sequence_length ⇒ Object
(also: #aalen)
returns the SEQUENCE_LENGTH in the ID line.
- #set_RN(data) ⇒ Object
-
#sq(key = nil) ⇒ Object
returns a Hash of contents in the SQ lines.
-
#synonyms ⇒ Object
returns synonyms (unofficial and/or alternative names).
Methods included from EMBLDB::Common
#ac, #accession, #initialize, #kw, #oc, #og
Methods inherited from EMBLDB
Methods inherited from DB
#exists?, #fetch, #get, open, #tags
Instance Method Details
#cc(topic = nil) ⇒ Object
returns contents in the CC lines.
-
Bio::UniProtKB#cc -> Hash
returns an object of contents in the TOPIC.
-
Bio::UniProtKB#cc(TOPIC) -> Array w/in Hash, Hash
returns contents of the “ALTERNATIVE PRODUCTS”.
-
Bio::UniProtKB#cc('ALTERNATIVE PRODUCTS') -> Hash
{'Event' => str,
 'Named isoforms' => int,
 'Comment' => str,
 'Variants' => [{'Name' => str, 'Synonyms' => str, 'IsoId' => str, 'Sequence' => []}]}
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=15;
...
CC placentae isoforms. All tissues differentially splice exon 13;
CC Name=A; Synonyms=no del;
CC IsoId=P15529-1; Sequence=Displayed;
returns contents of the “DATABASE”.
-
Bio::UniProtKB#cc('DATABASE') -> Array
[{'NAME' => str, 'NOTE' => str, 'WWW' => URI, 'FTP' => URI}, ...]
CC -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].
returns contents of the “MASS SPECTROMETRY”.
-
Bio::UniProtKB#cc('MASS SPECTROMETRY') -> Array
[{'MW' => float, 'MW_ERR' => float, 'METHOD' => str, 'RANGE' => str}, ...]
CC -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX][; METHOD=XX][;RANGE=XX-XX].
CC lines (>=0, optional)
CC -!- TISSUE SPECIFICITY: HIGHEST LEVELS FOUND IN TESTIS. ALSO PRESENT
CC IN LIVER, KIDNEY, LUNG AND BRAIN.
CC -!- TOPIC: FIRST LINE OF A COMMENT BLOCK;
CC SECOND AND SUBSEQUENT LINES OF A COMMENT BLOCK.
# File 'lib/bio/db/embl/uniprotkb.rb', line 806

def cc(topic = nil)
  unless @data['CC']
    cc = Hash.new
    comment_border= '-' * (77 - 4 + 1)
    dlm = /-!- /

    # 12KD_MYCSM has no CC lines.
    return cc if get('CC').size == 0

    cc_raw = fetch('CC')

    # Removing the copyright statement.
    cc_raw.sub!(/ *---.+---/m, '')

    # Not any CC Lines without the copyright statement.
    return cc if cc_raw == ''

    begin
      cc_raw, copyright = cc_raw.split(/#{comment_border}/)[0]
      _ = copyright #dummy for suppress "assigned but unused variable"
      cc_raw = cc_raw.sub(dlm,'')
      cc_raw.split(dlm).each do |tmp|
        tmp = tmp.strip

        if /(^[A-Z ]+[A-Z]): (.+)/ =~ tmp
          key = $1
          body = $2
          body.gsub!(/- (?!AND)/,'-')
          body.strip!
          unless cc[key]
            cc[key] = [body]
          else
            cc[key].push(body)
          end
        else
          raise ["Error: [#{entry_id}]: CC Lines", '"', tmp, '"',
                 '', get('CC'),''].join("\n")
        end
      end
    rescue NameError
      if fetch('CC') == ''
        return {}
      else
        raise ["Error: Invalid CC Lines: [#{entry_id}]: ",
               "\n'#{self.get('CC')}'\n", "(#{$!})"].join
      end
    rescue NoMethodError
    end

    @data['CC'] = cc
  end

  case topic
  when 'ALLERGEN'
    return @data['CC'][topic]
  when 'ALTERNATIVE PRODUCTS'
    return cc_alternative_products(@data['CC'][topic])
  when 'BIOPHYSICOCHEMICAL PROPERTIES'
    return cc_biophysiochemical_properties(@data['CC'][topic])
  when 'BIOTECHNOLOGY'
    return @data['CC'][topic]
  when 'CATALITIC ACTIVITY'
    return cc_catalytic_activity(@data['CC'][topic])
  when 'CAUTION'
    return cc_caution(@data['CC'][topic])
  when 'COFACTOR'
    return @data['CC'][topic]
  when 'DEVELOPMENTAL STAGE'
    return @data['CC'][topic].join('')
  when 'DISEASE'
    return @data['CC'][topic].join('')
  when 'DOMAIN'
    return @data['CC'][topic]
  when 'ENZYME REGULATION'
    return @data['CC'][topic].join('')
  when 'FUNCTION'
    return @data['CC'][topic].join('')
  when 'INDUCTION'
    return @data['CC'][topic].join('')
  when 'INTERACTION'
    return cc_interaction(@data['CC'][topic])
  when 'MASS SPECTROMETRY'
    return cc_mass_spectrometry(@data['CC'][topic])
  when 'MISCELLANEOUS'
    return @data['CC'][topic]
  when 'PATHWAY'
    return cc_pathway(@data['CC'][topic])
  when 'PHARMACEUTICAL'
    return @data['CC'][topic]
  when 'POLYMORPHISM'
    return @data['CC'][topic]
  when 'PTM'
    return @data['CC'][topic]
  when 'RNA EDITING'
    return cc_rna_editing(@data['CC'][topic])
  when 'SIMILARITY'
    return @data['CC'][topic]
  when 'SUBCELLULAR LOCATION'
    return cc_subcellular_location(@data['CC'][topic])
  when 'SUBUNIT'
    return @data['CC'][topic]
  when 'TISSUE SPECIFICITY'
    return @data['CC'][topic]
  when 'TOXIC DOSE'
    return @data['CC'][topic]
  when 'WEB RESOURCE'
    return cc_web_resource(@data['CC'][topic])
  when 'DATABASE'
    # DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].
    tmp = Array.new
    db = @data['CC']['DATABASE']
    return db unless db

    db.each do |e|
      db = {'NAME' => nil, 'NOTE' => nil, 'WWW' => nil, 'FTP' => nil}
      e.sub(/.$/,'').split(/;/).each do |line|
        case line
        when /NAME=(.+)/
          db['NAME'] = $1
        when /NOTE=(.+)/
          db['NOTE'] = $1
        when /WWW="(.+)"/
          db['WWW'] = $1
        when /FTP="(.+)"/
          db['FTP'] = $1
        end
      end
      tmp.push(db)
    end
    return tmp
  when nil
    return @data['CC']
  else
    return @data['CC'][topic]
  end
end
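The 'DATABASE' branch of #cc splits each comment body on semicolons and picks out the NAME, NOTE, WWW and FTP fields. The following is a minimal standalone sketch of that step only. The helper name `parse_cc_database` and the sample comment body are assumptions made for illustration; the sketch does not call BioRuby.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the 'DATABASE' branch of #cc.
def parse_cc_database(bodies)
  bodies.map do |body|
    db = { 'NAME' => nil, 'NOTE' => nil, 'WWW' => nil, 'FTP' => nil }
    # Drop the final character (the trailing period), then look at each field.
    body.sub(/.$/, '').split(/;/).each do |field|
      case field
      when /NAME=(.+)/  then db['NAME'] = $1
      when /NOTE=(.+)/  then db['NOTE'] = $1
      when /WWW="(.+)"/ then db['WWW']  = $1
      when /FTP="(.+)"/ then db['FTP']  = $1
      end
    end
    db
  end
end

# Made-up comment body, for illustration only.
bodies = ['NAME=Example DB; NOTE=Mutation list; WWW="http://example.org/db".']
p parse_cc_database(bodies)
```

Fields that are absent from the comment stay `nil`, which matches the hash initialised in the listing above.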
#de ⇒ Object
Returns an Array (for new format since rel 14) or a String (for old format before rel 14) for the DE line.
# File 'lib/bio/db/embl/uniprotkb.rb', line 333

def de
  return @data['DE'] if @data['DE']

  parsed_de_line = parse_DE_line_rel14(get('DE'))
  case parsed_de_line
  when Array # new format since rel14
    @data['DE'] ||= parsed_de_line
  else
    super
  end
  @data['DE']
end
#dr(key = nil) ⇒ Object
Bio::UniProtKB#dr
# File 'lib/bio/db/embl/uniprotkb.rb', line 1171

def dr(key = nil)
  unless key
    embl_dr
  else
    (embl_dr[key] or []).map {|x|
      {'Accession' => x[0],
       'Version' => x[1],
       ' ' => x[2],
       'Molecular Type' => x[3]}
    }
  end
end
#dt(key = nil) ⇒ Object
returns a Hash of information in the DT lines.
hash keys:
['created', 'sequence', 'annotation']
Since UniProtKB release 7.0 of 07-Feb-2006, the DT line format has changed, and the word “annotation” is no longer used in DT lines. The 'annotation' key is nevertheless kept for compatibility.
returns a String of information in the DT lines by a given key.
DT Line; date (3/entry)
DT DD-MMM-YYYY (integrated into UniProtKB/XXXXX.)
DT DD-MMM-YYYY (sequence version NN)
DT DD-MMM-YYYY (entry version NN)
The format was changed in UniProtKB release 7.0 of 07-Feb-2006. Below is the older format.
Old format of DT Line; date (3/entry)
DT DD-MMM-YYYY (rel. NN, Created)
DT DD-MMM-YYYY (rel. NN, Last sequence update)
DT DD-MMM-YYYY (rel. NN, Last annotation update)
# File 'lib/bio/db/embl/uniprotkb.rb', line 157

def dt(key = nil)
  return dt[key] if key
  return @data['DT'] if @data['DT']

  part = self.get('DT').split(/\n/)
  @data['DT'] = {
    'created'    => part[0].sub(/\w{2} /,'').strip,
    'sequence'   => part[1].sub(/\w{2} /,'').strip,
    'annotation' => part[2].sub(/\w{2} /,'').strip
  }
end
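As a rough illustration of how the three DT lines map onto the 'created', 'sequence' and 'annotation' keys, here is a standalone sketch that repeats the same steps outside the class. The helper name `parse_dt` and the dates in the sample block are assumptions for illustration, not data from a real entry.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the body of #dt.
def parse_dt(dt_block)
  part = dt_block.split(/\n/)
  {
    # Each line loses its two-letter line code and surrounding whitespace.
    'created'    => part[0].sub(/\w{2} /, '').strip,
    'sequence'   => part[1].sub(/\w{2} /, '').strip,
    'annotation' => part[2].sub(/\w{2} /, '').strip
  }
end

# Made-up DT block in the post-7.0 layout.
dt_block = <<~DT
  DT   01-JAN-2000, integrated into UniProtKB/Swiss-Prot.
  DT   02-FEB-2001, sequence version 2.
  DT   03-MAR-2002, entry version 10.
DT

p parse_dt(dt_block)
```

The third line is stored under 'annotation' even though the newer format calls it the entry version, as the compatibility note above explains.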
#embl_dr ⇒ Object
Backup Bio::EMBLDB#dr as embl_dr
# File 'lib/bio/db/embl/uniprotkb.rb', line 1168

alias :embl_dr :dr
#entry_id ⇒ Object Also known as: entry_name, entry
returns the ENTRY_NAME in the ID line.
# File 'lib/bio/db/embl/uniprotkb.rb', line 98

def entry_id
  id_line('ENTRY_NAME')
end
#ft(feature_key = nil) ⇒ Object
returns contents in the feature table.
Examples
sp = Bio::UniProtKB.new(entry)
ft = sp.ft
ft.class #=> Hash
ft.keys.each do |feature_key|
ft[feature_key].each do |feature|
feature['From'] #=> '1'
feature['To'] #=> '21'
feature['Description'] #=> ''
feature['FTId'] #=> ''
feature['diff'] #=> []
feature['original'] #=> [feature_key, '1', '21', '', '']
end
end
-
Bio::UniProtKB#ft -> Hash
{FEATURE_KEY => [{'From' => int, 'To' => int, 'Description' => aStr, 'FTId' => aStr, 'diff' => [original_residues, changed_residues], 'original' => aAry }],...}
returns an Array of the information about the feature_name in the feature table.
-
Bio::UniProtKB#ft(feature_name) -> Array of Hash
[{'From' => str, 'To' => str, 'Description' => str, 'FTId' => str},...]
FT Line; feature table data (>=0, optional)
Col Data item
----- -----------------
1- 2 FT
6-13 Feature name
15-20 `FROM' endpoint
22-27 `TO' endpoint
35-75 Description (>=0 per key)
----- -----------------
Note: ‘FROM’ and ‘TO’ endpoints may contain non-numeric characters such as ‘<’, ‘>’ or ‘?’ (e.g. ‘<1’, ‘?42’).
# File 'lib/bio/db/embl/uniprotkb.rb', line 1236

def ft(feature_key = nil)
  return ft[feature_key] if feature_key
  return @data['FT'] if @data['FT']

  ftstr = get('FT')
  ftlines = ftstr.split("\n")
  for i in 0..10 do
    if /^FT +([^\s]+) +(([^\s]+)\:)?([\<\?]?[0-9]+|\?)(?:\.\.([\>\?]?[0-9]+|\?))?\s*$/ =~ ftlines[i] &&
       /^FT +\/([^\s\=]+)(?:\=(\")?(.+)(\")?)?\s*$/ =~ ftlines[i+1] then
      fmt_2019_11 = true
      break #for i
    end
  end #for i

  hash = if fmt_2019_11 then
           ft_2019_11_parser(ftlines)
         else
           ft_legacy_parser(ftlines)
         end
  @data['FT'] = hash
end
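The fixed-column layout of a legacy FT line, as given in the column table for #ft, can be illustrated with plain string slicing. This is only a sketch of the layout: `slice_legacy_ft_line` is a hypothetical helper, the sample line is made up, and the library's own `ft_legacy_parser` is not shown on this page and may work differently.

```ruby
# Hypothetical helper (not part of BioRuby): slices one legacy FT line using
# the 1-based, inclusive column ranges from the FT column table.
def slice_legacy_ft_line(line)
  {
    'Key'         => line[5, 8].to_s.strip,   # columns 6-13
    'From'        => line[14, 6].to_s.strip,  # columns 15-20
    'To'          => line[21, 6].to_s.strip,  # columns 22-27
    'Description' => line[34, 41].to_s.strip  # columns 35-75
  }
end

# Build a correctly aligned, made-up line instead of typing the spaces by hand.
line = format('FT   %-8s %6s %6s       %s', 'SIGNAL', '1', '21', 'Potential.')
p slice_legacy_ft_line(line)
```

The endpoints are kept as Strings here because, as noted for #ft, they may carry ‘<’, ‘>’ or ‘?’.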
#gene_name ⇒ Object
returns a String of the first gene name in the GN line.
# File 'lib/bio/db/embl/uniprotkb.rb', line 448

def gene_name
  (x = self.gene_names) ? x.first : nil
end
#gene_names ⇒ Object
returns an Array of gene names in the GN line.
# File 'lib/bio/db/embl/uniprotkb.rb', line 437

def gene_names
  gn # set @data['GN'] if it hasn't been already done
  if @data['GN'].first.class == Hash then
    @data['GN'].collect { |element| element[:name] }
  else
    @data['GN'].first
  end
end
#gn ⇒ Object
returns gene names in the GN line.
New UniProt/SwissProt format:
-
Bio::UniProtKB#gn -> [ <gene record>* ]
where <gene record> is:
{ :name => '...',
:synonyms => [ 's1', 's2', ... ],
:loci => [ 'l1', 'l2', ... ],
:orfs => [ 'o1', 'o2', ... ]
}
Old format:
-
Bio::UniProtKB#gn -> Array # AND
-
Bio::UniProtKB#gn -> Array # OR
GN Line: Gene name(s) (>=0, optional)
# File 'lib/bio/db/embl/uniprotkb.rb', line 361

def gn
  unless @data['GN']
    case fetch('GN')
    when /Name=/,/ORFNames=/,/OrderedLocusNames=/,/Synonyms=/
      @data['GN'] = gn_uniprot_parser
    else
      @data['GN'] = gn_old_parser
    end
  end
  @data['GN']
end
#hi ⇒ Object
The HI line
Bio::UniProtKB#hi #=> hash
# File 'lib/bio/db/embl/uniprotkb.rb', line 722

def hi
  unless @data['HI']
    @data['HI'] = []
    fetch('HI').split(/\. /).each do |hlist|
      hash = {'Category' => '', 'Keywords' => [], 'Keyword' => ''}
      hash['Category'], hash['Keywords'] = hlist.split(': ')
      hash['Keywords'] = hash['Keywords'].split('; ')
      hash['Keyword'] = hash['Keywords'].pop
      hash['Keyword'].sub!(/\.$/, '')
      @data['HI'] << hash
    end
  end
  @data['HI']
end
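To make the shape of the #hi result easier to picture, the sketch below applies the same splitting steps to a plain String. The helper name `parse_hi` and the sample text are assumptions for illustration; what `fetch('HI')` returns for a real entry is not shown on this page.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the loop in #hi.
def parse_hi(hi_text)
  hi_text.split(/\. /).map do |hlist|
    category, keywords = hlist.split(': ')
    keywords = keywords.split('; ')
    # The last element is the keyword itself; the rest are its parents.
    keyword = keywords.pop.sub(/\.$/, '')
    { 'Category' => category, 'Keywords' => keywords, 'Keyword' => keyword }
  end
end

# Made-up hierarchy text with two entries.
text = 'Molecular function: Ionic channel; Calcium channel. Technical term: Transmembrane.'
p parse_hi(text)
```

Each element carries the category, the intermediate keywords, and the final keyword with its trailing period removed.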
#id_line(key = nil) ⇒ Object
returns a Hash of the ID line.
returns the value (Integer or String) of the ID line for a given key. Hash keys: ['ENTRY_NAME', 'DATA_CLASS', 'MOLECULE_TYPE', 'SEQUENCE_LENGTH']
ID Line (since UniProtKB release 9.0 of 31-Oct-2006)
ID P53_HUMAN Reviewed; 393 AA.
#"ID #{ENTRY_NAME} #{DATA_CLASS}; #{SEQUENCE_LENGTH}."
Examples
obj.id_line #=> {"ENTRY_NAME"=>"P53_HUMAN", "DATA_CLASS"=>"Reviewed",
"SEQUENCE_LENGTH"=>393, "MOLECULE_TYPE"=>nil}
obj.id_line('ENTRY_NAME') #=> "P53_HUMAN"
ID Line (older style)
ID P53_HUMAN STANDARD; PRT; 393 AA.
#"ID #{ENTRY_NAME} #{DATA_CLASS}; #{MOLECULE_TYPE}; #{SEQUENCE_LENGTH}."
Examples
obj.id_line #=> {"ENTRY_NAME"=>"P53_HUMAN", "DATA_CLASS"=>"STANDARD",
"SEQUENCE_LENGTH"=>393, "MOLECULE_TYPE"=>"PRT"}
obj.id_line('ENTRY_NAME') #=> "P53_HUMAN"
# File 'lib/bio/db/embl/uniprotkb.rb', line 73

def id_line(key = nil)
  return id_line[key] if key
  return @data['ID'] if @data['ID']

  part = @orig['ID'].split(/ +/)
  if part[4].to_s.chomp == 'AA.' then
    # after UniProtKB release 9.0 of 31-Oct-2006
    # (http://www.uniprot.org/docs/sp_news.htm)
    molecule_type = nil
    sequence_length = part[3].to_i
  else
    molecule_type = part[3].sub(/;/,'')
    sequence_length = part[4].to_i
  end
  @data['ID'] = {
    'ENTRY_NAME' => part[1],
    'DATA_CLASS' => part[2].sub(/;/,''),
    'MOLECULE_TYPE' => molecule_type,
    'SEQUENCE_LENGTH' => sequence_length
  }
end
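The way #id_line tells the two ID line styles apart (by checking whether the fifth whitespace-separated field is 'AA.') can be tried out in isolation. The sketch below assumes a hypothetical helper named `parse_id_line` and feeds it the two sample lines documented for #id_line; it does not use BioRuby.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the body of #id_line.
def parse_id_line(line)
  part = line.split(/ +/)
  if part[4].to_s.chomp == 'AA.'
    # Newer style: no molecule type field.
    molecule_type = nil
    sequence_length = part[3].to_i
  else
    # Older style: molecule type precedes the length.
    molecule_type = part[3].sub(/;/, '')
    sequence_length = part[4].to_i
  end
  {
    'ENTRY_NAME'      => part[1],
    'DATA_CLASS'      => part[2].sub(/;/, ''),
    'MOLECULE_TYPE'   => molecule_type,
    'SEQUENCE_LENGTH' => sequence_length
  }
end

p parse_id_line('ID P53_HUMAN Reviewed; 393 AA.')
p parse_id_line('ID P53_HUMAN STANDARD; PRT; 393 AA.')
```

Both calls report a SEQUENCE_LENGTH of 393; only the older style yields a non-nil MOLECULE_TYPE.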
#molecule ⇒ Object Also known as: molecule_type
returns the MOLECULE_TYPE in the ID line.
A short-cut for Bio::UniProtKB#id_line('MOLECULE_TYPE').
# File 'lib/bio/db/embl/uniprotkb.rb', line 108

def molecule
  id_line('MOLECULE_TYPE')
end
#oh ⇒ Object
The OH line.
OH NCBI_TaxID=TaxID; HostName. br.expasy.org/sprot/userman.html#OH_line
# File 'lib/bio/db/embl/uniprotkb.rb', line 531

def oh
  unless @data['OH']
    oh = []
    a = fetch('OH').split(/(NCBI\_TaxID\=)(\d+)(\;)/)
    t = catch :error do
      taxid = nil
      host_name = nil
      while x = a.shift
        x = x.to_s.strip
        case x
        when ''
          next
        when 'NCBI_TaxID='
          if taxid then
            oh.push({'NCBI_TaxID' => taxid, 'HostName' => host_name})
            taxid = nil
            host_name = nil
          end
          taxid = a.shift
          throw :error, :missing_semicolon if a.shift != ';'
        else
          throw :error, :missing_taxid if host_name
          host_name = x
          host_name.sub!(/\.\z/, '')
        end
      end #while x...
      if taxid then
        oh.push({'NCBI_TaxID' => taxid, 'HostName' => host_name})
      elsif host_name then
        throw :error, :missing_taxid_last
      end
      nil
    end #t = catch...
    if t then
      raise ArgumentError, ["Error: Invalid OH line format (#{self.entry_id}):",
                            $!, "\n", get('OH'), "\n"].join
    end
    @data['OH'] = oh
  end
  @data['OH']
end
#os(num = nil) ⇒ Object
returns an Array of Hashes, or a String of the OS line when an index is given.
-
Bio::EMBLDB#os -> Array
[{'name' => '(Human)', 'os' => 'Homo sapiens'},
{'name' => '(Rat)', 'os' => 'Rattus norvegicus'}]
-
Bio::EPTR#os -> Hash
{'name' => "(Human)", 'os' => 'Homo sapiens'}
-
Bio::UniProtKB#os[0]['name'] -> "(Human)"
-
Bio::EPTR#os(0) -> "Homo sapiens (Human)"
OS Line; organism species (>=1)
OS Genus species (name).
OS Genus species (name0) (name1).
OS Genus species (name0) (name1).
OS Genus species (name0), G s0 (name0), and G s (name0) (name1).
OS Homo sapiens (Human), and Rattus norvegicus (Rat)
OS Hippotis sp. Clark and Watts 825.
OS unknown cyperaceous sp.
# File 'lib/bio/db/embl/uniprotkb.rb', line 470

def os(num = nil)
  unless @data['OS']
    os = Array.new
    fetch('OS').split(/, and|, /).each do |tmp|
      if tmp =~ /(\w+ *[\w \:\'\+\-\.]+[\w\.])/
        org = $1
        tmp =~ /(\(.+\))/
        os.push({'name' => $1, 'os' => org})
      else
        raise "Error: OS Line. #{$!}\n#{fetch('OS')}\n"
      end
    end
    @data['OS'] = os
  end

  if num
    # EX. "Trifolium repens (white clover)"
    return "#{@data['OS'][num]['os']} #{@data['OS'][num]['name']}"
  else
    return @data['OS']
  end
end
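The splitting of an OS line into one Hash per organism can be sketched on a plain String. `parse_os` is a hypothetical helper that reuses the two regular expressions from the listing above; it takes the OS text directly rather than reading it from an entry.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the parsing half of #os.
def parse_os(os_text)
  os_text.split(/, and|, /).map do |tmp|
    unless tmp =~ /(\w+ *[\w \:\'\+\-\.]+[\w\.])/
      raise ArgumentError, "Error: OS Line.\n#{os_text}\n"
    end
    org = $1               # scientific name, e.g. "Homo sapiens"
    tmp =~ /(\(.+\))/      # common name in parentheses, if any
    { 'name' => $1, 'os' => org }
  end
end

organisms = parse_os('Homo sapiens (Human), and Rattus norvegicus (Rat)')
p organisms
# String form for the first organism, built the same way as os(0).
puts "#{organisms[0]['os']} #{organisms[0]['name']}"
```

When a fragment has no parenthesised common name, the second match fails and 'name' is left as `nil`.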
#ox ⇒ Object
returns a Hash of organism taxonomy cross-references.
-
Bio::UniProtKB#ox -> Hash
{'NCBI_TaxID' => ['1234','2345','3456','4567'], ...}
OX Line; organism taxonomy cross-reference (>=1 per entry)
OX NCBI_TaxID=1234;
OX NCBI_TaxID=1234, 2345, 3456, 4567;
# File 'lib/bio/db/embl/uniprotkb.rb', line 514

def ox
  unless @data['OX']
    tmp = fetch('OX').sub(/\.$/,'').split(/;/).map { |e| e.strip }
    hsh = Hash.new
    tmp.each do |e|
      db,refs = e.split(/=/)
      hsh[db] = refs.split(/, */)
    end
    @data['OX'] = hsh
  end
  return @data['OX']
end
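A small standalone sketch of the OX parsing, using the sample line documented for #ox. The helper name `parse_ox` is an assumption for illustration; the sketch takes the OX text as an argument instead of fetching it from an entry.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the body of #ox.
def parse_ox(ox_text)
  hsh = {}
  ox_text.sub(/\.$/, '').split(/;/).map { |e| e.strip }.each do |e|
    db, refs = e.split(/=/)
    # One database key maps to a list of identifiers kept as Strings.
    hsh[db] = refs.split(/, */)
  end
  hsh
end

p parse_ox('NCBI_TaxID=1234, 2345, 3456, 4567;')
```

The identifiers stay Strings, as in the documented return value.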
#protein_name ⇒ Object
returns the proposed official name of the protein. Returns a String.
Since UniProtKB release 14.0 of 22-Jul-2008, the DE line format has changed. The method returns the full name taken from the “RecName: Full=” or “SubName: Full=” line, normally at the beginning of the DE lines. Unlike the parser for the old format, it applies no special treatment for fragments or precursors.
For old format, the method parses the DE lines and returns the protein name as a String.
DE Line; description (>=1)
"DE #{OFFICIAL_NAME} (#{SYNONYM})"
"DE #{OFFICIAL_NAME} (#{SYNONYM}) [CONTAINS: #1; #2]."
OFFICIAL_NAME 1/entry
SYNONYM >=0
CONTAINS >=0
# File 'lib/bio/db/embl/uniprotkb.rb', line 250

def protein_name
  parsed_de_line = self.de
  if parsed_de_line.kind_of?(Array) then
    # since UniProtKB release 14.0 of 22-Jul-2008
    name = nil
    parsed_de_line.each do |a|
      case a[0]
      when 'RecName', 'SubName'
        if name_pair = a[1..-1].find { |b| b[0] == 'Full' } then
          name = name_pair[1]
          break
        end
      end
    end
    name = name.to_s
  else
    # old format (before Rel. 13.x)
    name = ""
    if de_line = fetch('DE') then
      str = de_line[/^[^\[]*/] # everything preceding the first [ (the "contains" part)
      name = str[/^[^(]*/].strip
      name << ' (Fragment)' if str =~ /fragment/i
    end
  end
  return name
end
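The old-format branch of #protein_name is easy to try on its own. The sketch below assumes a hypothetical helper `old_format_protein_name` and made-up DE text; it covers only the pre-release-14 path and does not call BioRuby.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the old-format branch
# of #protein_name.
def old_format_protein_name(de_line)
  str = de_line[/^[^\[]*/]       # everything before the first "["
  name = str[/^[^(]*/].strip     # everything before the first "("
  name << ' (Fragment)' if str =~ /fragment/i
  name
end

# Made-up DE lines, for illustration only.
puts old_format_protein_name('Example precursor (EXP) [Contains: Example peptide].')
puts old_format_protein_name('Example protein (EP) (Fragment).')
```

The first call drops both the synonym and the bracketed part; the second keeps a ' (Fragment)' suffix because the word appears before any "[".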
#ref ⇒ Object
returns contents in the R lines.
-
Bio::EMBLDB::Common#ref -> [ <reference information Hash>* ]
where <reference information Hash> is:
{'RN' => '', 'RC' => '', 'RP' => '', 'RX' => '',
'RA' => '', 'RT' => '', 'RL' => '', 'RG' => ''}
R Lines
-
RN RC RP RX RA RT RL RG
# File 'lib/bio/db/embl/uniprotkb.rb', line 588

def ref
  unless @data['R']
    @data['R'] = [get('R').split(/\nRN /)].flatten.map { |str|
      hash = {'RN' => '', 'RC' => '', 'RP' => '', 'RX' => '',
              'RA' => '', 'RT' => '', 'RL' => '', 'RG' => ''}
      str = 'RN ' + str unless /^RN / =~ str

      str.split("\n").each do |line|
        if /^(R[NPXARLCTG]) (.+)/ =~ line
          hash[$1] += $2 + ' '
        else
          raise "Invalid format in R lines, \n[#{line}]\n"
        end
      end

      hash['RN'] = set_RN(hash['RN'])
      hash['RC'] = set_RC(hash['RC'])
      hash['RP'] = set_RP(hash['RP'])
      hash['RX'] = set_RX(hash['RX'])
      hash['RA'] = set_RA(hash['RA'])
      hash['RT'] = set_RT(hash['RT'])
      hash['RL'] = set_RL(hash['RL'])
      hash['RG'] = set_RG(hash['RG'])

      hash
    }
  end
  @data['R']
end
#references ⇒ Object
returns a Bio::References object built from Bio::EMBLDB::Common#ref.
-
Bio::UniProtKB#references -> Bio::References
# File 'lib/bio/db/embl/uniprotkb.rb', line 682

def references
  unless @data['references']
    ary = self.ref.map {|ent|
      hash = Hash.new('')
      ent.each {|key, value|
        case key
        when 'RA'
          hash['authors'] = value.split(/, /)
        when 'RT'
          hash['title'] = value
        when 'RL'
          if value =~ /(.*) (\d+) \((\d+)\), (\d+-\d+) \((\d+)\)$/
            hash['journal'] = $1
            hash['volume'] = $2
            hash['issue'] = $3
            hash['pages'] = $4
            hash['year'] = $5
          else
            hash['journal'] = value
          end
        when 'RX' # PUBMED, MEDLINE, DOI
          value.each do |tag, xref|
            hash[ tag.downcase ] = xref
          end
        end
      }
      Reference.new(hash)
    }
    @data['references'] = References.new(ary)
  end
  @data['references']
end
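The 'RL' branch of #references relies on a single regular expression to split a location string into journal, volume, issue, pages and year, and falls back to storing the whole value as the journal. The sketch below applies that expression to made-up strings. The helper name `parse_rl` is hypothetical, and the exact form of the value after `set_RL` has processed it is not shown on this page, so the inputs are assumptions.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the 'RL' branch of
# #references.
def parse_rl(value)
  if value =~ /(.*) (\d+) \((\d+)\), (\d+-\d+) \((\d+)\)$/
    { 'journal' => $1, 'volume' => $2, 'issue' => $3,
      'pages' => $4, 'year' => $5 }
  else
    # Anything that does not fit the pattern is kept whole.
    { 'journal' => value }
  end
end

# Made-up location strings, for illustration only.
p parse_rl('Example J. 12 (3), 100-110 (2001)')
p parse_rl('Submitted (JAN-2000) to an example database')
```

A value that does not end in the "volume (issue), pages (year)" shape ends up entirely in 'journal'.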
#seq ⇒ Object Also known as: aaseq
returns a Bio::Sequence::AA of the amino acid sequence.
-
Bio::UniProtKB#seq -> Bio::Sequence::AA
blank Line; sequence data (>=1)
# File 'lib/bio/db/embl/uniprotkb.rb', line 1464

def seq
  unless @data['']
    @data[''] = Sequence::AA.new( fetch('').gsub(/ |\d+/,'') )
  end
  return @data['']
end
#sequence_length ⇒ Object Also known as: aalen
returns the SEQUENCE_LENGTH in the ID line.
A short-cut for Bio::UniProtKB#id_line('SEQUENCE_LENGTH').
# File 'lib/bio/db/embl/uniprotkb.rb', line 117

def sequence_length
  id_line('SEQUENCE_LENGTH')
end
#set_RN(data) ⇒ Object
# File 'lib/bio/db/embl/uniprotkb.rb', line 619

def set_RN(data)
  data.strip
end
#sq(key = nil) ⇒ Object
returns a Hash of contents in the SQ lines.
-
Bio::UniProtKB#sq -> hsh
returns a value of a key given in the SQ lines.
-
Bio::UniProtKB#sq(key) -> int or str
-
Keys: ['MW', 'mw', 'molecular', 'weight', 'aalen', 'len', 'length', 'CRC64']
SQ Line; sequence header (1/entry)
SQ SEQUENCE 233 AA; 25630 MW; 146A1B48A1475C86 CRC64;
SQ SEQUENCE \d+ AA; \d+ MW; [0-9A-Z]+ CRC64;
MW, Dalton unit. CRC64 (64-bit Cyclic Redundancy Check, ISO 3309).
# File 'lib/bio/db/embl/uniprotkb.rb', line 1436

def sq(key = nil)
  unless @data['SQ']
    if fetch('SQ') =~ /(\d+) AA\; (\d+) MW; (.+) CRC64;/
      @data['SQ'] = { 'aalen' => $1.to_i, 'MW' => $2.to_i, 'CRC64' => $3 }
    else
      raise "Invalid SQ Line: \n'#{fetch('SQ')}'"
    end
  end

  if key
    case key
    when /mw/, /molecular/, /weight/
      @data['SQ']['MW']
    when /len/, /length/, /AA/
      @data['SQ']['aalen']
    else
      @data['SQ'][key]
    end
  else
    @data['SQ']
  end
end
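The SQ header parsing and the key aliases can be exercised without an entry object. The sketch below uses two hypothetical helpers, `parse_sq` and `sq_lookup`, together with the sample SQ line from the #sq documentation; it does not call BioRuby.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the parsing half of #sq.
def parse_sq(sq_text)
  if sq_text =~ /(\d+) AA\; (\d+) MW; (.+) CRC64;/
    { 'aalen' => $1.to_i, 'MW' => $2.to_i, 'CRC64' => $3 }
  else
    raise ArgumentError, "Invalid SQ Line: \n'#{sq_text}'"
  end
end

# Hypothetical helper: mirrors the key dispatch half of #sq.
def sq_lookup(sq, key)
  case key
  when /mw/, /molecular/, /weight/ then sq['MW']
  when /len/, /length/, /AA/       then sq['aalen']
  else sq[key]
  end
end

sq = parse_sq('SEQUENCE 233 AA; 25630 MW; 146A1B48A1475C86 CRC64;')
p sq
p sq_lookup(sq, 'weight')
p sq_lookup(sq, 'length')
p sq_lookup(sq, 'CRC64')
```

The alias patterns are case-sensitive, so 'MW' reaches the Hash through the final `else` branch rather than through `/mw/`.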
#synonyms ⇒ Object
returns synonyms (unofficial and/or alternative names). Returns an Array containing String objects.
Since UniProtKB release 14.0 of 22-Jul-2008, the DE line format has changed. The method returns the full or short names taken from “RecName: Short=”, “RecName: EC=”, and AltName lines, except those after “Contains:” or “Includes:”. For compatibility with the old format parser, “RecName: EC=N.N.N.N” is reported as “EC N.N.N.N”. In addition, to prevent confusion, “Allergen=” and “CD_antigen=” prefixes are added for the corresponding fields.
For old format, the method parses the DE lines and returns synonyms. Each synonym is placed in parentheses following the official name on the DE line.
# File 'lib/bio/db/embl/uniprotkb.rb', line 291

def synonyms
  ary = Array.new
  parsed_de_line = self.de
  if parsed_de_line.kind_of?(Array) then
    # since UniProtKB release 14.0 of 22-Jul-2008
    parsed_de_line.each do |a|
      case a[0]
      when 'Includes', 'Contains'
        break #the each loop
      when 'RecName', 'SubName', 'AltName'
        a[1..-1].each do |b|
          if name = b[1] and b[1] != self.protein_name then
            case b[0]
            when 'EC'
              name = "EC " + b[1]
            when 'Allergen', 'CD_antigen'
              name = b[0] + '=' + b[1]
            else
              name = b[1]
            end
            ary.push name
          end
        end
      end #case a[0]
    end #parsed_de_line.each
  else
    # old format (before Rel. 13.x)
    if de_line = fetch('DE') then
      line = de_line.sub(/\[.*\]/,'') # ignore stuff between [ and ]. That's the "contains" part
      line.scan(/\([^)]+/) do |synonym|
        unless synonym =~ /fragment/i then
          ary << synonym[1..-1].strip # index to remove the leading (
        end
      end
    end
  end
  return ary
end
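For the old DE format, the synonym extraction comes down to removing the bracketed part and scanning for parenthesised names. A standalone sketch of that branch follows; `old_format_synonyms` is a hypothetical helper and the DE text is made up for illustration.

```ruby
# Hypothetical helper (not part of BioRuby): mirrors the old-format branch
# of #synonyms.
def old_format_synonyms(de_line)
  ary = []
  line = de_line.sub(/\[.*\]/, '')   # ignore the "[Contains: ...]" part
  line.scan(/\([^)]+/) do |synonym|
    # Skip the "(Fragment)" marker; drop the leading "(" from real synonyms.
    ary << synonym[1..-1].strip unless synonym =~ /fragment/i
  end
  ary
end

# Made-up DE line, for illustration only.
p old_format_synonyms('Example protein (EP) (Alt name) (Fragment) [Contains: Peptide (P1)].')
```

Names inside the bracketed "Contains" section are not reported, because that section is removed before scanning.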