Class: MiGA::RemoteDataset
Defined in:
- lib/miga/remote_dataset.rb
- lib/miga/remote_dataset/base.rb
- lib/miga/remote_dataset/download.rb
Overview
MiGA representation of datasets with data in remote locations.
Defined Under Namespace
Constant Summary
Constants included from MiGA
CITATION, VERSION, VERSION_DATE, VERSION_NAME
Instance Attribute Summary
- #db ⇒ Object (readonly): Database storing the dataset.
- #ids ⇒ Object (readonly): Array of IDs of the entries composing the dataset.
- #metadata ⇒ Object (readonly): Internal metadata hash.
- #universe ⇒ Object (readonly): Universe of the dataset.
Class Method Summary
- .download(universe, db, ids, format, file = nil, extra = [], obj = nil) ⇒ Object: Download data from the universe in the database db with IDs ids and in format.
- .download_rest(opts) ⇒ Object (also: download_net): Download data using the REST method.
- .download_url(url) ⇒ Object: Download the given url and return the result regardless of response code.
- .ncbi_asm_acc2id(acc) ⇒ Object
- .ncbi_asm_rest(opts) ⇒ Object: Download data from NCBI Assembly database using the REST method.
- .ncbi_gb_rest(opts) ⇒ Object: Download data from NCBI GenBank (nuccore) database using the REST method.
- .ncbi_map(id, dbfrom, db) ⇒ Object: Looks for the entry id in dbfrom, and returns the linked identifier in db (or nil).
- .UNIVERSE ⇒ Object
Instance Method Summary
- #get_gtdb_taxonomy ⇒ Object: Get GTDB taxonomy as MiGA::Taxonomy.
- #get_metadata(metadata_def = {}) ⇒ Object: Get metadata from the remote location.
- #get_ncbi_taxid ⇒ Object: Get NCBI Taxonomy ID.
- #get_ncbi_taxonomy ⇒ Object: Get NCBI taxonomy as MiGA::Taxonomy.
- #get_type_status(metadata) ⇒ Object: Get the type material status and return an (updated) metadata hash.
- #initialize(ids, db, universe) ⇒ RemoteDataset (constructor): Initialize MiGA::RemoteDataset with ids in database db from universe.
- #ncbi_asm_json_doc ⇒ Object: Get the JSON document describing an NCBI assembly entry.
- #save_to(project, name = nil, is_ref = true, metadata_def = {}) ⇒ Object: Save dataset to the MiGA::Project project identified with name.
- #update_metadata(dataset, metadata = {}) ⇒ Object: Updates the MiGA::Dataset dataset with the remotely available metadata, and optionally the Hash metadata.
Methods included from Download
Methods inherited from MiGA
CITATION, CITATION_ARRAY, DEBUG, DEBUG_OFF, DEBUG_ON, DEBUG_TRACE_OFF, DEBUG_TRACE_ON, FULL_VERSION, LONG_VERSION, VERSION, VERSION_DATE, #advance, debug?, debug_trace?, initialized?, #like_io?, #num_suffix, rc_path, #result_files_exist?, #say
Methods included from Common::Path
Methods included from Common::Format
#clean_fasta_file, #seqs_length, #tabulate
Methods included from Common::Net
#download_file_ftp, #known_hosts, #remote_connection
Methods included from Common::SystemCall
Constructor Details
#initialize(ids, db, universe) ⇒ RemoteDataset
Initialize MiGA::RemoteDataset with ids in database db from universe.
    # File 'lib/miga/remote_dataset.rb', line 42
    def initialize(ids, db, universe)
      ids = [ids] unless ids.is_a? Array
      @ids = (ids.is_a?(Array) ? ids : [ids])
      @db = db.to_sym
      @universe = universe.to_sym
      @metadata = {}
      @metadata[:"#{universe}_#{db}"] = ids.join(',')
      @@UNIVERSE.keys.include?(@universe) or
        raise "Unknown Universe: #{@universe}. Try: #{@@UNIVERSE.keys}"
      @@UNIVERSE[@universe][:dbs].include?(@db) or
        raise "Unknown Database: #{@db}. Try: #{@@UNIVERSE[@universe][:dbs].keys}"
      @_ncbi_asm_json_doc = nil
      # FIXME: Part of the +map_to+ support:
      # unless @@UNIVERSE[@universe][:dbs][@db][:map_to].nil?
      #   MiGA::RemoteDataset.download
      # end
    end
Instance Attribute Details
#db ⇒ Object (readonly)
Database storing the dataset.
    # File 'lib/miga/remote_dataset.rb', line 32
    def db
      @db
    end
#ids ⇒ Object (readonly)
Array of IDs of the entries composing the dataset.
    # File 'lib/miga/remote_dataset.rb', line 34
    def ids
      @ids
    end
#metadata ⇒ Object (readonly)
Internal metadata hash.
    # File 'lib/miga/remote_dataset.rb', line 36
    def metadata
      @metadata
    end
#universe ⇒ Object (readonly)
Universe of the dataset.
    # File 'lib/miga/remote_dataset.rb', line 30
    def universe
      @universe
    end
Class Method Details
.download(universe, db, ids, format, file = nil, extra = [], obj = nil) ⇒ Object
Download data from the universe in the database db with IDs ids and in format. If passed, it saves the result in file. Additional parameters specific to the download method can be passed using extra. Returns String. The obj can also be passed as MiGA::RemoteDataset or MiGA::Dataset.
    # File 'lib/miga/remote_dataset/download.rb', line 14
    def download(universe, db, ids, format, file = nil, extra = [], obj = nil)
      ids = [ids] unless ids.is_a? Array
      getter = @@UNIVERSE[universe][:dbs][db][:getter] || :download
      method = @@UNIVERSE[universe][:method]
      opts = {
        universe: universe,
        db: db,
        ids: ids,
        format: format,
        file: file,
        extra: extra,
        obj: obj
      }
      doc = send("#{getter}_#{method}", opts)

      unless opts[:file].nil?
        ofh = File.open(opts[:file], 'w')
        ofh.print doc.force_encoding('UTF-8')
        ofh.close
      end
      doc
    end
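The listing above builds a method name from the database's getter (defaulting to download) and the universe's method, then invokes it with send. The following is a minimal, self-contained sketch of that dispatch pattern only. TOY_UNIVERSE, ToyDownloader, and the returned strings are hypothetical stand-ins, not MiGA's actual registry or behavior.

```ruby
# Hypothetical registry shaped like @@UNIVERSE (not MiGA's real data).
TOY_UNIVERSE = {
  web:  { method: :net,  dbs: { text: {} } },
  ncbi: { method: :rest, dbs: { nuccore: { getter: :ncbi_gb }, taxonomy: {} } }
}.freeze

module ToyDownloader
  module_function

  # Resolve "<getter>_<method>" and call it; the getter defaults to :download.
  def download(universe, db, ids)
    ids = [ids] unless ids.is_a? Array
    getter = TOY_UNIVERSE[universe][:dbs][db][:getter] || :download
    method = TOY_UNIVERSE[universe][:method]
    send("#{getter}_#{method}", { ids: ids })
  end

  # Toy handlers: each only reports which handler was selected.
  def download_rest(opts)
    "download_rest(#{opts[:ids].join(',')})"
  end

  def download_net(opts)
    "download_net(#{opts[:ids].join(',')})"
  end

  def ncbi_gb_rest(opts)
    "ncbi_gb_rest(#{opts[:ids].join(',')})"
  end
end
```

In this sketch, a database without a :getter entry falls through to the generic download_<method> handler, while one with a :getter is routed to its specialized handler.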
.download_rest(opts) ⇒ Object Also known as: download_net
Download data using the REST method. Supported opts (Hash) include:
- universe (mandatory): Symbol
- db (mandatory): Symbol
- ids (mandatory): Array of String
- format: String
- extra: Array
    # File 'lib/miga/remote_dataset/download.rb', line 82
    def download_rest(opts)
      u = @@UNIVERSE[opts[:universe]]
      url = sprintf(
        u[:url], opts[:db], opts[:ids].join(','), opts[:format], *opts[:extra]
      )
      url = u[:api_key][url] unless u[:api_key].nil?
      download_url url
    end
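As a rough illustration of how the URL is assembled in the listing above, the sketch below fills a format-string template with sprintf and then passes the result through an optional callable stored under :api_key. TOY_ENTRY, its URL template, the key value, and build_url are all assumptions made up for this example; the real templates live in MiGA's universe definitions.

```ruby
# Hypothetical universe entry; URL template and key are invented for illustration.
TOY_ENTRY = {
  url: 'https://example.org/api/%1$s?id=%2$s&format=%3$s',
  api_key: ->(url) { "#{url}&api_key=SECRET" }
}.freeze

# Build the request URL from the template and the download options.
def build_url(entry, opts)
  url = sprintf(
    entry[:url], opts[:db], opts[:ids].join(','), opts[:format], *opts[:extra]
  )
  # Proc#[] is an alias of Proc#call, so entry[:api_key][url] invokes the lambda.
  url = entry[:api_key][url] unless entry[:api_key].nil?
  url
end
```

Storing the key handling as a callable lets each universe decide how (or whether) to decorate the URL, without download_rest knowing the details.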
.download_url(url) ⇒ Object
Download the given url and return the result regardless of response code. Attempts the download up to three times before re-raising the last error (for example, Net::ReadTimeout).
    # File 'lib/miga/remote_dataset/download.rb', line 98
    def download_url(url)
      doc = ''
      @timeout_try = 0
      begin
        DEBUG 'GET: ' + url
        URI.parse(url).open(read_timeout: 600) { |f| doc = f.read }
      rescue => e
        @timeout_try += 1
        raise e if @timeout_try >= 3

        sleep 5 # <- For: 429 Too Many Requests
        DEBUG "RETRYING after: #{e}"
        retry
      end
      doc
    end
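The bounded-retry pattern used in the listing above can be isolated in a small helper. This is only a sketch: with_retries and its max_tries and wait parameters are hypothetical names, and no network access is involved.

```ruby
# Run the block, retrying on any StandardError up to max_tries attempts.
# Hypothetical helper for illustration; not part of MiGA.
def with_retries(max_tries: 3, wait: 0)
  tries = 0
  begin
    yield
  rescue StandardError => e
    tries += 1
    # Give up and surface the last error once the budget is exhausted.
    raise e if tries >= max_tries

    sleep wait
    retry
  end
end
```

A call such as with_retries(wait: 5) { fetch_something } would then pause between attempts, much like the fixed five-second sleep in the listing (fetch_something being a placeholder for any operation that may fail transiently).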
.ncbi_asm_acc2id(acc) ⇒ Object
    # File 'lib/miga/remote_dataset.rb', line 15
    def ncbi_asm_acc2id(acc)
      return acc if acc =~ /^\d+$/

      search_doc = MiGA::Json.parse(
        download(:ncbi_search, :assembly, acc, :json),
        symbolize: false, contents: true
      )
      (search_doc['esearchresult']['idlist'] || []).first
    end
.ncbi_asm_rest(opts) ⇒ Object
Download data from NCBI Assembly database using the REST method. Supported opts (Hash) include:
- obj (mandatory): MiGA::RemoteDataset
- ids (mandatory): String or Array of String
- file: String, passed to download
- extra: Array, passed to download
- format: String, passed to download
    # File 'lib/miga/remote_dataset/download.rb', line 44
    def ncbi_asm_rest(opts)
      url_dir = opts[:obj].ncbi_asm_json_doc['ftppath_genbank']
      if url_dir.nil? || url_dir.empty?
        raise MiGA::RemoteDataMissingError.new(
          "Missing ftppath_genbank in NCBI Assembly JSON"
        )
      end

      url = "#{url_dir}/#{File.basename url_dir}_genomic.fna.gz"
      download(
        :web, :assembly_gz, url,
        opts[:format], opts[:file], opts[:extra], opts[:obj]
      )
    end
.ncbi_gb_rest(opts) ⇒ Object
Download data from NCBI GenBank (nuccore) database using the REST method. Supported opts (Hash) are the same as #download_rest and #ncbi_asm_rest.
    # File 'lib/miga/remote_dataset/download.rb', line 62
    def ncbi_gb_rest(opts)
      o = download_rest(opts)
      return o unless o.strip.empty?

      MiGA::MiGA.DEBUG 'Empty sequence, attempting download from NCBI assembly'
      opts[:format] = :fasta_gz
      if opts[:file]
        File.unlink(opts[:file]) if File.exist? opts[:file]
        opts[:file] = "#{opts[:file]}.gz"
      end
      ncbi_asm_rest(opts)
    end
.ncbi_map(id, dbfrom, db) ⇒ Object
Looks for the entry id in dbfrom, and returns the linked identifier in db (or nil).
    # File 'lib/miga/remote_dataset/download.rb', line 118
    def ncbi_map(id, dbfrom, db)
      doc = download(:ncbi_map, dbfrom, id, :json, nil, [db])
      return if doc.empty?

      tree = MiGA::Json.parse(doc, contents: true)
      [:linksets, 0, :linksetdbs, 0, :links, 0].each do |i|
        tree = tree[i]
        break if tree.nil?
      end
      tree
    end
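The traversal in the listing above walks a fixed key path through the parsed document and stops at the first missing step. The sketch below reproduces that walk on its own, using Ruby's standard JSON library with symbolize_names as a stand-in for MiGA::Json.parse. The walk helper, LINK_PATH constant, and the sample document (including its IDs) are invented for illustration.

```ruby
require 'json'

# Key path used by the traversal shown above.
LINK_PATH = [:linksets, 0, :linksetdbs, 0, :links, 0].freeze

# Follow path through nested Hashes and Arrays; return nil as soon as a step is missing.
def walk(tree, path)
  path.each do |key|
    tree = tree[key]
    break if tree.nil?
  end
  tree
end

# Hypothetical response shaped like the keys above (made-up identifiers).
sample = '{"linksets":[{"linksetdbs":[{"links":["12345","67890"]}]}]}'
first_link = walk(JSON.parse(sample, symbolize_names: true), LINK_PATH)
puts first_link.inspect
```

Breaking out on the first nil is what lets the method return nil for entries without links instead of raising NoMethodError partway down the path.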
.UNIVERSE ⇒ Object
    # File 'lib/miga/remote_dataset/base.rb', line 7
    def UNIVERSE
      @@UNIVERSE
    end
Instance Method Details
#get_gtdb_taxonomy ⇒ Object
Get GTDB taxonomy as MiGA::Taxonomy.
    # File 'lib/miga/remote_dataset.rb', line 154
    def get_gtdb_taxonomy
      gtdb_genome = metadata[:gtdb_assembly] or return

      doc = MiGA::Json.parse(
        MiGA::RemoteDataset.download(
          :gtdb, :genome, gtdb_genome, 'taxon-history', nil, ['']
        ),
        contents: true
      )
      lineage = { ns: 'gtdb' }
      lineage.merge!(doc.first) # Get only the latest available classification
      release = lineage.delete(:release)
      @metadata[:gtdb_release] = release
      lineage.transform_values! { |v| v.gsub(/^\S__/, '') }
      MiGA.DEBUG "Got lineage from #{release}: #{lineage}"
      MiGA::Taxonomy.new(lineage)
    end
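The lineage clean-up step in the listing above (separating the release label and stripping rank prefixes such as d__ or s__) can be tried in isolation. The sketch below is an approximation under assumptions: clean_gtdb_entry is a hypothetical helper, and the sample entry's keys and values are made up rather than taken from a real GTDB response.

```ruby
# Split a taxon-history-like entry into a prefix-free lineage and its release.
# Hypothetical helper; the entry shape is assumed for illustration.
def clean_gtdb_entry(entry)
  lineage = { ns: 'gtdb' }
  lineage.merge!(entry)
  release = lineage.delete(:release)
  # Drop a leading "<one non-space char>__" from every value, e.g. "g__".
  lineage.transform_values! { |v| v.gsub(/^\S__/, '') }
  [lineage, release]
end

# Made-up sample entry.
entry = { release: 'R214', d: 'd__Bacteria', g: 'g__Escherichia', s: 's__Escherichia coli' }
lineage, release = clean_gtdb_entry(entry)
puts "#{release}: #{lineage}"
```

Note that the namespace value 'gtdb' passes through the substitution unchanged because it has no rank prefix.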
#get_metadata(metadata_def = {}) ⇒ Object
Get metadata from the remote location.
    # File 'lib/miga/remote_dataset.rb', line 101
    def get_metadata(metadata_def = {})
      metadata_def.each { |k, v| @metadata[k] = v }
      case universe
      when :ebi, :ncbi, :web
        # Get taxonomy
        @metadata[:tax] = get_ncbi_taxonomy
      when :gtdb
        # Get taxonomy
        @metadata[:tax] = get_gtdb_taxonomy
      end
      @metadata = get_type_status(metadata)
    end
#get_ncbi_taxid ⇒ Object
Get NCBI Taxonomy ID.
    # File 'lib/miga/remote_dataset.rb', line 116
    def get_ncbi_taxid
      origin = (universe == :ncbi and db == :assembly) ? :web : universe
      send("get_ncbi_taxid_from_#{origin}")
    end
#get_ncbi_taxonomy ⇒ Object
Get NCBI taxonomy as MiGA::Taxonomy.
    # File 'lib/miga/remote_dataset.rb', line 136
    def get_ncbi_taxonomy
      tax_id = get_ncbi_taxid or return

      lineage = { ns: 'ncbi' }
      doc = MiGA::RemoteDataset.download(:ncbi, :taxonomy, tax_id, :xml)
      doc.scan(%r{<Taxon>(.*?)</Taxon>}m).map(&:first).each do |i|
        name = i.scan(%r{<ScientificName>(.*)</ScientificName>}).first.to_a.first
        rank = i.scan(%r{<Rank>(.*)</Rank>}).first.to_a.first
        rank = nil if rank == 'no rank' or rank.empty?
        rank = 'dataset' if lineage.empty? and rank.nil?
        lineage[rank] = name unless rank.nil? or rank.nil?
      end
      MiGA.DEBUG "Got lineage: #{lineage}"
      MiGA::Taxonomy.new(lineage)
    end
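To make the regex-based extraction in the listing above easier to follow, here is a simplified, self-contained variant that turns Taxon elements into a rank-to-name Hash. It is a sketch under assumptions, not the library's method: parse_lineage is a hypothetical helper, the sample XML is flat and hand-written (real records are nested and larger), and entries lacking a usable rank or name are simply skipped.

```ruby
# Hand-written, flattened sample in the style of a taxonomy XML record (assumed shape).
SAMPLE_XML = <<~XML
  <Taxon><ScientificName>Escherichia coli K-12</ScientificName><Rank>no rank</Rank></Taxon>
  <Taxon><ScientificName>Escherichia coli</ScientificName><Rank>species</Rank></Taxon>
  <Taxon><ScientificName>Escherichia</ScientificName><Rank>genus</Rank></Taxon>
XML

# Collect rank => scientific name pairs from each <Taxon> element.
def parse_lineage(xml)
  lineage = { ns: 'ncbi' }
  xml.scan(%r{<Taxon>(.*?)</Taxon>}m).map(&:first).each do |taxon|
    # String#[] with a capture index returns the captured text or nil.
    name = taxon[%r{<ScientificName>(.*)</ScientificName>}, 1]
    rank = taxon[%r{<Rank>(.*)</Rank>}, 1]
    next if name.nil? || rank.nil? || rank.empty? || rank == 'no rank'

    lineage[rank] = name
  end
  lineage
end

puts parse_lineage(SAMPLE_XML)
```

A regex scan is brittle compared with a real XML parser, but it avoids an extra dependency for documents with a predictable shape; a production parser would be the safer choice for anything less regular.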
#get_type_status(metadata) ⇒ Object
Get the type material status and return an (updated) metadata hash.
    # File 'lib/miga/remote_dataset.rb', line 124
    def get_type_status(metadata)
      if metadata[:ncbi_asm]
        get_type_status_ncbi_asm metadata
      elsif metadata[:ncbi_nuccore]
        get_type_status_ncbi_nuccore metadata
      else
        metadata
      end
    end
#ncbi_asm_json_doc ⇒ Object
Get the JSON document describing an NCBI assembly entry.
    # File 'lib/miga/remote_dataset.rb', line 174
    def ncbi_asm_json_doc
      return @_ncbi_asm_json_doc unless @_ncbi_asm_json_doc.nil?

      if db == :assembly && %i[ncbi gtdb].include?(universe)
        metadata[:ncbi_asm] ||= ids.first
      end
      return nil unless metadata[:ncbi_asm]

      ncbi_asm_id = self.class.ncbi_asm_acc2id metadata[:ncbi_asm]
      txt = nil
      3.times do
        txt = self.class.download(:ncbi_summary, :assembly, ncbi_asm_id, :json)
        txt.empty? ? sleep(1) : break
      end
      doc = MiGA::Json.parse(txt, symbolize: false, contents: true)
      return if doc.nil? || doc['result'].nil? || doc['result'].empty?

      @_ncbi_asm_json_doc = doc['result'][ doc['result']['uids'].first ]
    end
#save_to(project, name = nil, is_ref = true, metadata_def = {}) ⇒ Object
Save dataset to the MiGA::Project project identified with name. is_ref indicates if it should be a reference dataset, and metadata_def provides metadata to store with it. If metadata_def includes metadata_only: true, no input data is downloaded.
    # File 'lib/miga/remote_dataset.rb', line 65
    def save_to(project, name = nil, is_ref = true, metadata_def = {})
      name ||= ids.join('_').miga_name
      project = MiGA::Project.new(project) if project.is_a? String
      MiGA::Dataset.exist?(project, name) and
        raise "Dataset #{name} exists in the project, aborting..."
      @metadata = get_metadata(metadata_def)
      udb = @@UNIVERSE[universe][:dbs][db]
      @metadata["#{universe}_#{db}"] = ids.join(',')
      unless @metadata[:metadata_only]
        respond_to?("save_#{udb[:stage]}_to", true) or
          raise "Unexpected error: Unsupported stage #{udb[:stage]} for #{db}."
        send "save_#{udb[:stage]}_to", project, name, udb
      end
      dataset = MiGA::Dataset.new(project, name, is_ref, metadata)
      project.add_dataset(dataset.name)
      unless @metadata[:metadata_only]
        result = dataset.add_result(udb[:stage], true, is_clean: true)
        result.nil? and
          raise 'Empty dataset: seed result not added due to incomplete files.'
        result.clean!
        result.save
      end
      dataset
    end
#update_metadata(dataset, metadata = {}) ⇒ Object
Updates the MiGA::Dataset dataset with the remotely available metadata, and optionally the Hash metadata.
    # File 'lib/miga/remote_dataset.rb', line 93
    def update_metadata(dataset, metadata = {})
      metadata = get_metadata(metadata)
      metadata.each { |k, v| dataset.metadata[k] = v }
      dataset.save
    end