Class: AnswersEngine::Scraper::Executor (Abstract)
- Inherits: Object
- Includes: Plugin::ContextExposer
- Defined in: lib/answersengine/scraper/executor.rb
Direct Known Subclasses
RubyFinisherExecutor, RubyParserExecutor, RubySeederExecutor
Constant Summary
- MAX_FIND_OUTPUTS_PER_PAGE = 500
  Max allowed page size when querying outputs (see #find_outputs).
Instance Attribute Summary
- #filename ⇒ Object
  Returns the value of attribute filename.
- #gid ⇒ Object
  Returns the value of attribute gid.
- #job_id ⇒ Object
  Returns the value of attribute job_id.
Instance Method Summary
- #clean_backtrace(backtrace) ⇒ Object
- #eval_with_context(file_path, context) ⇒ Object
  Evaluates a file with a custom binding.
- #exec_parser(save = false) ⇒ Object
- #find_output(collection = 'default', query = {}, opts = {}) ⇒ Hash|nil
  Find one output by collection and query.
- #find_outputs(collection = 'default', query = {}, page = 1, per_page = 100, opts = {}) ⇒ Array
  Find outputs by collection and query with pagination.
- #finisher_update(options = {}) ⇒ Object
- #get_content(gid) ⇒ Object
- #get_failed_content(gid) ⇒ Object
- #get_job_id(scraper_name, default = nil) ⇒ Object
  Get the current job id from the scraper, or the default when scraper_name is nil.
- #init_global_page ⇒ Object
- #init_job_page ⇒ Object
- #init_page ⇒ Object
- #parsing_update(options = {}) ⇒ Object
- #remove_old_dups!(list, key_defaults) ⇒ Integer
  Removes duplicates, keeping the latest occurrence of each.
- #remove_old_output_dups!(list) ⇒ Integer
  Removes output duplicates, keeping the latest occurrence of each.
- #remove_old_page_dups!(list) ⇒ Integer
  Removes page duplicates, keeping the latest occurrence of each.
- #save_outputs(outputs = []) ⇒ Object
  Saves outputs from an array and clears it.
- #save_pages(pages = []) ⇒ Object
  Saves pages from an array and clears it.
- #save_pages_and_outputs(pages = [], outputs = [], status) ⇒ Object
- #save_type ⇒ Object
- #seeding_update(options = {}) ⇒ Object
- #update_to_server(opts = {}) ⇒ Object
Methods included from Plugin::ContextExposer
#create_context, #expose_to, #exposed_env, #exposed_methods, exposed_methods, #isolated_binding, #var_or_proc
Instance Attribute Details
#filename ⇒ Object
Returns the value of attribute filename.
# File 'lib/answersengine/scraper/executor.rb', line 9
def filename
  @filename
end
#gid ⇒ Object
Returns the value of attribute gid.
# File 'lib/answersengine/scraper/executor.rb', line 9
def gid
  @gid
end
#job_id ⇒ Object
Returns the value of attribute job_id.
# File 'lib/answersengine/scraper/executor.rb', line 9
def job_id
  @job_id
end
Instance Method Details
#clean_backtrace(backtrace) ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 313
def clean_backtrace(backtrace)
  i = backtrace.index{|x| x =~ /gems\/answersengine/i}
  if i.to_i < 1
    return []
  else
    return backtrace[0..(i-1)]
  end
end
#eval_with_context(file_path, context) ⇒ Object
Evaluates a file with a custom binding. Using this method allows scripts to use `return` to exit early, along with some improved security.

# File 'lib/answersengine/scraper/executor.rb', line 353
def eval_with_context file_path, context
  eval(File.read(file_path), context, file_path)
end
#exec_parser(save = false) ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 13
def exec_parser(save=false)
  raise "should be implemented in subclass"
end
#find_output(collection = 'default', query = {}, opts = {}) ⇒ Hash|nil
The opts :job_id option is prioritized over :scraper_name when both exist. If neither is provided, or both are nil, the current job is used for the query (the default behavior).
Find one output by collection and query with pagination.
# File 'lib/answersengine/scraper/executor.rb', line 196
def find_output(collection='default', query={}, opts = {})
  result = find_outputs(collection, query, 1, 1, opts)
  result.respond_to?(:first) ? result.first : nil
end
#find_outputs(collection = 'default', query = {}, page = 1, per_page = 100, opts = {}) ⇒ Array
The opts :job_id option is prioritized over :scraper_name when both exist. If neither is provided, or both are nil, the current job is used for the query (the default behavior).
Find outputs by collection and query with pagination.
# File 'lib/answersengine/scraper/executor.rb', line 140
def find_outputs(collection='default', query={}, page=1, per_page=100, opts = {})
  # Validate parameters to guard against nil for easier usage.
  raise ArgumentError.new("collection needs to be a String") unless collection.is_a?(String)
  raise ArgumentError.new("query needs to be a Hash, instead of: #{query}") unless query.is_a?(Hash)
  unless page.is_a?(Integer) && page > 0
    raise ArgumentError.new("page needs to be an Integer greater than 0")
  end
  unless per_page.is_a?(Integer) && per_page > 0 && per_page <= MAX_FIND_OUTPUTS_PER_PAGE
    raise ArgumentError.new("per_page needs to be an Integer between 1 and #{MAX_FIND_OUTPUTS_PER_PAGE}")
  end

  options = { query: query, page: page, per_page: per_page}

  # Get job_id
  query_job_id = opts[:job_id] || get_job_id(opts[:scraper_name], self.job_id)
  client = Client::JobOutput.new(options)
  response = client.all(query_job_id, collection)

  if response.code != 200
    raise "response_code: #{response.code}|#{response.parsed_response}"
  end
  (response.body != 'null') ? response.parsed_response : []
end
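Since per_page is capped at MAX_FIND_OUTPUTS_PER_PAGE, collecting every output means walking pages until a short page appears. A minimal sketch of that loop, where `fetch_page` is a local stand-in for a real find_outputs call:

```ruby
# Drain every output in a collection by paging until a short page.
PER_PAGE = 3
DATA = (1..8).to_a  # pretend these are server-side outputs

# Stand-in for find_outputs(collection, query, page, per_page).
fetch_page = lambda do |page, per_page|
  DATA[(page - 1) * per_page, per_page] || []
end

all_outputs = []
page = 1
loop do
  batch = fetch_page.call(page, PER_PAGE)
  all_outputs.concat(batch)
  break if batch.size < PER_PAGE  # a short page means we reached the end
  page += 1
end
```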
#finisher_update(options = {}) ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 54
def finisher_update(options={})
  client = Client::Job.new(options)
  job_id = options.fetch(:job_id)
  client.finisher_update(job_id, options)
end
#get_content(gid) ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 66
def get_content(gid)
  client = Client::GlobalPage.new()
  content_json = client.find_content(gid)
  if content_json['available']
    signed_url = content_json['signed_url']
    Client::BackblazeContent.new.get_gunzipped_content(signed_url)
  else
    nil
  end
end
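The method name get_gunzipped_content suggests page content is stored gzip-compressed behind the signed URL and decompressed on retrieval. A local sketch of that round trip using Ruby's stdlib (this is illustrative, not the library's implementation):

```ruby
require 'zlib'

# Compress a page body the way a content store might hold it,
# then decompress it the way get_content would hand it back.
original   = '<html><body>Hello</body></html>'
compressed = Zlib.gzip(original)     # gzip-encoded bytes
restored   = Zlib.gunzip(compressed) # the page body again
```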
#get_failed_content(gid) ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 78
def get_failed_content(gid)
  client = Client::GlobalPage.new()
  content_json = client.find_failed_content(gid)
  if content_json['available']
    signed_url = content_json['signed_url']
    Client::BackblazeContent.new.get_gunzipped_content(signed_url)
  else
    nil
  end
end
#get_job_id(scraper_name, default = nil) ⇒ Object
Get the current job id from the scraper, or the default when scraper_name is nil.
# File 'lib/answersengine/scraper/executor.rb', line 97
def get_job_id scraper_name, default = nil
  return default if scraper_name.nil?
  job = Client::ScraperJob.new().find(scraper_name)
  raise JSON.pretty_generate(job) if job['id'].nil?
  job['id']
end
#init_global_page ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 61
def init_global_page()
  client = Client::GlobalPage.new()
  client.find(gid)
end
#init_job_page ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 28
def init_job_page()
  client = Client::JobPage.new()
  job_page = client.find(job_id, gid)
  unless job_page.code == 200
    raise "Job #{job_id} or GID #{gid} not found. Aborting execution!"
  else
    job_page
  end
end
#init_page ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 17
def init_page()
  if job_id
    puts "getting Job Page"
    init_job_page
  else
    puts "getting Global Page"
    init_global_page()
  end
end
#parsing_update(options = {}) ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 39
def parsing_update(options={})
  client = Client::JobPage.new(options)
  job_id = options.fetch(:job_id)
  gid = options.fetch(:gid)
  client.parsing_update(job_id, gid, options)
end
#remove_old_dups!(list, key_defaults) ⇒ Integer
Removes duplicates, keeping the latest occurrence of each.
# File 'lib/answersengine/scraper/executor.rb', line 208
def remove_old_dups!(list, key_defaults)
  raw_count = list.count
  keys = key_defaults.keys
  force_uniq = 0
  list.reverse!.uniq! do |item|
    # Extract stringify keys as hash
    key_hash = Hash[item.map{|k,v|keys.include?(k.to_s) ? [k.to_s,v] : nil}.select{|i|!i.nil?}]
    # Apply defaults for uniq validation
    key_defaults.each{|k,v| key_hash[k] = v if key_hash[k].nil?}
    # Don't dedup nil key defaults
    skip_dedup = !keys.find{|k| key_hash[k].nil?}.nil?
    skip_dedup ? (force_uniq += 1) : key_hash
  end
  list.reverse!
  dup_count = raw_count - list.count
  dup_count
end
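The core idea can be sketched in a few lines (a simplified single-key version written for this example, not the library's code): walk the list from newest to oldest, keep the first record seen per key, and never collapse records whose key is nil.

```ruby
# Keep the latest record for each key value; nil keys are never deduped.
def dedup_keep_latest(list, key)
  seen = {}
  kept = list.reverse.select do |item|
    k = item[key]
    next true if k.nil?      # nil keys always survive
    already = seen.key?(k)
    seen[k] = true
    !already                 # first seen in reverse order == latest
  end
  kept.reverse
end

records = [
  { '_id' => 1,   'v' => 'old' },
  { '_id' => nil, 'v' => 'keep' },
  { '_id' => 1,   'v' => 'new' },  # later dup wins over the first record
]
deduped = dedup_keep_latest(records, '_id')
```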
#remove_old_output_dups!(list) ⇒ Integer
Removes output duplicates, keeping the latest occurrence of each.
# File 'lib/answersengine/scraper/executor.rb', line 248
def remove_old_output_dups!(list)
  key_defaults = {
    '_id' => nil,
    '_collection' => 'default'
  }
  remove_old_dups! list, key_defaults
end
#remove_old_page_dups!(list) ⇒ Integer
Currently this does not dedup, since the gid is hard to build. TODO: build the gid so pages can be deduped. Removes page duplicates, keeping the latest occurrence of each.
# File 'lib/answersengine/scraper/executor.rb', line 236
def remove_old_page_dups!(list)
  key_defaults = {
    'gid' => nil
  }
  remove_old_dups! list, key_defaults
end
#save_outputs(outputs = []) ⇒ Object
IMPORTANT: outputs array’s elements will be removed.
Saves outputs from an array and clears it.
# File 'lib/answersengine/scraper/executor.rb', line 342
def save_outputs(outputs=[])
  save_pages_and_outputs([], outputs, save_type)
end
#save_pages(pages = []) ⇒ Object
IMPORTANT: pages array’s elements will be removed.
Saves pages from an array and clears it.
# File 'lib/answersengine/scraper/executor.rb', line 332
def save_pages(pages=[])
  save_pages_and_outputs(pages, [], save_type)
end
#save_pages_and_outputs(pages = [], outputs = [], status) ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 256
def save_pages_and_outputs(pages = [], outputs = [], status)
  total_pages = pages.count
  total_outputs = outputs.count
  records_per_slice = 100
  until pages.empty? && outputs.empty?
    pages_slice = pages.shift(records_per_slice)
    pages_dup_count = remove_old_page_dups! pages_slice
    outputs_slice = outputs.shift(records_per_slice)
    outputs_dup_count = remove_old_output_dups! outputs_slice

    log_msgs = []
    unless pages_slice.empty?
      page_dups_ignored = pages_dup_count > 0 ? " (#{pages_dup_count} dups ignored)" : ''
      log_msgs << "#{pages_slice.count} out of #{total_pages} Pages#{page_dups_ignored}"
      unless save
        puts '----------------------------------------'
        puts "Would have saved #{log_msgs.last}#{page_dups_ignored}"
        puts JSON.pretty_generate pages_slice
      end
    end

    unless outputs_slice.empty?
      output_dups_ignored = outputs_dup_count > 0 ? " (#{outputs_dup_count} dups ignored)" : ''
      log_msgs << "#{outputs_slice.count} out of #{total_outputs} Outputs#{output_dups_ignored}"
      unless save
        puts '----------------------------------------'
        puts "Would have saved #{log_msgs.last}#{output_dups_ignored}"
        puts JSON.pretty_generate outputs_slice
      end
    end
    next unless save

    log_msg = "Saving #{log_msgs.join(' and ')}."
    puts "#{log_msg}"

    # saving to server
    response = update_to_server(
      job_id: job_id,
      gid: gid,
      pages: pages_slice,
      outputs: outputs_slice,
      status: status)

    if response.code == 200
      log_msg = "Saved."
      puts "#{log_msg}"
    else
      puts "Error: Unable to save Pages and/or Outputs to server: #{response.body}"
      raise "Unable to save Pages and/or Outputs to server: #{response.body}"
    end
  end
end
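This is why the save_pages and save_outputs notes warn that the input array's elements are removed: `Array#shift(n)` destructively takes up to n elements per iteration, so the source array is emptied as it is saved in fixed-size slices. A minimal sketch of that draining pattern (illustrative data, not the library's code):

```ruby
# Drain an array in fixed-size slices; the source is mutated as we go.
records_per_slice = 100
pages = (1..250).to_a
saved_batches = []

until pages.empty?
  slice = pages.shift(records_per_slice)  # removes the batch from pages
  saved_batches << slice.size
end
```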
#save_type ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 322
def save_type
  raise NotImplementedError.new('Need to implement "save_type" method.')
end
#seeding_update(options = {}) ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 47
def seeding_update(options={})
  client = Client::Job.new(options)
  job_id = options.fetch(:job_id)
  client.seeding_update(job_id, options)
end
#update_to_server(opts = {}) ⇒ Object
# File 'lib/answersengine/scraper/executor.rb', line 309
def update_to_server(opts = {})
  raise "Implemented in Subclass"
end