Class: AnswersEngine::Scraper::Executor Abstract

Inherits:
Object
Includes:
Plugin::ContextExposer
Defined in:
lib/answersengine/scraper/executor.rb

Overview

This class is abstract.

Constant Summary collapse

MAX_FIND_OUTPUTS_PER_PAGE =

Maximum allowed page size when querying outputs (see #find_outputs).

500

Instance Attribute Summary collapse

Instance Method Summary collapse

Methods included from Plugin::ContextExposer

#create_context, #expose_to, #exposed_env, #exposed_methods, exposed_methods, #isolated_binding, #var_or_proc

Instance Attribute Details

#filenameObject

Returns the value of attribute filename.



# File 'lib/answersengine/scraper/executor.rb', line 9

def filename
  @filename
end

#gidObject

Returns the value of attribute gid.



# File 'lib/answersengine/scraper/executor.rb', line 9

def gid
  @gid
end

#job_idObject

Returns the value of attribute job_id.



# File 'lib/answersengine/scraper/executor.rb', line 9

def job_id
  @job_id
end

Instance Method Details

#clean_backtrace(backtrace) ⇒ Object



# File 'lib/answersengine/scraper/executor.rb', line 313

def clean_backtrace(backtrace)
  i = backtrace.index{|x| x =~ /gems\/answersengine/i}
  if i.to_i < 1
    return []
  else
    return backtrace[0..(i-1)]
  end
end
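The trimming logic above can be seen in a small standalone run (the frames below are hypothetical): everything from the first `gems/answersengine` frame down is dropped, so only the scraper script's own frames remain.

```ruby
# Standalone copy of the trimming logic, fed hypothetical backtrace frames.
def clean_backtrace(backtrace)
  i = backtrace.index { |x| x =~ /gems\/answersengine/i }
  i.to_i < 1 ? [] : backtrace[0..(i - 1)]
end

trace = [
  "my_parser.rb:10:in `parse'",                               # user code (kept)
  "/gems/answersengine-0.1/scraper/executor.rb:130:in `eval'" # gem internals (dropped)
]
clean_backtrace(trace) # => ["my_parser.rb:10:in `parse'"]
```

Note that if the very first frame is already inside the gem, the whole backtrace is discarded.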

#eval_with_context(file_path, context) ⇒ Object

Note:

Using this method allows scripts to contain return to exit the script early, along with some improved security.

Evaluates a file with a custom binding.

Parameters:

  • file_path (String)

    File path to read.

  • context (Binding)

    Context binding to evaluate with.



# File 'lib/answersengine/scraper/executor.rb', line 353

def eval_with_context file_path, context
  eval(File.read(file_path), context, file_path)
end
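A minimal sketch of what the custom binding buys you: locals visible in the caller's binding are visible to the eval'd file. The temp-file script here is purely illustrative.

```ruby
require 'tempfile'

def eval_with_context(file_path, context)
  eval(File.read(file_path), context, file_path)
end

# Hypothetical script file that references a local from the caller's binding.
script = Tempfile.new(['parser', '.rb'])
script.write("greeting + ' world'")
script.close

greeting = 'hello'
result = eval_with_context(script.path, binding)
# result == "hello world"
```

Passing `file_path` as eval's third argument makes backtraces point at the script file rather than `(eval)`.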

#exec_parser(save = false) ⇒ Object



# File 'lib/answersengine/scraper/executor.rb', line 13

def exec_parser(save=false)
  raise "should be implemented in subclass"
end

#find_output(collection = 'default', query = {}, opts = {}) ⇒ Hash|nil

Note:

The :job_id option is prioritized over :scraper_name when both exist. If neither is provided, or both are nil, then the current job is used for the query instead; this is the default behavior.

Find one output by collection and query with pagination.

Examples:

find_output
find_output 'my_collection'
find_output 'my_collection', {}

Find from another scraper by name

find_output 'my_collection', {}, scraper_name: 'my_scraper'

Find from another scraper by job_id

find_output 'my_collection', {}, job_id: 123

Parameters:

  • collection (String) (defaults to: 'default')

    (‘default’) Collection name.

  • query (Hash) (defaults to: {})

    ({}) Filters to query.

  • opts (Hash) (defaults to: {})

    ({}) Configuration options.

Options Hash (opts):

  • :scraper_name (String|nil) — default: nil

    Scraper name to query from.

  • :job_id (Integer|nil) — default: nil

    Job’s id to query from.

Returns:

  • (Hash|nil)

    Hash when found, and nil when no output is found.

Raises:

  • (ArgumentError)

collection is not a String.

  • (ArgumentError)

    query is not a Hash.



# File 'lib/answersengine/scraper/executor.rb', line 196

def find_output(collection='default', query={}, opts = {})
  result = find_outputs(collection, query, 1, 1, opts)
  result.respond_to?(:first) ? result.first : nil
end

#find_outputs(collection = 'default', query = {}, page = 1, per_page = 100, opts = {}) ⇒ Array

Note:

The :job_id option is prioritized over :scraper_name when both exist. If neither is provided, or both are nil, then the current job is used for the query instead; this is the default behavior.

Find outputs by collection and query with pagination.

Examples:

find_outputs
find_outputs 'my_collection'
find_outputs 'my_collection', {}
find_outputs 'my_collection', {}, 1
find_outputs 'my_collection', {}, 1, 100

Find from another scraper by name

find_outputs 'my_collection', {}, 1, 100, scraper_name: 'my_scraper'

Find from another scraper by job_id

find_outputs 'my_collection', {}, 1, 100, job_id: 123

Parameters:

  • collection (String) (defaults to: 'default')

    (‘default’) Collection name.

  • query (Hash) (defaults to: {})

    ({}) Filters to query.

  • page (Integer) (defaults to: 1)

    (1) Page number.

  • per_page (Integer) (defaults to: 100)

    (100) Page size.

  • opts (Hash) (defaults to: {})

    ({}) Configuration options.

Options Hash (opts):

  • :scraper_name (String|nil) — default: nil

    Scraper name to query from.

  • :job_id (Integer|nil) — default: nil

    Job’s id to query from.

Returns:

  • (Array)

Raises:

  • (ArgumentError)

collection is not a String.

  • (ArgumentError)

    query is not a Hash.

  • (ArgumentError)

    page is not an Integer greater than 0.

  • (ArgumentError)

    per_page is not an Integer between 1 and 500.



# File 'lib/answersengine/scraper/executor.rb', line 140

def find_outputs(collection='default', query={}, page=1, per_page=100, opts = {})
  # Validate parameters out from nil for easier user usage.
  raise ArgumentError.new("collection needs to be a String") unless collection.is_a?(String)
  raise ArgumentError.new("query needs to be a Hash, instead of: #{query}") unless query.is_a?(Hash)
  unless page.is_a?(Integer) && page > 0
    raise ArgumentError.new("page needs to be an Integer greater than 0")
  end
  unless per_page.is_a?(Integer) && per_page > 0 && per_page <= MAX_FIND_OUTPUTS_PER_PAGE
    raise ArgumentError.new("per_page needs to be an Integer between 1 and #{MAX_FIND_OUTPUTS_PER_PAGE}")
  end

  options = {
    query: query,
    page: page,
    per_page: per_page}

  # Get job_id
  query_job_id = opts[:job_id] || get_job_id(opts[:scraper_name], self.job_id)

  client = Client::JobOutput.new(options)
  response = client.all(query_job_id, collection)

  if response.code != 200
    raise "response_code: #{response.code}|#{response.parsed_response}"
  end
  (response.body != 'null') ? response.parsed_response : []
end
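The page/per_page contract can be sketched in isolation. The stub below stands in for the real method (which queries the AnswersEngine API) and shows the usual drain-all-pages loop; the data and names are illustrative only.

```ruby
# Hypothetical dataset standing in for a job's stored outputs.
ALL_OUTPUTS = (1..12).map { |i| { '_id' => i, '_collection' => 'default' } }

# Stub with the same signature and paging behavior as find_outputs.
def find_outputs(collection = 'default', query = {}, page = 1, per_page = 100, opts = {})
  ALL_OUTPUTS.each_slice(per_page).to_a[page - 1] || []
end

# Typical pagination loop: request pages until an empty batch comes back.
results = []
page = 1
loop do
  batch = find_outputs('default', {}, page, 5)
  break if batch.empty?
  results.concat(batch)
  page += 1
end
# results now holds all 12 outputs
```

With the real method, keep per_page at or below MAX_FIND_OUTPUTS_PER_PAGE (500), or an ArgumentError is raised.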

#finisher_update(options = {}) ⇒ Object



# File 'lib/answersengine/scraper/executor.rb', line 54

def finisher_update(options={})
  client = Client::Job.new()
  job_id = options.fetch(:job_id)

  client.finisher_update(job_id, options)
end

#get_content(gid) ⇒ Object



# File 'lib/answersengine/scraper/executor.rb', line 66

def get_content(gid)
  client = Client::GlobalPage.new()
  content_json = client.find_content(gid)

  if content_json['available']
    signed_url = content_json['signed_url']
    Client::BackblazeContent.new.get_gunzipped_content(signed_url)
  else
    nil
  end
end

#get_failed_content(gid) ⇒ Object



# File 'lib/answersengine/scraper/executor.rb', line 78

def get_failed_content(gid)
  client = Client::GlobalPage.new()
  content_json = client.find_failed_content(gid)

  if content_json['available']
    signed_url = content_json['signed_url']
    Client::BackblazeContent.new.get_gunzipped_content(signed_url)
  else
    nil
  end
end

#get_job_id(scraper_name, default = nil) ⇒ Object

Gets the current job id from the scraper, or the default when scraper_name is nil.

Parameters:

  • scraper_name (String|nil)

    Scraper name.

  • default (Integer|nil) (defaults to: nil)

    (nil) Default job id when no scraper name.

Raises:

  • (Exception)

When scraper_name is not nil and the scraper doesn't exist or has no current job.



# File 'lib/answersengine/scraper/executor.rb', line 97

def get_job_id scraper_name, default = nil
  return default if scraper_name.nil?
  job = Client::ScraperJob.new().find(scraper_name)
  raise JSON.pretty_generate(job) if job['id'].nil?
  job['id']
end

#init_global_pageObject



# File 'lib/answersengine/scraper/executor.rb', line 61

def init_global_page()
  client = Client::GlobalPage.new()
  client.find(gid)
end

#init_job_pageObject



# File 'lib/answersengine/scraper/executor.rb', line 28

def init_job_page()
  client = Client::JobPage.new()
  job_page = client.find(job_id, gid)
  unless job_page.code == 200
    raise "Job #{job_id} or GID #{gid} not found. Aborting execution!"
  else
    job_page
  end

end

#init_pageObject



# File 'lib/answersengine/scraper/executor.rb', line 17

def init_page()
  if job_id
    puts "getting Job Page"
    init_job_page
  else
    puts "getting Global Page"
    init_global_page()
  end

end

#parsing_update(options = {}) ⇒ Object



# File 'lib/answersengine/scraper/executor.rb', line 39

def parsing_update(options={})
  client = Client::JobPage.new()
  job_id = options.fetch(:job_id)
  gid = options.fetch(:gid)

  client.parsing_update(job_id, gid, options)
end

#remove_old_dups!(list, key_defaults) ⇒ Integer

Removes duplicates, prioritizing the latest duplicate.

Parameters:

  • list (Array)

    List of hashes to dedup.

  • key_defaults (Hash)

    Key and default value pair hash to use on uniq validation.

Returns:

  • (Integer)

    Removed duplicated items count.



# File 'lib/answersengine/scraper/executor.rb', line 208

def remove_old_dups!(list, key_defaults)
  raw_count = list.count
  keys = key_defaults.keys
  force_uniq = 0
  list.reverse!.uniq! do |item|
    # Extract stringify keys as hash
    key_hash = Hash[item.map{|k,v|keys.include?(k.to_s) ? [k.to_s,v] : nil}.select{|i|!i.nil?}]

    # Apply defaults for uniq validation
    key_defaults.each{|k,v| key_hash[k] = v if key_hash[k].nil?}

    # Don't dedup nil key defaults
    skip_dedup = !keys.find{|k| key_hash[k].nil?}.nil?
    skip_dedup ? (force_uniq += 1) : key_hash
  end
  list.reverse!
  dup_count = raw_count - list.count
  dup_count
end
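The dedup behavior can be seen with a small standalone run (the method body is copied from above; the sample rows are hypothetical). Reversing before uniq! means the latest occurrence of each key wins, and rows whose key resolves to nil are never deduped.

```ruby
# Copy of remove_old_dups! from above, for a self-contained demonstration.
def remove_old_dups!(list, key_defaults)
  raw_count = list.count
  keys = key_defaults.keys
  force_uniq = 0
  list.reverse!.uniq! do |item|
    key_hash = Hash[item.map { |k, v| keys.include?(k.to_s) ? [k.to_s, v] : nil }.select { |i| !i.nil? }]
    key_defaults.each { |k, v| key_hash[k] = v if key_hash[k].nil? }
    skip_dedup = !keys.find { |k| key_hash[k].nil? }.nil?
    skip_dedup ? (force_uniq += 1) : key_hash
  end
  list.reverse!
  raw_count - list.count
end

rows = [
  { 'id' => 1, 'val' => 'old' },
  { 'id' => 2, 'val' => 'only' },
  { 'id' => 1, 'val' => 'new' }  # same id as the first row, so it wins
]
removed = remove_old_dups!(rows, 'id' => nil)
# removed == 1; rows == [{'id'=>2,'val'=>'only'}, {'id'=>1,'val'=>'new'}]
```

The `force_uniq` counter acts as an always-unique key, which is how nil-keyed rows escape deduplication.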

#remove_old_output_dups!(list) ⇒ Integer

Removes duplicates, prioritizing the latest duplicate.

Parameters:

  • list (Array)

    List of outputs to dedup.

Returns:

  • (Integer)

    Removed duplicated items count.



# File 'lib/answersengine/scraper/executor.rb', line 248

def remove_old_output_dups!(list)
  key_defaults = {
    '_id' => nil,
    '_collection' => 'default'
  }
  remove_old_dups! list, key_defaults
end

#remove_old_page_dups!(list) ⇒ Integer

Note:

It will not dedup for now, as it is hard to build the gid. TODO: build the gid so we can dedup.

Removes page duplicates, prioritizing the latest duplicate.

Parameters:

  • list (Array)

    List of pages to dedup.

Returns:

  • (Integer)

    Removed duplicated items count.



# File 'lib/answersengine/scraper/executor.rb', line 236

def remove_old_page_dups!(list)
  key_defaults = {
    'gid' => nil
  }
  remove_old_dups! list, key_defaults
end

#save_outputs(outputs = []) ⇒ Object

Note:

IMPORTANT: outputs array’s elements will be removed.

Saves outputs from an array and clears it.

Parameters:

  • outputs (Array) (defaults to: [])

    ([]) Output array to save. Warning: all elements will be removed from the array.



# File 'lib/answersengine/scraper/executor.rb', line 342

def save_outputs(outputs=[])
  save_pages_and_outputs([], outputs, save_type)
end

#save_pages(pages = []) ⇒ Object

Note:

IMPORTANT: pages array’s elements will be removed.

Saves pages from an array and clears it.

Parameters:

  • pages (Array) (defaults to: [])

    ([]) Page array to save. Warning: all elements will be removed from the array.



# File 'lib/answersengine/scraper/executor.rb', line 332

def save_pages(pages=[])
  save_pages_and_outputs(pages, [], save_type)
end

#save_pages_and_outputs(pages = [], outputs = [], status) ⇒ Object



# File 'lib/answersengine/scraper/executor.rb', line 256

def save_pages_and_outputs(pages = [], outputs = [], status)
  total_pages = pages.count
  total_outputs = outputs.count
  records_per_slice = 100
  until pages.empty? && outputs.empty?
    pages_slice = pages.shift(records_per_slice)
    pages_dup_count = remove_old_page_dups! pages_slice
    outputs_slice = outputs.shift(records_per_slice)
    outputs_dup_count = remove_old_output_dups! outputs_slice

    log_msgs = []
    unless pages_slice.empty?
      page_dups_ignored = pages_dup_count > 0 ? " (#{pages_dup_count} dups ignored)" : ''
      log_msgs << "#{pages_slice.count} out of #{total_pages} Pages#{page_dups_ignored}"
      unless save
        puts '----------------------------------------'
        puts "Would have saved #{log_msgs.last}#{page_dups_ignored}"
        puts JSON.pretty_generate pages_slice
      end
    end

    unless outputs_slice.empty?
      output_dups_ignored = outputs_dup_count > 0 ? " (#{outputs_dup_count} dups ignored)" : ''
      log_msgs << "#{outputs_slice.count} out of #{total_outputs} Outputs#{output_dups_ignored}"
      unless save
        puts '----------------------------------------'
        puts "Would have saved #{log_msgs.last}#{output_dups_ignored}"
        puts JSON.pretty_generate outputs_slice
      end
    end

    next unless save
    log_msg = "Saving #{log_msgs.join(' and ')}."
    puts "#{log_msg}"

    # saving to server
    response = update_to_server(
      job_id: job_id,
      gid: gid,
      pages: pages_slice,
      outputs: outputs_slice,
      status: status)

    if response.code == 200
      log_msg = "Saved."
      puts "#{log_msg}"
    else
      puts "Error: Unable to save Pages and/or Outputs to server: #{response.body}"
      raise "Unable to save Pages and/or Outputs to server: #{response.body}"
    end
  end
end
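The batching mechanism above drains the caller's arrays via Array#shift, which is why the IMPORTANT notes on #save_pages and #save_outputs warn that all elements are removed. The draining pattern in isolation:

```ruby
# Draining an array in fixed-size slices, as save_pages_and_outputs does.
records = (1..250).to_a
records_per_slice = 100

slices = []
slices << records.shift(records_per_slice) until records.empty?

slices.map(&:size) # => [100, 100, 50]
records            # => [] (the source array has been emptied)
```

Shifting slices keeps memory bounded per request and lets the loop terminate naturally once both arrays are empty.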

#save_typeObject

Raises:

  • (NotImplementedError)


# File 'lib/answersengine/scraper/executor.rb', line 322

def save_type
  raise NotImplementedError.new('Need to implement "save_type" method.')
end

#seeding_update(options = {}) ⇒ Object



# File 'lib/answersengine/scraper/executor.rb', line 47

def seeding_update(options={})
  client = Client::Job.new()
  job_id = options.fetch(:job_id)

  client.seeding_update(job_id, options)
end

#update_to_server(opts = {}) ⇒ Object



# File 'lib/answersengine/scraper/executor.rb', line 309

def update_to_server(opts = {})
  raise "Implemented in Subclass"
end