Class: Scrubyt::Extractor

Inherits:: Object

Object
Scrubyt::Extractor

Includes:: FetchAction

Defined in:: lib/scrubyt/core/shared/extractor.rb

Overview

`Driving the whole extraction process`

Extractor is a performer class: it receives an extractor definition, carries out its actions, and evaluates the wrappers sequentially.

The navigation actions were originally part of this class as well, but once the class grew too large they were factored out into their own module, NavigationActions.

Instance Attribute Summary

#evaluating_extractor_definition ⇒ Object
#mode ⇒ Object
#next_page_pattern ⇒ Object
#result ⇒ Object
#root_patterns ⇒ Object

Class Method Summary

.define(mode = nil, &extractor_definition) ⇒ Object

The definition of the extractor is passed through this method.
.load(filename) ⇒ Object
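
As a rough illustration of how a definition block might travel through a class-level `define` method, the sketch below captures the block with `&`, hands it to the constructor as a Proc, and returns the result. `SketchExtractor` and everything inside it are assumptions made for illustration; the actual body of `Scrubyt::Extractor.define` is not shown on this page, and only the parameter shape is taken from the signature above.

```ruby
# Hypothetical class; only the block-passing shape mirrors the documented signature.
class SketchExtractor
  attr_reader :mode, :result

  # Same parameter shape as the documented `.define(mode = nil, &extractor_definition)`.
  def self.define(mode = nil, &extractor_definition)
    raise ArgumentError, 'an extractor definition block is required' unless extractor_definition

    # Assumption: the caller is interested in the result, not the instance.
    new(mode, extractor_definition).result
  end

  def initialize(mode, extractor_definition)
    @mode = mode
    # The block arrives here as an ordinary Proc object; this sketch simply calls it.
    @result = extractor_definition.call
  end
end

result = SketchExtractor.define(:production) { [:title, :price] }
puts result.inspect   # prints [:title, :price]
```

Passing the block on as a Proc, rather than yielding to it directly, lets the constructor decide later in which context the definition is evaluated.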

Instance Method Summary

#add_to_next_page_list(result_node) ⇒ Object
#evaluate_extractor ⇒ Object
#get_current_doc_url ⇒ Object
#get_detail_pattern_relations ⇒ Object
#get_hpricot_doc ⇒ Object
#get_mode ⇒ Object
#get_original_host_name ⇒ Object
#initialize(mode, extractor_definition) ⇒ Extractor constructor

A new instance of Extractor.

Methods included from FetchAction

extractor, extractor=, get_current_doc_url, #get_host_name, get_hpricot_doc, get_mechanize_doc, #restore_host_name, #restore_page, #store_host_name, #store_page

Constructor Details

#initialize(mode, extractor_definition) ⇒ `Extractor`

Returns a new instance of Extractor.

# File 'lib/scrubyt/core/shared/extractor.rb', line 40

def initialize(mode, extractor_definition)
  @mode = mode
  @root_patterns = []
  @next_page_pattern = nil
  #      @hpricot_doc = nil
  #      @hpricot_doc_url = nil
  @evaluating_extractor_definition = false
  @next_page_list = []
  @processed_pages = []
  
  backtrace = SharedUtils.get_backtrace
  parts = backtrace[1].split(':')
  source_file = parts[0]
  
  Scrubyt.log :MODE, mode == :production ? 'Production' : 'Learning'
  
  @evaluating_extractor_definition = true
  context = Object.new
  context.extend NavigationActions
  context.instance_eval do
    def extractor=(value)
      @extractor = value
    end
    
    def next_page(*args)
      @extractor.next_page_pattern = Scrubyt::Pattern.new('next_page', args, @extractor)
    end
    
    def method_missing(method_name, *args, &block)
      root_pattern = Scrubyt::Pattern.new(method_name.to_s, args, @extractor, nil, &block)
      @extractor.root_patterns << root_pattern
      root_pattern
    end
  end
  FetchAction.extractor = self
  context.extractor = self
  context.instance_eval(&extractor_definition)
  @evaluating_extractor_definition = false
  
  if @root_patterns.empty?
    # TODO: this should be an exception
    Scrubyt.log :ERROR, 'No extractor defined, exiting...'
    exit
  end
  
  #Once all is set up, evaluate the extractor from the root pattern!
  root_results = evaluate_extractor
  FetchAction.close_firefox if @mode.is_a?(Hash) && @mode[:close]

  
  @result = ScrubytResult.new('root')
  @result.push(*@root_results)
  @result.root_patterns = @root_patterns
  @result.source_file = source_file
  @result.source_proc = extractor_definition
  
  #Return the root pattern
  Scrubyt.log :INFO, 'Extraction finished succesfully!'
end
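
The constructor above builds a throwaway context object, defines singleton methods on it, and relies on `method_missing` so that every otherwise unknown call inside the definition block becomes a root pattern. The following is a minimal, self-contained sketch of that technique only. `MiniPattern` and `MiniExtractor` are hypothetical stand-ins that are not part of the library, and no fetching, logging, or evaluation is performed.

```ruby
# Hypothetical stand-in for Scrubyt::Pattern: it only records a name and arguments.
MiniPattern = Struct.new(:name, :args)

# Hypothetical stand-in for Scrubyt::Extractor, reduced to the DSL-collection step.
class MiniExtractor
  attr_accessor :root_patterns, :next_page_pattern

  def initialize(&definition)
    @root_patterns = []
    @next_page_pattern = nil

    context = Object.new
    # `def` inside instance_eval defines singleton methods on `context` only.
    context.instance_eval do
      def extractor=(value)
        @extractor = value
      end

      # An explicitly defined method is found before method_missing is consulted.
      def next_page(*args)
        @extractor.next_page_pattern = MiniPattern.new('next_page', args)
      end

      # Any other call inside the definition block is recorded as a root pattern.
      def method_missing(method_name, *args, &block)
        pattern = MiniPattern.new(method_name.to_s, args)
        @extractor.root_patterns << pattern
        pattern
      end
    end

    context.extractor = self
    # Evaluate the user's block with `context` as self.
    context.instance_eval(&definition)
  end
end

extractor = MiniExtractor.new do
  title '//h1'
  price '//span[@class="price"]'
  next_page 'Next', :limit => 2
end

puts extractor.root_patterns.map(&:name).inspect   # prints ["title", "price"]
puts extractor.next_page_pattern.name               # prints next_page
```

Because the catch-all lives on a dedicated context object rather than on the extractor itself, a misspelled method elsewhere in the extractor still raises `NoMethodError` instead of silently turning into a pattern.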