Class: Pollex::Scraper

Inherits: Object

Includes: Singleton

Defined in: lib/pollex/scraper.rb

Overview

Singleton object for scraping Pollex, caching the results, and extracting data.

Instance Attribute Summary

#verbose ⇒ Object

Returns the value of attribute verbose.

Instance Method Summary

#get(path, attr_infos) ⇒ Array<Symbol, String>

Gets arbitrary data from a page, with optional post-processing.

#get_all(klass, path, attr_infos, table_num = 0) ⇒ Array<klass>, ...

Gets all elements from a table within a page, with optional post-processing.

#initialize ⇒ Scraper (constructor)

Instantiates a cache of size 100 for storing scraped pages.

#open_with_cache(path) ⇒ Nokogiri::HTML::Document

Opens the given Pollex page, either by retrieving it from the cache or by requesting it, parsing the response with Nokogiri, and storing the parsed page in the cache.

Constructor Details

#initialize ⇒ `Scraper`

Instantiates a cache of size 100 for storing scraped pages.

# File 'lib/pollex/scraper.rb', line 9

def initialize()
  @cache = LRUCache.new(:max_size => 100, :default => nil)
  @verbose = false
end
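
As an illustration of what a size-bounded, least-recently-used cache does, here is a minimal sketch. It assumes nothing about the LRUCache class used above beyond the idea of a maximum size; TinyLRU is a hypothetical stand-in written for this example, not part of Pollex.

```ruby
# Hypothetical stand-in for an LRU cache; not the LRUCache class used by Pollex.
class TinyLRU
  def initialize(max_size)
    @max_size = max_size
    @store = {} # Ruby hashes preserve insertion order
  end

  # Read a value and mark the key as most recently used.
  def [](key)
    return nil unless @store.key?(key)
    value = @store.delete(key)
    @store[key] = value
  end

  # Write a value, evicting the least recently used key when full.
  def []=(key, value)
    @store.delete(key)
    @store.shift if @store.size >= @max_size
    @store[key] = value
  end

  def keys
    @store.keys
  end
end

cache = TinyLRU.new(2)
cache['/a'] = 'page a'
cache['/b'] = 'page b'
cache['/a']            # touch /a so /b becomes the eviction candidate
cache['/c'] = 'page c' # evicts /b
```

The keys here are page paths, which matches how the scraper indexes its cache by path.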

Instance Attribute Details

#verbose ⇒ `Object`

Returns the value of attribute verbose.



# File 'lib/pollex/scraper.rb', line 6

def verbose
  @verbose
end

Instance Method Details

#get(path, attr_infos) ⇒ `Array<Symbol, String>`

Gets arbitrary data from a page, with optional post-processing.

Examples:

Return information about the level of a given reconstruction

Scraper.instance.get(@reconstruction_path, [
  [:level_token, "table[1]/tr[2]/td/a/text()", lambda {|x| x.split(':')[0]}],
  [:level_path, "table[1]/tr[2]/td/a/@href"]
])

Parameters:

path (String) — relative path from http://pollex.org.nz

attr_infos (Array<Array<Symbol, String, (Proc, nil)>>) — an array that, for each element to be scraped, contains an array of:
- a key for the element
- the XPath to the element, relative to the div#content element of the page
- (optionally) a Proc to be applied to the element’s contents

Returns:

(Array<Symbol, String>) — key-value pairs, one per entry in attr_infos; the source below collects them in a Hash keyed by each entry's key

# File 'lib/pollex/scraper.rb', line 47

def get(path, attr_infos)
  page = open_with_cache(path)
  contents = page.css('#content')

  attrs = {}
  attr_infos.each do |name, xpath, post_processor|
    attrs[name] = ''
    if xpath
      attrs[name] = contents.at_xpath(xpath).to_s.strip
    end
    if post_processor
      attrs[name] = post_processor.call(attrs[name])
    end
  end
  attrs
end
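
To make the attr_infos contract concrete, the following sketch runs the same loop as #get against a stubbed lookup instead of a Nokogiri document. The helper extract_attrs, the fake_nodes hash, and the sample values are all hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper mirroring the loop in #get. `lookup` stands in for
# contents.at_xpath(xpath).to_s, so no network access or Nokogiri is needed.
def extract_attrs(lookup, attr_infos)
  attrs = {}
  attr_infos.each do |name, xpath, post_processor|
    # A missing XPath (or a missing node) yields an empty string.
    value = xpath ? lookup.fetch(xpath, '').to_s.strip : ''
    # The optional Proc receives the stripped string.
    value = post_processor.call(value) if post_processor
    attrs[name] = value
  end
  attrs
end

# Made-up node contents keyed by the XPaths from the example above.
fake_nodes = {
  'table[1]/tr[2]/td/a/text()' => ' XX: Example Level ',
  'table[1]/tr[2]/td/a/@href'  => '/level/xx/'
}

result = extract_attrs(fake_nodes, [
  [:level_token, 'table[1]/tr[2]/td/a/text()', lambda { |x| x.split(':')[0] }],
  [:level_path,  'table[1]/tr[2]/td/a/@href']
])
# result is a Hash: { :level_token => "XX", :level_path => "/level/xx/" }
```

Entries with only two elements work because the block's third parameter is nil when no Proc is supplied.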

#get_all(klass, path, attr_infos, table_num = 0) ⇒ `Array<klass>`, ...

Gets all elements from a table within a page, with optional post-processing. The results are returned as an array of objects if a klass is specified, or as an array of key-value pairs otherwise. If more than one page of results is found, the first page of results is returned as a PaginatedArray.

Examples:

Return an array of all SemanticFields in Pollex

Scraper.instance.get_all(SemanticField, "/category/", [
  [:id, 'td[1]/a/text()'],
  [:path, 'td[1]/a/@href'],
  [:name, 'td[2]/a/text()'],
  [:count, 'td[3]/text()']
])

Parameters:

klass (Class) — class of objects to be instantiated from each row; pass nil to get the raw key-value pairs instead

path (String) — relative path from http://pollex.org.nz

attr_infos (Array<Array<Symbol, String, (Proc, nil)>>) — an array that, for each element to be scraped, contains an array of:
- a key for the element
- the XPath to the element, relative to a row of the given table
- (optionally) a Proc to be applied to the element’s contents

table_num (Integer) (defaults to: 0) — the zero-based index of the table on the page to process; 0 selects the first table

Returns:

(Array<klass>) — if one page of results was found

(PaginatedArray<klass>) — if multiple pages of results were found

(Array<Array<Symbol, String>>) — if no klass is specified

# File 'lib/pollex/scraper.rb', line 87

def get_all(klass, path, attr_infos, table_num = 0)
  page = open_with_cache(path)

  rows = page.css('table')[table_num].css('tr')
  objs = rows[1..-1].map do |row|
    attrs = {}
    attr_infos.each do |name, xpath, post_processor|
      attrs[name] = ''
      if xpath
        attrs[name] = row.at_xpath(xpath).to_s.strip
      end
      if post_processor
        attrs[name] = post_processor.call(attrs[name])
      end
    end
    attrs
  end

  # check if there is a "next" page
  last_link = page.css('.pagination a').last()
  if last_link and last_link.text()[0..3] == 'Next'
    last_link_path = last_link.attributes()['href']
    new_path = path.split('?')[0] + last_link_path

    results = PaginatedArray.new()
    results.query = {:klass => klass, :attr_infos => attr_infos, :table_num => table_num}
    results.next_page = new_path
    results.concat(objs.to_a) # merge rather than create new array
  else
    results = objs
  end

  if klass
    results.map! {|x| klass.new(x) }
  end

  results
end
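
The next-page path in the source above is built by dropping any query string from the current path and appending the href of the "Next" link. The sketch below isolates that one step. It assumes the href is a query string such as ?page=2, which is a guess about the link format rather than something shown on this page, and next_page_path is a hypothetical helper.

```ruby
# Hypothetical helper isolating the path arithmetic from #get_all.
# to_s lets a non-String href value be passed as well.
def next_page_path(path, href)
  path.split('?')[0] + href.to_s
end

next_page_path('/category/', '?page=2')        # => "/category/?page=2"
next_page_path('/category/?page=2', '?page=3') # => "/category/?page=3"
```

Because the old query string is discarded first, following "Next" repeatedly does not accumulate page parameters.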

#open_with_cache(path) ⇒ `Nokogiri::HTML::Document`

Opens the given Pollex page, either by retrieving it from the cache or by requesting it, parsing the response with Nokogiri, and storing the parsed page in the cache.

Parameters:

path (String) —

relative path from http://pollex.org.nz

Returns:

(Nokogiri::HTML::Document) —

the requested page, parsed with Nokogiri

# File 'lib/pollex/scraper.rb', line 18

def open_with_cache(path)
  if @cache[path]
    if @verbose
      puts "Opening cached contents of http://pollex.org.nz#{path} ..."
    end
    @cache[path]
  else
    if @verbose
      puts "Connecting to http://pollex.org.nz#{path} ..."
    end
    page = Nokogiri::HTML(open("http://pollex.org.nz#{path}"))
    @cache[path] = page
    page
  end
end
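
The cache-or-fetch shape of this method can be tried without network access. In the sketch below, CachedFetcher is a hypothetical class in which an injected block replaces the HTTP request and the Nokogiri parse, and a plain Hash replaces the bounded cache; it is an illustration of the pattern, not the library's implementation.

```ruby
# Hypothetical class showing the cache-or-fetch shape of #open_with_cache.
# The fetcher block replaces the real HTTP request and Nokogiri parse.
class CachedFetcher
  attr_reader :fetch_count

  def initialize(&fetcher)
    @cache = {}       # plain Hash; the real class uses a bounded LRU cache
    @fetcher = fetcher
    @fetch_count = 0  # counts how often the block actually ran
  end

  def open_with_cache(path)
    return @cache[path] if @cache[path]
    @fetch_count += 1
    @cache[path] = @fetcher.call(path)
  end
end

fetcher = CachedFetcher.new { |path| "<html>#{path}</html>" }
fetcher.open_with_cache('/category/') # runs the block
fetcher.open_with_cache('/category/') # served from the cache
fetcher.fetch_count                   # => 1
```

Injecting the fetch step this way also makes the caching behaviour easy to check in a unit test.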
