Class: Pollex::Scraper

Inherits:
Object
Includes:
Singleton
Defined in:
lib/pollex/scraper.rb

Overview

Singleton object for scraping Pollex, caching the results, and extracting data.

Constructor Details

#initialize ⇒ Scraper

Instantiates a cache of size 100 for storing scraped pages.



# File 'lib/pollex/scraper.rb', line 9

def initialize()
  @cache = LRUCache.new(:max_size => 100, :default => nil)
  @verbose = false
end
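
To illustrate what a size-bounded cache like this does once it fills up, here is a minimal sketch of least-recently-used eviction. TinyLRU is a hypothetical stand-in written for this example; it is not the LRUCache class used above, and that class may behave differently in detail.

```ruby
# Hypothetical stand-in for illustration only; not the LRUCache used by Scraper.
class TinyLRU
  def initialize(max_size)
    @max_size = max_size
    @store = {} # Ruby hashes preserve insertion order, oldest first
  end

  # Reading a key marks it as most recently used by re-inserting it.
  def [](key)
    return nil unless @store.key?(key)
    value = @store.delete(key)
    @store[key] = value
  end

  # Writing a key evicts the least recently used entry once the cache is over capacity.
  def []=(key, value)
    @store.delete(key)
    @store[key] = value
    @store.delete(@store.keys.first) if @store.size > @max_size
  end

  def keys
    @store.keys
  end
end

cache = TinyLRU.new(2)
cache['/a'] = 'page a'
cache['/b'] = 'page b'
cache['/a']            # touch /a so that /b becomes the oldest entry
cache['/c'] = 'page c' # over capacity: /b is evicted
```

With a maximum size of 100, the same idea means that the 101st distinct page pushes out whichever cached page was used longest ago.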

Instance Attribute Details

#verbose ⇒ Object

Returns the value of attribute verbose.



# File 'lib/pollex/scraper.rb', line 6

def verbose
  @verbose
end

Instance Method Details

#get(path, attr_infos) ⇒ Array<Symbol, String>

Gets arbitrary data from a page, with optional post-processing.

Examples:

Return information about the level of a given reconstruction

Scraper.instance.get(@reconstruction_path, [
  [:level_token, "table[1]/tr[2]/td/a/text()", lambda {|x| x.split(':')[0]}],
  [:level_path, "table[1]/tr[2]/td/a/@href"]
])

Parameters:

  • path (String)

    relative path from http://pollex.org.nz

  • attr_infos (Array<Array<Symbol, String, (Proc, nil)>>)

    an array that, for each element to be scraped, contains an array of:

    • a key for the element

    • the XPath to the element, relative to the page's div#content element

    • (optionally) a Proc to be performed on the element’s contents

Returns:

  • (Array<Symbol, String>)

    key-value pairs, one per entry in attr_infos; the source below collects them in a Hash keyed by the given symbols



# File 'lib/pollex/scraper.rb', line 47

def get(path, attr_infos)
  page = open_with_cache(path)
  contents = page.css('#content')

  attrs = {}
  attr_infos.each do |name, xpath, post_processor|
    attrs[name] = ''
    if xpath
      attrs[name] = contents.at_xpath(xpath).to_s.strip
    end
    if post_processor
      attrs[name] = post_processor.call(attrs[name])
    end
  end
  attrs
end
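
The loop above can be tried without a network connection. The following is a minimal sketch of how each attr_infos entry is unpacked and post-processed. FAKE_PAGE and extract are hypothetical names introduced for this example, the hash lookup stands in for contents.at_xpath(xpath).to_s, and the page values are made up.

```ruby
# Hypothetical stand-in for the parsed page: maps an XPath string to raw text.
# The values are invented for this example.
FAKE_PAGE = {
  'table[1]/tr[2]/td/a/text()' => ' XX: Example Level ',
  'table[1]/tr[2]/td/a/@href'  => '/level/xx/'
}.freeze

# Same shape as the loop in #get, with the lookup swapped for FAKE_PAGE.
def extract(attr_infos)
  attrs = {}
  attr_infos.each do |name, xpath, post_processor|
    # A two-element entry leaves post_processor as nil, so the Proc is optional.
    value = xpath ? FAKE_PAGE.fetch(xpath, '').strip : ''
    value = post_processor.call(value) if post_processor
    attrs[name] = value
  end
  attrs
end

result = extract([
  [:level_token, 'table[1]/tr[2]/td/a/text()', lambda { |x| x.split(':')[0] }],
  [:level_path,  'table[1]/tr[2]/td/a/@href']
])
```

Here result holds one value per key: the token before the colon for :level_token, and the untouched href for :level_path.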

#get_all(klass, path, attr_infos, table_num = 0) ⇒ Array<klass>, ...

Gets all elements from a table within a page, with optional post-processing. The first row of the table is treated as a header and skipped. The results are returned as an array of objects if a klass is specified, or otherwise as an array of key-value pairs. If more than one page of results is found, the first page of results is returned as a PaginatedArray.

Examples:

Return an array of all SemanticFields in Pollex

Scraper.instance.get_all(SemanticField, "/category/", [
  [:id, 'td[1]/a/text()'],
  [:path, 'td[1]/a/@href'],
  [:name, 'td[2]/a/text()'],
  [:count, 'td[3]/text()']
])

Parameters:

  • klass (Class)

    class of objects to be instantiated from each row; pass nil to get the raw key-value pairs instead

  • path (String)

    relative path from http://pollex.org.nz

  • attr_infos (Array<Array<Symbol, String, (Proc, nil)>>)

    an array that, for each element to be scraped, contains an array of:

    • a key for the element

    • the XPath to the element, from a given table

    • (optionally) a Proc to be performed on the element’s contents

  • table_num (Integer) (defaults to: 0)

    the zero-based index of the table on the page to process (0 is the first table on the page)

Returns:

  • (Array<klass>)

    if one page of results was found

  • (PaginatedArray<klass>)

    if multiple pages of results were found

  • (Array<Array<Symbol, String>>)

    if no klass is specified



# File 'lib/pollex/scraper.rb', line 87

def get_all(klass, path, attr_infos, table_num = 0)
  page = open_with_cache(path)

  rows = page.css('table')[table_num].css('tr')
  objs = rows[1..-1].map do |row|
    attrs = {}
    attr_infos.each do |name, xpath, post_processor|
      attrs[name] = ''
      if xpath
        attrs[name] = row.at_xpath(xpath).to_s.strip
      end
      if post_processor
        attrs[name] = post_processor.call(attrs[name])
      end
    end
    attrs
  end

  # check if there is a "next" page
  last_link = page.css('.pagination a').last()
  if last_link and last_link.text()[0..3] == 'Next'
    last_link_path = last_link.attributes()['href']
    new_path = path.split('?')[0] + last_link_path

    results = PaginatedArray.new()
    results.query = {:klass => klass, :attr_infos => attr_infos, :table_num => table_num}
    results.next_page = new_path
    results.concat(objs.to_a) # merge rather than create new array
  else
    results = objs
  end

  if klass
    results.map! {|x| klass.new(x) }
  end

  results
end
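
The "next" page handling above comes down to two string operations, sketched here in isolation. next_link? and next_page_path are hypothetical helper names for this example, and the link text and the '?page=N' href format are assumptions about what the site's pagination links look like.

```ruby
# True when the link text starts with "Next", as in the check inside #get_all.
def next_link?(link_text)
  link_text[0..3] == 'Next'
end

# Drops any query string from the current path, then appends the link's href.
def next_page_path(current_path, next_href)
  current_path.split('?')[0] + next_href
end

from_first_page  = next_page_path('/category/', '?page=2')
from_second_page = next_page_path('/category/?page=2', '?page=3')
```

Stripping the old query string first keeps the path from accumulating one '?page=N' suffix per page as results are followed.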

#open_with_cache(path) ⇒ Nokogiri::HTML::Document

Opens the given Pollex page, either by retrieving it from the cache or by fetching it over HTTP, parsing it with Nokogiri, and then storing it in the cache.

Parameters:

  • path (String)

    relative path from http://pollex.org.nz

Returns:

  • (Nokogiri::HTML::Document)

    the requested page, parsed with Nokogiri



# File 'lib/pollex/scraper.rb', line 18

def open_with_cache(path)
  if @cache[path]
    if @verbose
      puts "Opening cached contents of http://pollex.org.nz#{path} ..."
    end
    @cache[path]
  else
    if @verbose
      puts "Connecting to http://pollex.org.nz#{path} ..."
    end
    page = Nokogiri::HTML(open("http://pollex.org.nz#{path}"))
    @cache[path] = page
    page
  end
end
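
The cache-or-fetch pattern above can be checked without touching the network. This is a minimal sketch, assuming a plain Hash in place of the LRU cache and a caller-supplied block in place of the HTTP request and Nokogiri parse. CachedFetcher and fetch_count are hypothetical names for this example.

```ruby
# Hypothetical illustration of cache-or-fetch; not the Scraper implementation.
class CachedFetcher
  attr_reader :fetch_count

  # The block stands in for the HTTP request plus parsing step.
  def initialize(&fetcher)
    @cache = {}
    @fetcher = fetcher
    @fetch_count = 0
  end

  def open_with_cache(path)
    # Cache hit: return the stored page without calling the fetcher.
    return @cache[path] if @cache[path]

    # Cache miss: fetch once, remember the result, and return it.
    @fetch_count += 1
    @cache[path] = @fetcher.call(path)
  end
end

fetcher = CachedFetcher.new { |path| "<html>#{path}</html>" }
first   = fetcher.open_with_cache('/category/')
second  = fetcher.open_with_cache('/category/')
```

The second call returns the very same object as the first, and the fetcher block runs only once.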