Class: Pollex::Scraper
- Inherits:
- Object
  - Object
  - Pollex::Scraper
- Includes:
- Singleton
- Defined in:
- lib/pollex/scraper.rb
Overview
Singleton object for scraping Pollex, caching the results, and extracting data.
Instance Attribute Summary
- #verbose ⇒ Object
  Returns the value of attribute verbose.
Instance Method Summary
- #get(path, attr_infos) ⇒ Array<Symbol, String>
  Gets arbitrary data from a page, with optional post-processing.
- #get_all(klass, path, attr_infos, table_num = 0) ⇒ Array<klass>, ...
  Gets all elements from a table within a page, with optional post-processing.
- #initialize ⇒ Scraper (constructor)
  Instantiates a cache of size 100 for storing scraped pages.
- #open_with_cache(path) ⇒ Nokogiri::HTML::Document
  Opens the given Pollex page, either by retrieving it from the cache or by making a request with Nokogiri and then storing it in the cache.
Constructor Details
#initialize ⇒ Scraper
Instantiates a cache of size 100 for storing scraped pages.
# File 'lib/pollex/scraper.rb', line 9
def initialize()
  @cache = LRUCache.new(:max_size => 100, :default => nil)
  @verbose = false
end
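To illustrate what a size-limited cache means for scraped pages, here is a minimal sketch of least-recently-used eviction. `TinyLRU` is a hypothetical stand-in written for this example, not the `LRUCache` class the scraper uses, and the paths are made up; a capacity of 2 is used instead of 100 so the eviction is easy to follow.

```ruby
# Hypothetical stand-in for the LRUCache used by Pollex::Scraper.
# It only illustrates eviction; it is not the library's implementation.
class TinyLRU
  def initialize(max_size)
    @max_size = max_size
    @store = {} # Ruby hashes preserve insertion order
  end

  # Returns the cached value (or nil) and marks the key as most recently used.
  def [](key)
    return nil unless @store.key?(key)
    value = @store.delete(key)
    @store[key] = value
  end

  # Stores a value, evicting the least recently used entry when the cache is full.
  def []=(key, value)
    @store.delete(key)
    @store.shift if @store.size >= @max_size
    @store[key] = value
  end

  def keys
    @store.keys
  end
end

cache = TinyLRU.new(2)
cache['/a'] = 'page a'
cache['/b'] = 'page b'
cache['/a']            # touch /a so that /b becomes the oldest entry
cache['/c'] = 'page c' # the cache is full, so /b is evicted
```

With a real limit of 100, the same idea applies: once 100 pages are cached, opening another page pushes out the one that has gone unused the longest.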
Instance Attribute Details
#verbose ⇒ Object
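Returns the value of attribute verbose. When it is true, open_with_cache prints a message for each cache hit and each network request; the constructor sets it to false.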
# File 'lib/pollex/scraper.rb', line 6
def verbose
  @verbose
end
Instance Method Details
#get(path, attr_infos) ⇒ Array<Symbol, String>
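Gets arbitrary data from the #content element of a page, with optional post-processing. Each entry of attr_infos is a [name, xpath, post_processor] triple: the XPath result is converted to a stripped string, the post-processor (if any) is applied to it, and the value is stored under name in the returned hash.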
# File 'lib/pollex/scraper.rb', line 47
def get(path, attr_infos)
  page = open_with_cache(path)
  contents = page.css('#content')

  attrs = {}
  attr_infos.each do |name, xpath, post_processor|
    attrs[name] = ''
    if xpath
      attrs[name] = contents.at_xpath(xpath).to_s.strip
    end
    if post_processor
      attrs[name] = post_processor.call(attrs[name])
    end
  end
  attrs
end
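The shape of `attr_infos` is easiest to see with a small example. The sketch below repeats the loop from `#get` but replaces the Nokogiri lookup with a plain lambda, so it runs without a network connection. The helper `extract_attrs`, the `fake_page` hash, and the XPath strings are all invented for illustration and are not part of Pollex.

```ruby
# Each entry is [name, xpath, post_processor]; xpath and post_processor may be nil.
# `lookup` is a hypothetical stand-in for contents.at_xpath(xpath).
def extract_attrs(attr_infos, lookup)
  attrs = {}
  attr_infos.each do |name, xpath, post_processor|
    attrs[name] = ''
    attrs[name] = lookup.call(xpath).to_s.strip if xpath
    attrs[name] = post_processor.call(attrs[name]) if post_processor
  end
  attrs
end

# Made-up page fragments keyed by made-up XPath expressions.
fake_page = {
  'h1/text()'   => "  Example heading \n",
  'p[1]/text()' => ' 42 '
}
lookup = ->(xpath) { fake_page[xpath] }

attr_infos = [
  [:title,  'h1/text()',   nil],                 # plain extraction
  [:count,  'p[1]/text()', ->(s) { s.to_i }],    # extraction plus conversion
  [:source, nil,           ->(_) { 'pollex' }]   # no XPath: value comes from the lambda
]

result = extract_attrs(attr_infos, lookup)
```

An entry with a nil XPath starts from the empty string, so its post-processor can supply a constant or derived value, as the `:source` entry does here.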
#get_all(klass, path, attr_infos, table_num = 0) ⇒ Array<klass>, ...
Gets all elements from a table within a page, with optional post-processing. The header row of the table is skipped. The results are returned either as an array of hashes (one per row) or, if a klass is specified, as an array of klass objects built from those hashes. If more than one page of results is found, the first page of results is returned as a PaginatedArray.
# File 'lib/pollex/scraper.rb', line 87
def get_all(klass, path, attr_infos, table_num = 0)
  page = open_with_cache(path)

  rows = page.css('table')[table_num].css('tr')
  objs = rows[1..-1].map do |row|
    attrs = {}
    attr_infos.each do |name, xpath, post_processor|
      attrs[name] = ''
      if xpath
        attrs[name] = row.at_xpath(xpath).to_s.strip
      end
      if post_processor
        attrs[name] = post_processor.call(attrs[name])
      end
    end
    attrs
  end

  # check if there is a "next" page
  last_link = page.css('.pagination a').last()
  if last_link and last_link.text()[0..3] == 'Next'
    last_link_path = last_link.attributes()['href']
    new_path = path.split('?')[0] + last_link_path

    results = PaginatedArray.new()
    results.query = {:klass => klass, :attr_infos => attr_infos, :table_num => table_num}
    results.next_page = new_path
    results.concat(objs.to_a) # merge rather than create new array
  else
    results = objs
  end

  if klass
    results.map! {|x| klass.new(x) }
  end

  results
end
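The pagination step above can be isolated into two small string operations. The sketch below assumes, as the source code implies, that the href of the "Next" link is a query string relative to the current path. The helper names `next_link?` and `next_page_path`, the link text, and the example paths are hypothetical.

```ruby
# True when the link text starts with "Next", matching the check in get_all.
def next_link?(text)
  text[0..3] == 'Next'
end

# Builds the path of the next results page the same way get_all does:
# drop the current query string, then append the href of the "Next" link.
def next_page_path(path, next_href)
  path.split('?')[0] + next_href
end

# The paths and hrefs here are invented for illustration.
first = next_page_path('/entry/', '?page=2')
later = next_page_path('/entry/?page=2', '?page=3')
```

Dropping the existing query string before appending the href keeps the path from accumulating a query string for every page visited.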
#open_with_cache(path) ⇒ Nokogiri::HTML::Document
Opens the given Pollex page, either by retrieving it from the cache or by making a request with Nokogiri and then storing it in the cache.
# File 'lib/pollex/scraper.rb', line 18
def open_with_cache(path)
  if @cache[path]
    if @verbose
      puts "Opening cached contents of http://pollex.org.nz#{path} ..."
    end
    @cache[path]
  else
    if @verbose
      puts "Connecting to http://pollex.org.nz#{path} ..."
    end
    page = Nokogiri::HTML(open("http://pollex.org.nz#{path}"))
    @cache[path] = page
    page
  end
end
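The effect of the cache lookup is that repeated requests for the same path trigger only one fetch. The sketch below shows this with a hypothetical `CachedOpener` class whose `fetcher` lambda stands in for the Nokogiri request; it uses a plain Hash rather than the size-limited cache, and the path is made up.

```ruby
# Hypothetical helper: `fetcher` stands in for the Nokogiri request.
class CachedOpener
  attr_reader :requests

  def initialize(fetcher)
    @cache = {}        # plain Hash instead of the size-limited LRUCache
    @fetcher = fetcher
    @requests = 0      # counts how often the fetcher is actually called
  end

  # Returns the cached page if present; otherwise fetches and stores it.
  def open_with_cache(path)
    return @cache[path] if @cache[path]
    @requests += 1
    @cache[path] = @fetcher.call(path)
  end
end

opener = CachedOpener.new(->(path) { "<html>#{path}</html>" })
first  = opener.open_with_cache('/entry/')
second = opener.open_with_cache('/entry/') # served from the cache
```

Because the second call returns the stored object itself, callers share one parsed page per path for as long as it stays in the cache.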