Class: ExtraLoop::IterativeScraper
- Inherits: ScraperBase
- Ancestors: Object → ScraperBase → ExtraLoop::IterativeScraper
- Defined in:
- lib/extraloop/iterative_scraper.rb
Defined Under Namespace
Modules: Exceptions
Instance Attribute Summary
Attributes inherited from ScraperBase
Instance Method Summary
-
#continue_with(param, *extractor_args, &block) ⇒ Object
Public.
-
#initialize(urls, options = {}, arguments = {}) ⇒ IterativeScraper
constructor
Public.
- #run ⇒ Object
-
#set_iteration(param, *args, &block) ⇒ Object
Public.
Methods inherited from ScraperBase
#base_initialize, #extract, #loop_on
Methods included from Utils::Support
Methods included from Hookable
Constructor Details
#initialize(urls, options = {}, arguments = {}) ⇒ IterativeScraper
Public
Initializes an iterative scraper (i.e. a scraper which can extract data from a list of several web pages).
urls - A single url, or an array of several urls. options - A hash of scraper options (optional).
async : Whether the scraper should issue HTTP requests asynchronously (defaults to false).
log : Logging options (set to false to completely suppress logging).
hydra : A list of arguments to be passed in when initializing the HTTP queue (see Typhoeus::Hydra).
arguments - Hash of arguments to be passed to the Typhoeus HTTP client (optional).
Examples:
# Iterates over the first 10 pages of Google News search results for the query 'Egypt'.
IterativeScraper.new("www.google.com/search?tbm=nws&q=Egypt", :log => {
  :appenders => [ 'example.log', :stderr ],
  :log_level => :debug
}).set_iteration(:start, (1..101).step(10))

# Iterates over the first 10 pages of Google News search results for the query 'Egypt'
# first, and then for the query 'Syria', issuing HTTP requests asynchronously, and
# ignoring SSL certificate verification.
IterativeScraper.new([
  "https://www.google.com/search?tbm=nws&q=Egypt",
  "https://www.google.com/search?tbm=nws&q=Syria"
], { :async => true }, { :disable_ssl_peer_verification => true
}).set_iteration(:start, (1..101).step(10))
Returns itself.
# File 'lib/extraloop/iterative_scraper.rb', line 43

def initialize(urls, options = {}, arguments = {})
  super([], options, arguments)
  @base_urls = Array(urls)
  @iteration_set = []
  @iteration_extractor = nil
  @iteration_extractor_args = nil
  @iteration_count = 0
  @iteration_param = nil
  @iteration_param_value = nil
  @continue_clause_args = nil
  self
end
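As the constructor shows, a single URL string and an array of URLs are both accepted because `Array(urls)` normalizes either form. A minimal plain-Ruby illustration of that normalization (the helper name is ours, not part of the gem):

```ruby
# Array() wraps a single value in an array and passes an array through
# unchanged, which is how @base_urls accepts both input forms.
def normalize_urls(urls)
  Array(urls)
end

normalize_urls("http://example.com/a")
# => ["http://example.com/a"]
normalize_urls(["http://example.com/a", "http://example.com/b"])
# => ["http://example.com/a", "http://example.com/b"]
```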
Instance Method Details
#continue_with(param, *extractor_args, &block) ⇒ Object
Public
Builds an extractor and uses it to set the value of the next iteration’s offset parameter. If the extractor returns nil, the iteration stops.
param - A symbol identifying the iteration parameter name. extractor_args - Arguments to be passed to the extractor which will be used to evaluate the continue value.
Returns itself.
# File 'lib/extraloop/iterative_scraper.rb', line 113

def continue_with(param, *extractor_args, &block)
  extractor_args << block if block
  raise Exceptions::NonGetAsyncRequestNotYetImplemented.new "the #continue_with method currently requires the 'async' option to be set to false" if @options[:async]
  @continue_clause_args = extractor_args
  set_iteration_param(param)
  self
end
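The pattern behind #continue_with can be sketched without the gem: after each response, an extractor computes the next value of the offset parameter, and a nil result stops the loop. The stubbed pages and helper names below are illustrative assumptions, not ExtraLoop API:

```ruby
# Stub pages stand in for real HTTP responses; :next_offset plays the
# role of the value the continue-clause extractor would scrape out.
PAGES = {
  0  => { :items => %w[a b], :next_offset => 10 },
  10 => { :items => %w[c d], :next_offset => 20 },
  20 => { :items => %w[e],   :next_offset => nil } # last page
}

def scrape_with_continue(extractor)
  offset = 0
  items  = []
  while offset
    page = PAGES.fetch(offset)
    items.concat(page[:items])
    offset = extractor.call(page) # nil ends the iteration
  end
  items
end

results = scrape_with_continue(proc { |page| page[:next_offset] })
# results == ["a", "b", "c", "d", "e"]
```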
#run ⇒ Object
# File 'lib/extraloop/iterative_scraper.rb', line 122

def run
  @base_urls.each do |base_url|
    # Run an extra iteration to determine the value of the next offset parameter
    # (if #continue_with is used) or the entire iteration set (if #set_iteration is used).
    (run_iteration(base_url); @iteration_count += 1) if @iteration_extractor_args || @continue_clause_args

    while @iteration_set.at(@iteration_count)
      method = @options[:async] ? :run_iteration_async : :run_iteration
      send(method, base_url)
      @iteration_count += 1
    end

    # Reset all counts
    @queued_count = 0
    @response_count = 0
    @iteration_count = 0
  end
  self
end
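The control flow of #run for the extractor-based case can be simulated in plain Ruby: a priming iteration fetches the base URL and fills the iteration set, then the while loop walks the remaining values. FakeScraper and its hard-coded page numbers are assumptions for illustration only:

```ruby
# Simulates #run: one priming iteration populates @iteration_set,
# then the loop consumes it starting from the incremented count
# (the first page was already fetched during priming).
class FakeScraper
  attr_reader :visited

  def initialize(base_url)
    @base_url = base_url
    @iteration_set = []
    @iteration_count = 0
    @visited = []
  end

  # Stands in for the first run_iteration call, which would evaluate
  # the pagination extractor against the fetched document.
  def priming_iteration
    @visited << @base_url
    @iteration_set = %w[1 2 3]
    @iteration_count += 1
  end

  def run
    priming_iteration
    while (value = @iteration_set.at(@iteration_count))
      @visited << "#{@base_url}?p=#{value}"
      @iteration_count += 1
    end
    self
  end
end

scraper = FakeScraper.new("http://my-site.com/events").run
# scraper.visited ==
#   ["http://my-site.com/events",
#    "http://my-site.com/events?p=2",
#    "http://my-site.com/events?p=3"]
```

Note how incrementing the count during the priming iteration skips the first value in the set, since that page has already been requested.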
#set_iteration(param, *args, &block) ⇒ Object
Public
Specifies the collection of values over which the scraper should iterate. At each iteration, the current value in the iteration set will be included as part of the request parameters.
param - The name of the iteration parameter. args - Either an array of values, or a set of arguments used to initialize an Extractor object.
Examples:
# Explicitly specify the iteration set (can be either a range or an array).
IterativeScraper.new("http://my-site.com/events").
  set_iteration(:p, 1..10)
# Pass in a code block to dynamically extract the iteration set from the document.
# The code block will be used to generate an Extractor that runs at the first
# iteration. The iteration stops as soon as the proc returns an empty set
# of values.
fetch_page_numbers = proc { |elements|
  elements.map { |a|
    a.attr(:href).match(/p=(\d+)/)
    $1
  }.reject { |p| p == "1" } # the mapped values are strings, so compare against "1"
}
IterativeScraper.new("http://my-site.com/events").
set_iteration(:p, "div#pagination a", fetch_page_numbers)
Returns itself.
# File 'lib/extraloop/iterative_scraper.rb', line 92

def set_iteration(param, *args, &block)
  args << block if block
  if args.first.respond_to?(:map)
    @iteration_set = Array(args.first).map(&:to_s)
  else
    @iteration_extractor_args = [:pagination, *args]
  end
  set_iteration_param(param)
  self
end
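The dispatch in #set_iteration hinges on `args.first.respond_to?(:map)`: an enumerable first argument (a Range or Array) becomes the explicit iteration set, stringified, while anything else (such as a CSS selector string) is stored as arguments for a :pagination extractor. A standalone sketch of that branch, with a helper name of our own choosing:

```ruby
# Mirrors #set_iteration's branch: enumerables become the iteration set,
# other arguments are packed up for an extractor. A plain String does not
# respond to :map, so selector strings fall into the else branch.
def classify_iteration_args(*args)
  if args.first.respond_to?(:map)
    { :iteration_set => Array(args.first).map(&:to_s) }
  else
    { :extractor_args => [:pagination, *args] }
  end
end

classify_iteration_args(1..3)
# => {:iteration_set=>["1", "2", "3"]}
classify_iteration_args("div#pagination a")
# => {:extractor_args=>[:pagination, "div#pagination a"]}
```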